AWS Machine Learning Blog

Securing HAQM Comprehend API calls with AWS PrivateLink

HAQM Comprehend now supports HAQM Virtual Private Cloud (HAQM VPC) endpoints via AWS PrivateLink so you can securely initiate API calls to HAQM Comprehend from within your VPC and avoid using the public internet.

HAQM Comprehend is a fully managed natural language processing (NLP) service that uses machine learning (ML) to find meaning and insights in text. You can use HAQM Comprehend to analyze text documents and identify insights such as sentiment, people, brands, places, and topics in text. No ML expertise required.

Using AWS PrivateLink, you can access HAQM Comprehend easily and securely by keeping your network traffic within the AWS network, while significantly simplifying your internal network architecture. It enables you to privately access HAQM Comprehend APIs from your VPC in a scalable manner by using interface VPC endpoints. A VPC endpoint is an elastic network interface in your subnet with a private IP address that serves as the entry point for all HAQM Comprehend API calls.

In this post, we show you how to set up a VPC endpoint and enforce the use of this private connectivity for all requests to HAQM Comprehend using AWS Identity and Access Management (IAM) policies.

Prerequisites

For this example, you should have an AWS account and sufficient access to create resources in the following services:

Solution overview

The walkthrough includes the following high-level steps:

  1. Deploy your resources.
  2. Create VPC endpoints.
  3. Enforce private connectivity with IAM.
  4. Use HAQM Comprehend via AWS PrivateLink.

Deploying your resources

For your convenience, we have supplied an AWS CloudFormation template to automate the creation of all prerequisite AWS resources. We use the us-east-2 Region in this post, so the console and URLs may differ depending on the Region you select. To use this template, complete the following steps:

  1. Choose Launch Stack:
  2. Confirm the following parameters, which you can leave at the default values:
    1. SubnetCidrBlock1 – The primary IPv4 CIDR block assigned to the first subnet. The default value is 10.0.1.0/24.
    2. SubnetCidrBlock2 – The primary IPv4 CIDR block assigned to the second subnet. The default value is 10.0.2.0/24.
  3. Acknowledge that AWS CloudFormation may create additional IAM resources.
  4. Choose Create stack.

The creation process should take roughly 10 minutes to complete.

The CloudFormation template creates the following resources on your behalf:

  • A VPC with two private subnets in separate Availability Zones
  • VPC endpoints for private HAQM S3 and HAQM Comprehend API access
  • IAM roles for use by Lambda and HAQM Comprehend
  • An IAM policy to enforce the use of VPC endpoints to interact with HAQM Comprehend
  • An IAM policy for HAQM Comprehend to access data in HAQM S3
  • An S3 bucket for storing open-source data

The next two sections detail how to manually create a VPC endpoint for HAQM Comprehend and enforce usage with an IAM policy. If you deployed the CloudFormation template and prefer to skip to testing the API calls, you can advance to the Using HAQM Comprehend via AWS PrivateLink section.

Creating VPC endpoints

To create a VPC endpoint, complete the following steps:

  1. On the HAQM VPC console, choose Endpoints.
  2. Choose Create Endpoint.
  3. For Service category, select AWS services.
  4. For Service Name, choose amazonaws.us-east-2.comprehend.
  5. For VPC, enter the VPC you want to use.
  6. For Availability Zone, select your preferred Availability Zones.
  7. For Enable DNS name, select Enable for this endpoint.

This creates a private hosted zone that enables you to access the resources in your VPC using custom DNS domain names, such as example.com, instead of using private IPv4 addresses or private DNS hostnames provided by AWS. The HAQM Comprehend DNS hostname that the AWS Command Line Interface (CLI) and HAQM Comprehend SDKs use by default (http://comprehend.Region.amazonaws.com) resolves to your VPC endpoint.

  1. For Security group, choose the security group to associate with the endpoint network interface.

If you don’t specify a security group, the default security group for your VPC is associated.

  1. Choose Create Endpoint.

When the Status changes to available, your VPC endpoint is ready for use.

  1. Choose the Policy tab to apply more restrictive access control to the VPC endpoint.

The following example policy limits VPC endpoint access to an IAM role used by a Lambda function in our deployment. You should apply the principle of least privilege when defining your own policy. For more information, see Controlling access to services with VPC endpoints.

{                        
    "Version": "2012-10-17",
    "Statement": [    
        {                
            "Action": [
                "comprehend:DetectEntities",
                "comprehend:CreateDocumentClassifier"
            ],           
            "Resource": [
                "*"  
            ],           
            "Effect": "Allow",
            "Principal": {
                "AWS": [
"arn:aws:iam::#########:role/ComprehendPrivateLink-LambdaExecutionRole"                                         
                ]    
            }            
        }                
    ]                    
}

Enforcing private connectivity with IAM

To allow or deny access to HAQM Comprehend based on the use of a VPC endpoint, we include an aws:sourceVpce condition in the IAM policy. The following example policy provides access specifically to the DetectEntities and CreateDocumentClassifier APIs only when the request utilizes your VPC endpoint. You can include additional HAQM Comprehend APIs in the “Action” section of the policy or use “comprehend:*” to include them all. You can attach this policy to an IAM role to enable compute resources hosted within your VPC to interact with HAQM Comprehend.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ComprehendEnforceVpce",
            "Effect": "Allow",
            "Action": [
                "comprehend:CreateDocumentClassifier",
                "comprehend:DetectEntities"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceVpce": "vpce-xxxxxxxx"
                }
            }
        },
        {
            "Sid": "PassRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::#########:role/ComprehendDataAccessRole"
        }
    ]
}

You should replace the VPC endpoint ID with the endpoint ID you created earlier. Permission to invoke the PassRole API is required for asynchronous operations in HAQM Comprehend like CreateDocumentClassifer and should be scoped to your specific data access role.

Using HAQM Comprehend via AWS PrivateLink

To start using HAQM Comprehend with AWS PrivateLink, you perform the following high-level steps:

  1. Review the Lambda function for API testing.
  2. Create the DetectEntities test event.
  3. Train a custom classifier.

Reviewing the Lambda function

To review your Lambda function, on the Lambda console, choose the Lambda function that contains ComprehendPrivateLink in its name.

The VPC section of the Lambda console provides links to the various networking components automatically created for you during the CloudFormation deployment.

The function code includes a sample program that takes user input to invoke the specific HAQM Comprehend APIs supported by our example IAM policy.

Creating a test event

In this section, we create an event to detect entities within sample text using a pretrained model.

  1. From the Test drop-down menu, choose Create new test event.
  2. For Event name, enter a name (for example, DetectEntities).
  3. Replace the event JSON with the following code:
    {
      "comprehend_api": "DetectEntities",
      "language_code": "en",
      "text": "HAQM.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 by Jeff Bezos, allowing customers to buy everything from books to blenders."
    }
  4. Choose Save to store the test event.
  5. Choose Save to update the Lambda function.
  6. Choose Test to invoke the DetectEntities API.

The response should include results similar to the following code:

{
    "Entities": [
        {
            "Score": 0.9266431927680969,
            "Type": "ORGANIZATION",
            "Text": "HAQM.com, Inc.",
            "BeginOffset": 0,
            "EndOffset": 16
        },
        {
            "Score": 0.9952651262283325,
            "Type": "LOCATION",
            "Text": "Seattle, WA",
            "BeginOffset": 31,
            "EndOffset": 42
        },
        {
            "Score": 0.9998188018798828,
            "Type": "DATE",
            "Text": "July 5th, 1994",
            "BeginOffset": 59,
            "EndOffset": 73
        },
        {
            "Score": 0.9999810457229614,
            "Type": "PERSON",
            "Text": "Jeff Bezos",
            "BeginOffset": 77,
            "EndOffset": 87
        }
    ]
}

You can update the test event to identify entities from your own text.

Training a custom classifier

We now demonstrate how to build a custom classifier. For training data, we use a version of the Yahoo answers corpus that is preprocessed into the format expected by HAQM Comprehend. This corpus, available on the AWS Open Data Registry, is cited in the paper Text Understanding from Scratch by Xiang Zhang and Yann LeCun. It is also used in the post Building a custom classifier using HAQM Comprehend.

  1. Retrieve the training data from HAQM S3.
  2. On the HAQM S3 console, choose the example S3 bucket created for you.
  3. Choose Upload and add the file you retrieved.
  4. Choose the uploaded object and note the Key.
  5. Return to the test function on the Lambda console.
  6. From the Test drop-down menu, choose Create new test event.
  7. For Event name, enter a name (for example, TrainCustomClassifier).
  8. Replace the event input with the following code:
    {
      "comprehend_api": "CreateDocumentClassifier",
      "custom_classifier_name": "custom-classifier-example",
      "language_code": "en",
      "training_data_s3_key": "comprehend-train.csv"
    }
  9. If you changed the default file name, update the training_data_s3_key to match.
  10. Choose Save to store the test event.
  11. Choose Save to update the Lambda function.
  12. Choose Test to invoke the CreateDocumentClassifier API.

The response should include results similar to the following code:

{
"DocumentClassifierArn": "arn:aws:comprehend:us-east-2:0123456789:document-classifier/custom-classifier-example"
}
  1. On the HAQM Comprehend console, choose Custom classification to check the status of the document classifier training.

After approximately 20 minutes, the document classifier is trained and available for use.

Cleaning Up

To avoid incurring future charges, delete the resources you created during this walkthrough after concluding your testing.

  1. On the HAQM Comprehend console, delete the custom classifier.
  2. On the HAQM S3 console, empty the bucket created for you.
  3. If you launched the automated deployment, on the AWS CloudFormation console, delete the appropriate stack.

The deletion process takes approximately 10 minutes.

Conclusion

You have now successfully invoked HAQM Comprehend APIs using AWS PrivateLink. The use of IAM policies prevents requests from leaving your VPC and further improves your security posture. You can extend this solution to securely test additional features like HAQM Comprehend custom entity recognition real-time endpoints.

All HAQM Comprehend API calls are now supported via AWS PrivateLink. This feature exists in all commercial Regions where AWS PrivateLink and HAQM Comprehend are available. To learn more about securing HAQM Comprehend, see Security in HAQM Comprehend.


About the Authors

Dave Williams is a Cloud Consultant for AWS Professional Services. He works with public sector customers to securely adopt AI/ML services. In his free time, he enjoys spending time with his family, traveling, and watching college football.

 

 

 

Adarsha Subick is a Cloud Consultant for AWS Professional Services based out of Virginia. He works with public sector customers to help solve their AI/ML-focused business problems. In his free time, he enjoys archery and hobby electronics.

 

 

 

Saman Zarandioon is a Sr. Software Development Engineer for HAQM Comprehend. He earned a PhD in Computer Science from Rutgers University.