AWS Machine Learning Blog

Using HAQM Textract with AWS PrivateLink

HAQM Textract now supports HAQM Virtual Private Cloud (HAQM VPC) endpoints via AWS PrivateLink so you can securely initiate API calls to HAQM Textract from within your VPC and avoid using the public internet.

In this post, we show you how to access HAQM Textract APIs from within your VPC without traversing the public internet, and how to use VPC endpoint policies to restrict access to HAQM Textract.

HAQM Textract is a fully managed machine learning (ML) service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

You can use AWS PrivateLink to access HAQM Textract securely by keeping your network traffic within the AWS network, while simplifying your internal network architecture. It enables you to privately access HAQM Textract APIs from your VPC in a scalable manner by using interface VPC endpoints. A VPC endpoint is an elastic network interface in your subnet with a private IP address that serves as the entry point for all HAQM Textract API calls. A VPC endpoint enables you to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC don’t require public IP addresses to communicate with resources in the service. Traffic between your VPC and the other service doesn’t leave the AWS network.

The following diagram illustrates the solution architecture.

Prerequisites

To get started, you need to have a VPC set up in the AWS Region of your choice. For instructions, see Getting started with HAQM VPC. In this post, we use the us-east-2 Region. You should also have an AWS account with sufficient access to create resources in the following services:

  • HAQM Textract
  • AWS PrivateLink

Solution overview

The walkthrough includes the following high-level steps:

  1. Create VPC endpoints.
  2. Use HAQM Textract via AWS PrivateLink.

Creating VPC endpoints

To create a VPC endpoint, complete the following steps. We use the us-east-2 Region in this post, so the console and URLs may differ depending on the Region you choose.

  1. On the HAQM VPC console, choose Endpoints.
  2. Choose Create Endpoint.
  3. For Service category, select AWS services.
  4. For Service Name, choose amazonaws.us-east-2-textract or com.amazonaws.us-east-2.textract-fips.
  5. For VPC, enter the VPC you want to use.
  6. For Availability Zone, select your preferred Availability Zones.
  7. For Enable DNS name, select Enable for this endpoint.

This creates a private hosted zone that enables you to access the resources in your VPC using custom DNS domain names, such as example.com, instead of using private IPv4 addresses or private DNS hostnames provided by AWS. The HAQM Textract DNS hostname that the AWS Command Line Interface (AWS CLI) and HAQM Textract SDKs use by default (http://textract.Region.amazonaws.com) resolves to your VPC endpoint.

  1. For Security group, choose the security group to associate with the endpoint network interface.

If you don’t specify a security group, the default security group for your VPC is associated.

  1. Choose Create Endpoint.

When the Status changes to available, your VPC endpoint is ready for use.

  1. Choose the Policy tab to apply more restrictive access control to the VPC endpoint.

The following example policy limits VPC endpoint access to only the DetectDocumentText API. An IAM principal, even with access to all Textract APIs, can still only access the specific API in the following policy using this VPC endpoint. This is an additional layer of access control applied at the VPC endpoint. You should apply the principle of least privilege when defining your own policy. For more information, see Controlling access to services with VPC endpoints.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "textract:DetectDocumentText"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow",
            "Principal": "*"
        }
    ]
}

Now that you have set up your VPC endpoint, the following section shows you how to access HAQM Textract APIs from within that VPC using AWS PrivateLink.

Accessing HAQM Textract APIs via AWS PrivateLink

After you set up the relevant VPC endpoint policies, you have two options to configure endpoints in order to access HAQM Textract APIs:

The following code is an example AWS CLI command to run from within the VPC:

$ aws textract detect-document-text --document '{"S3Object":{"Bucket":"textract-test-bucket","Name":"example-doc.jpg"}}' --region us-east-2
  • You can also use the DNS name that was generated when creating the VPC endpoint. These DNS names are in the form of *.us-east-2.vpce.amazonaws.com or *.textract-fips.us-east-2.vpce.amazonaws.com. For example: vpce-0f1aa01f0ce676709-il663k5n.textract.us-east-2.vpce.amazonaws.com.

The following code is an example AWS CLI command to run from within the VPC:

aws textract detect-document-text --document '{"S3Object":{"Bucket":"textract-test-bucket","Name":"example-doc.jpg"}}' --region us-east-2 --endpoint http://vpce-05e9d346575f9cb38-1wdh6mi2.textract.us-east-2.vpce.amazonaws.com

Conclusion

You now have successfully configured a VPC endpoint for HAQM Textract in your AWS account. Traffic to HAQM Textract APIs from that VPC endpoint are only within the AWS network. The VPC endpoint policy you configured further allows you to restrict which HAQM Textract APIs are accessible from within that VPC.


About the Author

Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.

 

 

 

Thomas joined HAQM Web Services in 2016 initially working on Application Auto Scaling before moving into this current role at Textract. Before joining AWS, he worked in engineering roles in the domains of computer graphics and networking. Thomas holds a master’s degree in engineering from the university of Leuven in Belgium.