AWS Machine Learning Blog

Store output in custom HAQM S3 bucket and encrypt using AWS KMS for multi-page document processing with HAQM Textract

HAQM Textract is a fully managed machine learning (ML) service that makes it easy to process documents at scale by automatically extracting printed text, handwriting, and other data from virtually any type of document. HAQM Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. This enables businesses across many industries, including financial, medical, legal, and real estate, to easily process large numbers of documents for different business operations. Healthcare providers, for example, can use HAQM Textract to extract patient information from an insurance claim or values from a table in a scanned medical chart without requiring customization or human intervention. The blog post Automatically extract text and structured data from documents with HAQM Textract shows how to use HAQM Textract to automatically extract text and data from scanned documents without any machine learning (ML) experience.

HAQM Textract provides both synchronous and asynchronous API actions to extract document text and analyze the document text data. You can use synchronous APIs for single-page documents and low latency use cases such as mobile capture. Asynchronous APIs can process single-page or multi-page documents such as PDF documents with thousands of pages.

In this post, we show how to control the output location and the AWS Key Management Service (AWS KMS) key used to encrypt the output data when you use the HAQM Textract asynchronous API.

HAQM Textract asynchronous API

HAQM Textract provides asynchronous APIs to extract text and structured data in single-page (jpeg, png, pdf) or multi-page documents that are in PDF format. Processing documents asynchronously allows your application to complete other tasks while it waits for the process to complete. You can use StartDocumentTextDetection and GetDocumentTextDetection to detect lines and words in a document or use StartDocumentAnalysis and GetDocumentAnalysis to detect lines, words, forms, and table data from a document.

The following diagram shows the workflow of an asynchronous API action. We use AWS Lambda as an example of the compute environment calling HAQM Textract, but the general concept applies to other compute environments as well.

  1. You start by calling the StartDocumentTextDetection or StartDocumentAnalysis API with an HAQM Simple Storage Service (HAQM S3) object location that you want to process, and a few additional parameters.
  2. HAQM Textract gets the document from the S3 bucket and starts a job to process the document.
  3. As the document is processed, HAQM Textract’s S3 bucket saves and encrypts the inference results and notifies you using an HAQM Simple Notification Service (HAQM SNS) topic.
  4. You can then call the corresponding GetDocumentTextDetection or GetDocumentAnalysis API to get the results in JSON format.

Store and encrypt output of asynchronous API in custom S3 bucket

When you start an HAQM Textract job by calling StartDocumentTextDetection or StartDocumentAnalysis, an optional parameter in the API action is called OutputConfig. This parameter allows you to specify the S3 bucket for storing the output. Another optional input parameter KMSKeyId allows you to specify the AWS KMS customer master key (CMK) to use to encrypt the output. The user calling the Start operation must have permission to use the specified CMK.

The following diagram shows the overall workflow when you use the output preference parameter with the HAQM Textract asynchronous API.

  1. You start by calling the StartDocumentTextDetection or StartDocumentAnalysis API with an S3 object location, output S3 bucket name, output prefix for S3 path and KMS key ID, and a few additional parameters.
  2. HAQM Textract gets the document from the S3 bucket and starts a job to process the document.
  3. As the document is processed, HAQM Textract stores the JSON output at the path in the output bucket and encrypts it using the KMS CMK that was specified in the start call.
  4. You get a job completion notification via HAQM SMS.
  5. You can then call the corresponding GetDocumentTextDetection GetDocumentAnalysis to get the JSON result. You can also get the JSON result directly from the output S3 bucket at the path with the following format: s3://{S3Bucket}/{S3Prefix}/{TextractJobId}/*.

Starting the asynchronous job with OutputConfig

The following code shows how you can start the asynchronous API job to analyze a document and store encrypted inference output in a custom S3 bucket:

import boto3
client = boto3.client('textract')
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'string',
            'Name': 'string',
            'Version': 'string'
        }
    },
    ...
    OutputConfig={
        'S3Bucket': 'string',
        'S3Prefix': 'string'
    },
    KMSKeyId='string'
)

The following code shows how you can get the results API job to analyze a document:

response = client.get_document_analysis(JobId='string',MaxResults=123,NextToken='string')

You can also use AWS SDK to download output directly from your custom S3 bucket.

The following table shows how the HAQM Textract output is stored and encrypted based on the provided input parameters of OutputConfig and KMSKeyId.

OutputConfig KMSKeyId HAQM Textract Output
None None HAQM Textract output is stored internally by HAQM Textract and encrypted using AWS owned CMK.
Customer’s S3 Bucket None HAQM Textract output is stored in customer’s S3 bucket and encrypted using SSE-S3
None Customer managed CMK HAQM Textract output is stored internally by HAQM Textract and encrypted using Customer managed CMK.
Customer’s S3 bucket Customer managed CMK HAQM Textract output is stored in customer’s S3 bucket and encrypted using Customer managed CMK.

IAM permissions

When you use the HAQM Textract APIs to start an analysis or detection job, you must have access to the S3 object specified in your call. To take advantage of output preferences to write the output to an encrypted object in HAQM S3, you must have the necessary permissions for both the target S3 bucket and the CMK specified when you call the analysis or detection APIs.

The following example AWS Identity and Access Management (IAM) identity policy allows you to get objects from the textract-input S3 bucket with a prefix:

{
"Sid":"AllowTextractUserToReadInputData",
"Action":["s3:GetObject"],
"Effect":"Allow",
"Resource":["arn:aws:s3:::textract-input/documents/*"]
}

The following IAM identity policy allows you to write output to the textract-output S3 bucket with a prefix:

{
"Sid":"AllowTextractUserToReadInputData",
"Action":["s3:GetObject"],
"Effect":"Allow",
"Resource":["arn:aws:s3:::textract-input/documents/*"]
}

When placing objects into HAQM S3 using SSE-KMS, you need specific permissions on the CMK. The following CMK policy language allows a user (textract-start) to use the CMK to protect the output files from an HAQM Textract analysis or detection job:

{
  "Sid": "Allow use of the key to write Textract output to S3",
  "Effect": "Allow",
  "Principal": {"AWS":"arn:aws:iam::111122223333:user/textract-start"},
  "Action": ["kms:DescribeKey","kms:GenerateDataKey", "kms:ReEncrypt", "kms:Decrypt"],
  "Resource": "*"
}

The following KMS key policy allows a user (textract-get) to get the output file that’s backed by SSE-KMS.

{
"Sid": "Allow use of the key to read S3 objects for output",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::111122223333:user/textract-get"},
"Action": ["kms:Decrypt","kms:DescribeKey"],
"Resource": "*"
}

You must still have separate sections of the key policy to allow the management of the key.

For some workloads, you may need to provide a record of actions taken by a user, role, or an AWS service in HAQM Textract. HAQM Textract is integrated with AWS CloudTrail, which captures all API calls for HAQM Textract as events. For more information, see Logging HAQM Textract API Calls with AWS CloudTrail.

AWS KMS and HAQM S3 provide similar integration with CloudTrail. For more information, see Logging AWS KMS API calls with AWS CloudTrail and Logging HAQM S3 API calls using AWS CloudTrail, respectively. To get log visibility into HAQM S3 GETs and PUTs, you can enable the data trail for HAQM S3. This enables you to have end-to-end visibility into your document-processing lifecycle.

Conclusion

In this post, we showed you how to use the HAQM Textract asynchronous API and your S3 bucket and AWS KMS CMK to store and encrypt the results of HAQM Textract output. We also highlighted how you can use CloudTrail integration to get visibility into your overall document processing lifecycle.

For more information about different security controls in HAQM Textract, see Security in HAQM Textract.

 


About the Authors

Kashif Imran is a Principal Solutions Architect at HAQM Web Services. He works with some of the largest AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement computer vision applications at scale. His expertise spans application architecture, serverless, containers, NoSQL and machine learning.

 

 

 

Peter M. O’Donnell is an AWS Principal Solutions Architect, specializing in security, risk, and compliance with the Strategic Accounts team. Formerly dedicated to a major US commercial bank customer, Peter now supports some of AWS’s largest and most complex strategic customers in security and security-related topics, including data protection, cryptography, incident response, and CISO engagement.