AWS Machine Learning Blog
Translating PDF documents using HAQM Translate and HAQM Textract
September 2024: This post was reviewed and updated for accuracy.
In 1993, the Portable Document Format (PDF) was released to the world. Since then, companies across various industries have been creating, scanning, and storing large volumes of documents in this digital format. These documents and the content within them are vital to supporting your business. Yet in many cases, the content is text-heavy and often written in a different language. This limits the flow of information and can directly influence your organization’s business productivity and global expansion strategy. To address this, you need an automated solution to extract the contents of these PDFs and translate them quickly and cost-efficiently.
In this post, we show you how to create an automated and serverless content-processing pipeline for analyzing text in PDF documents using HAQM Textract and translating them with HAQM Translate.
HAQM Textract automatically extracts text and data from scanned documents. HAQM Textract goes beyond simple OCR to also identify the contents of fields in forms and information stored in tables. This allows HAQM Textract to read virtually any type of document and accurately extract text and data without needing any manual effort or custom code.
Once the text and data are extracted, you can translate them with HAQM Translate, a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms. The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content. Its asynchronous batch processing capability enables you to translate a large collection of text or HTML documents with a single API call.
Solution overview
To be scalable and cost-effective, this solution uses serverless technologies and managed services. In addition to HAQM Textract and HAQM Translate, the solution uses the following services:
- HAQM Simple Storage Service (HAQM S3) – Stores your documents and allows for central management with fine-tuned access controls.
- HAQM Simple Notification Service (HAQM SNS) – Enables you to decouple microservices, distributed systems, and serverless applications with a highly available, durable, secure, fully managed pub/sub messaging service.
- AWS Lambda – Runs code in response to triggers such as changes in data, changes in application state, or user actions. Because services like HAQM S3 and HAQM SNS can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
- AWS Step Functions – Coordinates multiple AWS services into serverless workflows.
Solution architecture
The architecture workflow contains the following steps:
- Users upload a PDF for analysis to HAQM S3.
- The HAQM S3 upload triggers a Lambda function.
- The function invokes HAQM Textract to extract text from the PDF in batch mode.
- HAQM Textract sends an SNS notification when the job is complete.
- A Lambda function reads the HAQM Textract response and stores the extracted text in HAQM S3.
- The Lambda function from the previous step invokes HAQM Translate in batch mode to translate the extracted texts into the target language.
- The HAQM Translate job emits an HAQM EventBridge event when the job is complete, and the configured EventBridge rule triggers a Lambda function.
- A Lambda function reads the translated texts in HAQM S3 and generates a translated document in HAQM S3.
The following diagram illustrates this architecture.
For processing documents at scale, you can expand this solution to include HAQM Simple Queue Service (HAQM SQS) to queue the jobs and handle any potential failure related to throttling and default service concurrency limits. For more information about the limits in HAQM Translate and HAQM Textract, see Guidelines and Limits and Limits in HAQM Textract, respectively.
Deploying the solution with AWS CloudFormation
Prerequisite
- AWS CloudTrail should be enabled.
The first step is to use an AWS CloudFormation template to provision the resources needed for the solution, including the AWS Identity and Access Management (IAM) roles, IAM policies, and SNS topics.
- Launch the AWS CloudFormation template by choosing the following (this creates the stack in the us-east-1 Region):
- For Stack name, enter a unique stack name for this account; for example, document-translate.
- For TargetLanguageCode, enter the language code that you want your translated documents in; for example, es for Spanish.
For more information about supported languages, see Supported Languages and Language Codes.
- In the Capabilities and transforms section, select the check boxes to acknowledge that AWS CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.
AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.
- Choose Create stack.
The stack creation may take up to 20 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket on the Outputs tab.
Translating the document
To translate your document, upload a document in English to the input folder of the S3 bucket you created in the previous step.
For this post, we scanned the “Universal Declaration of Human Rights,” created by the United Nations.
This upload event triggers the Lambda function <Stack name>-S3EventProcessor-<Random string>, which invokes the HAQM Textract startDocumentTextDetection API to extract the text from the scanned document.
When HAQM Textract completes the batch job, it sends an SNS notification. The notification triggers the Lambda function <Stack name>-TextractSNSEventProcessor-<Random string>, which processes the HAQM Textract response page by page, extracts the LINE block elements, and stores them in the S3 bucket.
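The page-by-page extraction of LINE blocks can be sketched as follows. This is a minimal illustration, not the code deployed by the stack: the Block class is a hypothetical stand-in that carries only the block type, page number, and text of a Textract block, and LineBlockExtractor and linesByPage are names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LineBlockExtractor {
    /** Hypothetical stand-in for a Textract block; only the fields used here. */
    static final class Block {
        final String blockType; // for example "PAGE", "LINE", or "WORD"
        final int page;         // 1-based page number
        final String text;      // detected text, may be null for PAGE blocks

        Block(String blockType, int page, String text) {
            this.blockType = blockType;
            this.page = page;
            this.text = text;
        }
    }

    /** Keeps only LINE blocks and groups their text by page, in page order. */
    static Map<Integer, List<String>> linesByPage(List<Block> blocks) {
        Map<Integer, List<String>> pages = new TreeMap<>();
        for (Block b : blocks) {
            if ("LINE".equals(b.blockType)) {
                pages.computeIfAbsent(b.page, p -> new ArrayList<>()).add(b.text);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
            new Block("PAGE", 1, null),
            new Block("LINE", 1, "Article 1"),
            new Block("WORD", 1, "Article"),
            new Block("LINE", 2, "Article 2"));
        // Prints one entry per page with the LINE texts found on it.
        linesByPage(blocks).forEach((page, lines) -> System.out.println(page + ": " + lines));
    }
}
```

WORD blocks are skipped because each LINE block already contains the text of its words.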
HAQM Textract extracts LINE block elements with a BoundingBox. A sentence in the scanned document results in multiple LINE block elements. To make sure that HAQM Translate has the entire sentence in scope for translation, the solution combines multiple LINE block elements to recreate the sentence boundary in the source document. This is done by using the BreakIterator class available for Java. For more information, see Class BreakIterator.
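As an illustration of sentence recombination with BreakIterator, the following sketch joins line fragments with spaces and then splits the joined text on sentence boundaries. It is a simplified example under assumptions, not the solution's actual code: SentenceJoiner and toSentences are hypothetical names, and real Textract output may need extra handling for hyphenation, headings, and page breaks.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceJoiner {
    /** Joins OCR line fragments with spaces, then splits on sentence boundaries. */
    static List<String> toSentences(List<String> lines, Locale locale) {
        String text = String.join(" ", lines);
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk each boundary pair [start, end) reported by the iterator.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Three LINE fragments that together contain two sentences.
        List<String> lines = List.of(
            "All human beings are born free and equal",
            "in dignity and rights. They are endowed with",
            "reason and conscience.");
        toSentences(lines, Locale.ENGLISH).forEach(System.out::println);
    }
}
```

Passing the source-language locale matters because sentence boundary rules differ between languages.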
The sentences are then stored in the S3 bucket as individual objects. Finally, the HAQM Translate startTextTranslationJob API is invoked with the S3 location of the text to be translated.
The HAQM EventBridge rule configured for the HAQM Translate job status change event triggers the Lambda function <Stack name>-TranslateJobEventProcessor-<Random string>. The function creates the editable document by combining the translated texts created by the HAQM Translate batch job, and writes it to the output folder of the S3 bucket with the following naming convention: inputFileName-TargetLanguageCode.docx.
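The naming convention can be illustrated with a small helper. This is a sketch only: OutputNaming and translatedName are hypothetical names, and the example assumes that inputFileName means the uploaded file's name without its folder prefix or extension, which the post does not state explicitly.

```java
class OutputNaming {
    /**
     * Builds inputFileName-TargetLanguageCode.docx from an S3 object key.
     * Assumption: the folder prefix and the original extension are dropped.
     */
    static String translatedName(String inputKey, String targetLanguageCode) {
        String fileName = inputKey.substring(inputKey.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        return base + "-" + targetLanguageCode + ".docx";
    }

    public static void main(String[] args) {
        // Example key and language code chosen for illustration.
        System.out.println(translatedName("input/declaration.pdf", "es"));
    }
}
```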
The following screenshot shows the document translated in Spanish.
The solution also supports translating documents for right-to-left (RTL) script languages such as Arabic and Hebrew. The following screenshot shows the translated document in Arabic (language code: ar).
For any pipeline failure, check the HAQM CloudWatch logs for the corresponding Lambda function and look for potential errors that caused the failure.
To translate into a different language, you can update the LANG_CODE environment variable for the <Stack name>-TextractSNSEventProcessor-<Random string> function and trigger the solution pipeline by uploading a new document into the input folder of the S3 bucket.
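For illustration, a function could read the target language from its environment roughly as follows. This is a hedged sketch rather than the deployed function's code: TargetLanguage and fromEnv are hypothetical names, and the environment is passed in as a map so the logic can be exercised without changing real environment variables (in a Lambda function you would pass System.getenv()).

```java
import java.util.Map;

class TargetLanguage {
    /** Returns the LANG_CODE value, failing fast if it is missing or blank. */
    static String fromEnv(Map<String, String> env) {
        String code = env.get("LANG_CODE");
        if (code == null || code.trim().isEmpty()) {
            throw new IllegalStateException("LANG_CODE environment variable is not set");
        }
        return code.trim();
    }

    public static void main(String[] args) {
        // Simulated environment; a real function would use System.getenv().
        System.out.println(fromEnv(Map.of("LANG_CODE", "ar")));
    }
}
```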
Conclusion
In this post, we demonstrated how to extract text from PDF documents and translate them into an editable document in a different language using HAQM Translate asynchronous batch processing. For a low-latency, low-throughput solution translating smaller PDF documents, you can perform the translation through the real-time HAQM Translate API.
The ability to process data at scale is becoming important to organizations across all industries. Managed machine learning services like HAQM Textract and HAQM Translate can simplify your document processing and translation needs, helping you focus on addressing core business needs while keeping overall IT costs manageable.
For further reading, we recommend the following:
- Asynchronous Batch Processing
- Detecting and Analyzing Text in Multipage Documents
- Translating documents with HAQM Translate, AWS Lambda, and the new Batch Translate API
- Automatically extract text and structured data from documents with HAQM Textract
- Getting a batch job completion message from HAQM Translate
About the Authors
Siva Rajamani is a Boston-based Enterprise Solutions Architect for AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security. Outside of work, he enjoys outdoor activities and watching documentaries.
Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, Machine Learning, and Security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.
Erik Cordsen is a Solutions Architect at AWS serving customers in Georgia. He is passionate about applying cloud technologies and machine learning to solve real life problems. When he is not designing cloud solutions, Erik enjoys travel, cooking, and cycling.
Audit History
Last reviewed and updated in September 2024 by Erik Cordsen | Solutions Architect