AWS Machine Learning Blog

Translating PDF documents using HAQM Translate and HAQM Textract

September 2024: This post was reviewed and updated for accuracy.

The Portable Document Format (PDF) was released in 1993. Since then, companies across various industries have been creating, scanning, and storing large volumes of documents in this digital format. These documents and the content within them are vital to supporting your business. Yet in many cases, the content is text-heavy and often written in a different language. This limits the flow of information and can directly influence your organization’s business productivity and global expansion strategy. To address this, you need an automated solution to extract the contents of these PDFs and translate them quickly and cost-efficiently.

In this post, we show you how to create an automated and serverless content-processing pipeline for analyzing text in PDF documents using HAQM Textract and translating them with HAQM Translate.

HAQM Textract automatically extracts text and data from scanned documents. HAQM Textract goes beyond simple OCR to also identify the contents of fields in forms and information stored in tables. This allows HAQM Textract to read virtually any type of document and accurately extract text and data without needing any manual effort or custom code.

Once the text and data are extracted, you can translate them with HAQM Translate, a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms. The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content. Its asynchronous batch processing capability enables you to translate a large collection of text or HTML documents with a single API call.

Solution overview

To be scalable and cost-effective, this solution uses serverless technologies and managed services. In addition to HAQM Textract and HAQM Translate, the solution uses the following services:

  • HAQM Simple Storage Service (HAQM S3) – Stores your documents and allows for central management with fine-tuned access controls.
  • HAQM Simple Notification Service (HAQM SNS) – Enables you to decouple microservices, distributed systems, and serverless applications with a highly available, durable, secure, fully managed pub/sub messaging service.
  • AWS Lambda – Runs code in response to triggers such as changes in data, changes in application state, or user actions. Because services like HAQM S3 and HAQM SNS can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • AWS Step Functions – Coordinates multiple AWS services into serverless workflows.

Solution architecture

The architecture workflow contains the following steps:

  1. Users upload a PDF for analysis to HAQM S3.
  2. The HAQM S3 upload triggers a Lambda function.
  3. The function invokes HAQM Textract to extract text from the PDF in batch mode.
  4. HAQM Textract sends an SNS notification when the job is complete.
  5. A Lambda function reads the HAQM Textract response and stores the extracted text in HAQM S3.
  6. The Lambda function from the previous step invokes HAQM Translate in batch mode to translate the extracted texts into the target language.
  7. The HAQM Translate job emits an HAQM EventBridge event when the job is complete, and the configured HAQM EventBridge rule triggers a Lambda function.
  8. A Lambda function reads the translated texts in HAQM S3 and generates a translated document in HAQM S3.

The following diagram illustrates this architecture.

Architecture diagram showing how uploading a PDF document to an S3 bucket triggers text extraction with HAQM Textract and translation with HAQM Translate.

For processing documents at scale, you can expand this solution to include HAQM Simple Queue Service (HAQM SQS) to queue the jobs and handle any potential failure related to throttling and default service concurrency limits. For more information about the limits in HAQM Translate and HAQM Textract, see Guidelines and Limits and Limits in HAQM Textract, respectively.

Deploying the solution with AWS CloudFormation

Prerequisite

The first step is to use an AWS CloudFormation template to provision the resources needed for the solution, including the AWS Identity and Access Management (IAM) roles, IAM policies, and SNS topics.

  1. Launch the AWS CloudFormation template by choosing the following (this creates the stack in the us-east-1 Region):
  2. For Stack name, enter a unique stack name for this account; for example, document-translate.
  3. For TargetLanguageCode, enter the language code that you want your translated documents in; for example, es for Spanish.

For more information about supported languages, see Supported Languages and Language Codes.

  4. In the Capabilities and transforms section, select the check boxes to acknowledge that AWS CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.

AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.

  5. Choose Create stack.

Screenshot showing the CloudFormation launch page with an example stack name and input parameters

The stack creation may take up to 20 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket on the Outputs tab.
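
If you would rather look up the bucket name from a script than from the console, the following is a minimal sketch using boto3. It assumes the stack was created in us-east-1 as described above. Because this post does not name the output key for the bucket, the helper returns every stack output as a dictionary rather than guessing a key.

```python
def outputs_to_dict(stack_description):
    """Flatten the Outputs list of one stack description into {key: value}."""
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stack_description.get("Outputs", [])
    }


def get_stack_outputs(stack_name, region="us-east-1"):
    """Fetch the outputs of a CloudFormation stack (requires AWS credentials)."""
    import boto3  # imported lazily so outputs_to_dict works without AWS access

    cloudformation = boto3.client("cloudformation", region_name=region)
    response = cloudformation.describe_stacks(StackName=stack_name)
    return outputs_to_dict(response["Stacks"][0])
```

For example, get_stack_outputs("document-translate") returns the outputs of the stack named in the earlier example.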

Translating the document

To translate your document, upload a document in English to the input folder of the S3 bucket you created in the previous step.

screenshot of S3 bucket uploaded with the document for translation

For this post, we scanned the “Universal Declaration of Human Rights,” created by the United Nations.

screenshot of a sample scanned document-UN Declaration of Human Rights in english

This upload event triggers the Lambda function <Stack name>-S3EventProcessor-<Random string>, which invokes the HAQM Textract startDocumentTextDetection API to extract the text from the scanned document.
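
To make this step concrete, here is a hedged Python sketch of what such a function might do. It is not the deployed function's code (the solution's own processing relies on Java, as noted later in this post), and the environment variable names SNS_TOPIC_ARN and TEXTRACT_ROLE_ARN are hypothetical placeholders for values the CloudFormation stack would supply.

```python
from urllib.parse import unquote_plus


def build_textract_request(s3_event_record, sns_topic_arn, role_arn):
    """Build keyword arguments for StartDocumentTextDetection from one S3 event record."""
    bucket = s3_event_record["s3"]["bucket"]["name"]
    # Object keys in S3 event notifications are URL-encoded (spaces arrive as '+').
    key = unquote_plus(s3_event_record["s3"]["object"]["key"])
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes the job-completion message to this SNS topic.
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }


def handler(event, context):
    """Lambda entry point: start one asynchronous Textract job per uploaded object."""
    import os
    import boto3  # imported lazily so the request builder can be tested offline

    textract = boto3.client("textract")
    job_ids = []
    for record in event["Records"]:
        request = build_textract_request(
            record,
            os.environ["SNS_TOPIC_ARN"],      # hypothetical variable name
            os.environ["TEXTRACT_ROLE_ARN"],  # hypothetical variable name
        )
        job_ids.append(textract.start_document_text_detection(**request)["JobId"])
    return job_ids
```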

When HAQM Textract completes the batch job, it sends an SNS notification. The notification triggers the Lambda function <Stack name>-TextractSNSEventProcessor-<Random string>, which processes the HAQM Textract response page by page, extracts the LINE block elements, and stores them in the S3 bucket.
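
The page-by-page processing can be sketched as follows. This is an illustrative Python version, not the solution's actual code. It reads the job ID from the SNS message, then follows NextToken through the paginated GetDocumentTextDetection results and keeps only the LINE blocks. The fetch_page callable is a hypothetical seam that lets the loop be exercised without calling AWS.

```python
import json


def parse_textract_notification(sns_event):
    """Return (job_id, status) from the SNS event that Lambda receives."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def collect_lines(fetch_page):
    """Gather LINE block text, grouped by page number.

    fetch_page(next_token) must return one GetDocumentTextDetection response.
    """
    lines_by_page = {}
    next_token = None
    while True:
        response = fetch_page(next_token)
        for block in response.get("Blocks", []):
            if block["BlockType"] == "LINE":
                page = block.get("Page", 1)
                lines_by_page.setdefault(page, []).append(block["Text"])
        next_token = response.get("NextToken")
        if not next_token:
            return lines_by_page


def textract_page_fetcher(job_id):
    """Create a fetch_page callable backed by the real Textract API."""
    import boto3  # imported lazily so collect_lines can be tested offline

    textract = boto3.client("textract")

    def fetch(next_token):
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        return textract.get_document_text_detection(**kwargs)

    return fetch
```

In a handler, you would check that the status is SUCCEEDED before calling collect_lines(textract_page_fetcher(job_id)).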

HAQM Textract extracts LINE block elements with a BoundingBox. A sentence in the scanned document can span multiple LINE block elements. To make sure that HAQM Translate has the entire sentence in scope for translation, the solution combines multiple LINE block elements to recreate the sentence boundaries of the source document. This is done by using the BreakIterator class available for Java. For more information, see Class BreakIterator.
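
The idea of rebuilding sentences from LINE blocks can be illustrated with a short Python sketch. Note that this swaps in a deliberately naive regular expression for Java's BreakIterator, so it only approximates what the solution does: it splits after a period, exclamation mark, or question mark followed by whitespace, and it will split incorrectly on abbreviations such as "U.N. Charter".

```python
import re

# Naive stand-in for BreakIterator: a sentence ends at '.', '!' or '?'
# when that character is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def lines_to_sentences(lines):
    """Join LINE texts into one string, then split it on sentence boundaries."""
    text = " ".join(line.strip() for line in lines if line.strip())
    return [sentence for sentence in _SENTENCE_END.split(text) if sentence]
```

Given the three lines "All human beings are born free and", "equal in dignity and rights. They are", and "endowed with reason and conscience.", the function returns two complete sentences, so each one reaches the translation step whole.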

The sentences are then stored in the S3 bucket as individual objects. Finally, the HAQM Translate job startTextTranslationJob is invoked with the input S3 bucket location where the text to be translated is available.
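
As a rough illustration of that call, the following Python sketch builds a StartTextTranslationJob request and submits it with boto3. The job name, S3 URIs, and role ARN are placeholders you would supply. The sketch assumes English source text stored as plain-text objects, in line with the walkthrough above.

```python
def build_translation_job_request(job_name, input_s3_uri, output_s3_uri,
                                  role_arn, target_lang, source_lang="en"):
    """Build keyword arguments for the StartTextTranslationJob API."""
    return {
        "JobName": job_name,
        # Folder containing the extracted sentences, stored as plain text.
        "InputDataConfig": {"S3Uri": input_s3_uri, "ContentType": "text/plain"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        # IAM role that HAQM Translate assumes to read and write the S3 locations.
        "DataAccessRoleArn": role_arn,
        "SourceLanguageCode": source_lang,
        "TargetLanguageCodes": [target_lang],
    }


def start_translation_job(request):
    """Submit the batch job and return its job ID (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder can be tested offline

    translate = boto3.client("translate")
    return translate.start_text_translation_job(**request)["JobId"]
```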

An HAQM EventBridge rule matches the HAQM Translate job status change event and triggers the Lambda function <Stack name>-TranslateJobEventProcessor-<Random string>. The function combines the translated texts created by the HAQM Translate batch job into an editable document and writes it to the output folder of the S3 bucket, using the following naming convention: inputFileName-TargetLanguageCode.docx.
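
The two decisions this function makes, whether the job has finished and what to name the output, can be sketched in Python. Both helpers rest on assumptions: the event fields (source of aws.translate and a detail.jobStatus of COMPLETED) reflect our understanding of the HAQM Translate event shape and should be checked against a real event, and we assume that inputFileName means the uploaded file name without its extension and that the folder is named output.

```python
import posixpath


def is_completed_translate_job(event):
    """Return True when an EventBridge event reports a finished Translate job.

    Assumed event shape: {"source": "aws.translate", "detail": {"jobStatus": ...}}.
    """
    return (
        event.get("source") == "aws.translate"
        and event.get("detail", {}).get("jobStatus") == "COMPLETED"
    )


def translated_document_key(input_key, target_lang, output_prefix="output"):
    """Apply the inputFileName-TargetLanguageCode.docx naming convention.

    Assumes inputFileName is the uploaded file's name without its extension.
    """
    base_name = posixpath.splitext(posixpath.basename(input_key))[0]
    return f"{output_prefix}/{base_name}-{target_lang}.docx"
```

Under these assumptions, a document uploaded as input/udhr.pdf and translated into Spanish would be written to output/udhr-es.docx.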

screenshot of S3 bucket output folder where translated document is located.

The following screenshot shows the document translated into Spanish.

screenshot of an input sample scanned document (UN Declaration of Human Rights) translated from English to Spanish as output

The solution also supports translating documents for right-to-left (RTL) script languages such as Arabic and Hebrew. The following screenshot shows the translated document in Arabic (language code: ar).

screenshot of an input sample scanned document (UN Declaration of Human Rights) translated from English to Arabic as output

For any pipeline failure, check the HAQM CloudWatch logs for the corresponding Lambda function and look for potential errors that caused the failure.
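
If you want to search the logs from a script, a minimal boto3 sketch might look like the following. It assumes the functions write to the default Lambda log group, /aws/lambda/<function name>, and that failures are logged with the word ERROR. Both are assumptions about this stack rather than facts stated in this post. The sketch reads only the first page of results.

```python
def error_log_query(function_name, pattern="ERROR"):
    """Build FilterLogEvents arguments for a Lambda function's default log group."""
    return {
        "logGroupName": f"/aws/lambda/{function_name}",
        "filterPattern": pattern,
    }


def recent_errors(function_name):
    """Return matching log messages from the first page of results."""
    import boto3  # imported lazily so error_log_query can be tested offline

    logs = boto3.client("logs")
    response = logs.filter_log_events(**error_log_query(function_name))
    return [event["message"] for event in response["events"]]
```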

To translate into a different language, update the LANG_CODE environment variable for the <Stack name>-TextractSNSEventProcessor-<Random string> function, then trigger the pipeline by uploading a new document into the input folder of the S3 bucket.
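
You can make this change in the Lambda console, or script it. The sketch below is one possible approach with boto3. Because UpdateFunctionConfiguration replaces the whole set of environment variables, the helper reads the current variables first and changes only LANG_CODE. The lambda_client parameter is expected to be a boto3 Lambda client, and function_name is the full name of the function described above.

```python
def set_lang_code(lambda_client, function_name, lang_code):
    """Set LANG_CODE on a Lambda function while keeping its other variables."""
    current = lambda_client.get_function_configuration(FunctionName=function_name)
    # Copy the existing variables so the update does not drop any of them.
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables["LANG_CODE"] = lang_code
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables
```

For example, set_lang_code(boto3.client("lambda"), function_name, "fr") switches the target language to French for documents uploaded afterward.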

Conclusion

In this post, we demonstrated how to extract text from PDF documents and translate them into an editable document in a different language using HAQM Translate asynchronous batch processing. For a low-latency, low-throughput solution translating smaller PDF documents, you can perform the translation through the real-time HAQM Translate API.
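
As a sketch of that real-time alternative, the following Python code groups sentences into chunks and sends each chunk to the TranslateText API. The 10,000-byte ceiling per request is our assumption about the service limit, so confirm it against the current HAQM Translate limits. A single sentence larger than the ceiling is passed through unchanged and would need further splitting.

```python
def batch_by_bytes(sentences, max_bytes=10000):
    """Group sentences into space-joined chunks of at most max_bytes (UTF-8)."""
    chunks, current, size = [], [], 0
    for sentence in sentences:
        needed = len(sentence.encode("utf-8")) + 1  # +1 for the joining space
        if current and size + needed > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += needed
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_sentences(sentences, target_lang, source_lang="en"):
    """Translate sentences synchronously, one request per chunk."""
    import boto3  # imported lazily so batch_by_bytes can be tested offline

    translate = boto3.client("translate")
    return [
        translate.translate_text(
            Text=chunk,
            SourceLanguageCode=source_lang,
            TargetLanguageCode=target_lang,
        )["TranslatedText"]
        for chunk in batch_by_bytes(sentences)
    ]
```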

The ability to process data at scale is becoming important to organizations across all industries. Managed machine learning services like HAQM Textract and HAQM Translate can simplify your document processing and translation needs, helping you focus on addressing core business needs while keeping overall IT costs manageable.


About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect for AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security. Outside of work, he enjoys outdoors activities and watching documentaries.

Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, Machine Learning, and Security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.

Erik Cordsen is a Solutions Architect at AWS serving customers in Georgia. He is passionate about applying cloud technologies and machine learning to solve real life problems. When he is not designing cloud solutions, Erik enjoys travel, cooking, and cycling.


Audit History

Last reviewed and updated in September 2024 by Erik Cordsen | Solutions Architect