AWS for Industries

FSI Service Spotlight: Featuring HAQM Textract

Editor’s note: This is the third in a monthly series for Financial Services Industry Service Spotlight.

Welcome to the Service Spotlight blog series. In this series, we plan to highlight five key considerations of a particular service that financial institutions should focus on to help streamline service approval. Each of the five areas will include specific guidance, which may need to be adapted to your specific use case and environment.

This edition of the Service Spotlight will feature HAQM Textract, a fully managed AI service that extracts text, handwriting, and other data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify and understand the relationships among the data in forms and tables.

Financial institutions are leveraging HAQM Textract for a number of workloads across banking, capital markets, and insurance. In banking, BlueVine is a financial technology company that provides financing to small- and medium-sized businesses, and it developed a product leveraging HAQM Textract to automate the processing of Paycheck Protection Program (PPP) loan applications. In capital markets, PitchBook is a Morningstar company that tracks every aspect of the public and private equity markets, including venture capital, private equity, and M&A. PitchBook uses HAQM Textract to improve its processing of PDF documents as a part of its research process by as much as 60%. Lastly, in insurance, nib Group is an Australian healthcare fund that provides insurance to over 1.6 million people. nib Group leverages HAQM Textract to automate its claims processing pipeline resulting in a great customer experience while increasing its operational efficiencies.

Achieving Compliance with HAQM Textract

Security is a shared responsibility between AWS and you. AWS is responsible for protecting the infrastructure that runs the AWS services in the AWS Cloud and also provides you with services that you can use securely. Your responsibility is determined by the AWS service that you use. On the customer’s side of the shared responsibility model, customers should first determine their requirements for network connectivity, encryption, and access to other AWS resources. We will dive deeper into those topics in the upcoming sections.

HAQM Textract falls under the scope of the following compliance programs with regard to the AWS side of the shared responsibility model.

  • SOC 1, 2, 3
  • PCI
  • ISO/IEC 27001:2013, 27017:2015, 27018:2019
  • ISO 9001:2015
  • OSPAR
  • MTCS

In the following sections, we will cover topics on the customer side of the shared responsibility model.

Data Protection with HAQM Textract

Encryption is typically employed to protect data. Although customers can access the HAQM Textract API using Transport Layer Security (TLS) 1.0, AWS recommends accessing the API over TLS 1.2 instead.

HAQM Textract also works in conjunction with AWS Key Management Service (AWS KMS), which lets customers specify how their data is encrypted while it’s being processed and how the results are encrypted, both within the HAQM Textract service and in HAQM Simple Storage Service (HAQM S3). If customers request HAQM Textract to analyze data through its synchronous API, their data is stored and processed only in memory.

Financial services customers may require that the underlying AWS services do not store any customer data for service improvements or may store data only if it is encrypted with customer-managed keys for temporary periods such as model training or processing. HAQM Textract supports the use of AWS KMS customer-managed keys to encrypt any data stored at rest for the duration of processing. Likewise, the data in the customer-managed output bucket can also be encrypted via a customer-managed key.
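As a rough sketch of how the customer-managed key and the customer-managed output bucket might be supplied on an asynchronous call, the helper below assembles request parameters for StartDocumentAnalysis. The helper function, bucket names, prefix, and key ARN are hypothetical placeholders; the parameter names (DocumentLocation, FeatureTypes, OutputConfig, KMSKeyId) follow the HAQM Textract API.

```python
from typing import Any, Dict, Sequence


def build_start_analysis_request(
    input_bucket: str,
    document_key: str,
    output_bucket: str,
    output_prefix: str,
    kms_key_id: str,
    feature_types: Sequence[str] = ("TABLES", "FORMS"),
) -> Dict[str, Any]:
    """Assemble StartDocumentAnalysis parameters (hypothetical helper)."""
    return {
        # Where Textract reads the source document from.
        "DocumentLocation": {
            "S3Object": {"Bucket": input_bucket, "Name": document_key}
        },
        "FeatureTypes": list(feature_types),
        # Customer-managed bucket that receives the job results.
        "OutputConfig": {"S3Bucket": output_bucket, "S3Prefix": output_prefix},
        # Customer-managed KMS key used to encrypt the stored results.
        "KMSKeyId": kms_key_id,
    }


# Placeholder values for illustration only.
params = build_start_analysis_request(
    input_bucket="example-input-bucket",
    document_key="filings/example.pdf",
    output_bucket="example-output-bucket",
    output_prefix="textract-results",
    kms_key_id="arn:aws:kms:us-west-2:111122223333:key/example-key-id",
)
# With boto3 installed and credentials configured, this could be passed as:
#   boto3.client("textract").start_document_analysis(**params)
```

Keeping the parameter assembly in one place makes it easier to review that every asynchronous job is started with both an output location and a key.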

Fig 1: Network diagram for HAQM Textract. Control plane API calls for HAQM Textract can be made via the VPC endpoint as detailed in the next section. For the data plane, HAQM Textract accesses data in customer S3 buckets.

Isolation of Compute Environments with HAQM Textract

HAQM Textract is a managed service that doesn’t have any compute resources in the customer’s side of the shared responsibility model. As a managed service, HAQM Textract is protected by the AWS global network security procedures that are described in the AWS Architecture Center: Security, Identity, & Compliance.

Customers can also establish a private connection between their VPC and HAQM Textract by creating an interface VPC endpoint. Interface endpoints are powered by AWS PrivateLink, a technology that enables you to privately access HAQM Textract APIs without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Instances in your VPC don’t need public IP addresses to communicate with HAQM Textract APIs. The use of interface VPC endpoints also ensures that traffic between your VPC and HAQM Textract does not leave the HAQM network.

HAQM Textract also supports policy enforcement on VPC endpoints to restrict usage of HAQM Textract within your VPC. The following is an example of an endpoint policy for HAQM Textract. When attached to a VPC endpoint, this policy grants access to the specified HAQM Textract actions for all principals, but only if the principal belongs to your AWS organization.

{
   "Statement":[
      {
         "Principal":"*",
         "Effect":"Allow",
         "Action":[
            "textract:StartDocumentTextDetection",
            "textract:AnalyzeDocument",
            "textract:DetectDocumentText",
            "textract:GetDocumentAnalysis",
            "textract:GetDocumentTextDetection",
            "textract:StartDocumentAnalysis"
         ],
         "Resource":"*",
         "Condition": {
            "StringEquals": {
                "aws:PrincipalOrgID": [
                    "o-aabbccxxyyzz"
                ]
            }
         }
      }
   ]
}
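The endpoint and its policy could be created together. The sketch below only builds the parameters that might be passed to the HAQM EC2 CreateVpcEndpoint API; the helper names and all IDs are hypothetical, and the service name is assumed to follow the usual com.amazonaws.<region>.textract pattern.

```python
import json
from typing import Any, Dict, List

TEXTRACT_ACTIONS = [
    "textract:StartDocumentTextDetection",
    "textract:AnalyzeDocument",
    "textract:DetectDocumentText",
    "textract:GetDocumentAnalysis",
    "textract:GetDocumentTextDetection",
    "textract:StartDocumentAnalysis",
]


def build_endpoint_policy(org_id: str) -> Dict[str, Any]:
    """Endpoint policy allowing the listed actions for one organization."""
    return {
        "Statement": [
            {
                "Principal": "*",
                "Effect": "Allow",
                "Action": TEXTRACT_ACTIONS,
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": [org_id]}},
            }
        ]
    }


def build_create_endpoint_params(
    vpc_id: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
    region: str,
    org_id: str,
) -> Dict[str, Any]:
    """Parameters for an interface endpoint (hypothetical helper)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumed service name pattern for the Textract endpoint.
        "ServiceName": f"com.amazonaws.{region}.textract",
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": security_group_ids,
        "PrivateDnsEnabled": True,
        # The API expects the policy as a JSON string.
        "PolicyDocument": json.dumps(build_endpoint_policy(org_id)),
    }


# Placeholder IDs for illustration only.
endpoint_params = build_create_endpoint_params(
    "vpc-0example", ["subnet-0example"], ["sg-0example"], "us-west-2", "o-aabbccxxyyzz"
)
# With boto3: boto3.client("ec2").create_vpc_endpoint(**endpoint_params)
```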

Automating Audits with APIs with HAQM Textract

Financial institutions may be required to periodically audit their AWS services for usage, user activities, and any resource changes as part of their standard IT security and compliance policies. API calls in your AWS environment made by users, IAM roles, or another AWS service can be logged using AWS CloudTrail. HAQM Textract supports the following API calls, which are logged as events in CloudTrail:

  • AnalyzeDocument
  • DetectDocumentText
  • StartDocumentAnalysis
  • StartDocumentTextDetection
  • GetDocumentAnalysis
  • GetDocumentTextDetection

AnalyzeDocument and DetectDocumentText are synchronous API calls, whereas the APIs beginning with “Start” are asynchronous APIs that require the user to provide an input and output data store to save the results of a job. You may, for example, want to ensure that when asynchronous API calls are made, users supply the appropriate output buckets in addition to KMS keys (using the KMSKeyId parameter) at the start of the call. For privacy reasons, CloudTrail will not log the image bytes or bounding box information, but rather log only the location of the document in HAQM S3.

For example, here is a sample CloudTrail log entry for a StartDocumentAnalysis API call that extracts tables and forms from an SEC filing for HAQM.

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "*****************",
        "arn": "arn:aws:iam::<account-num>:user/john.doe",
        "accountId": "<account-num>",
        "accessKeyId": "*************",
        "userName": "john.doe"
    },
    "eventTime": "2021-01-19T18:41:25Z",
    "eventSource": "textract.amazonaws.com",
    "eventName": "StartDocumentAnalysis",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "205.251.233.179",
    "userAgent": "aws-cli/1.18.192 Python/3.7.3 Darwin/18.7.0 botocore/1.19.32",
    "requestParameters": {
        "documentLocation": {
            "s3Object": {
                "bucket": "sagemaker-project-1234",
                "name": "HAQM10K.pdf"
            }
        },
        "featureTypes": [
            "TABLES",
            "FORMS"
        ],
        "notificationChannel": {
            "sNSTopicArn": "arn:aws:sns:us-west-2:<account-num>:Textract_analyze_document",
            "roleArn": "arn:aws:iam::<account-num>:role/service-role/HAQMSageMaker-ExecutionRole-20190823T110499"
        }
    },
    "responseElements": {
        "jobId": "73af8679a6e341ac37104df58a2a96a3a33aafb32df5ac8dd83608445a2f697f"
    },
    "requestID": "f3739fdb-9a47-4f7c-b211-f7273ff92efc",
    "eventID": "61697da4-f569-45ff-8a4f-ad769d1d0911",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "eventCategory": "Management",
    "recipientAccountId": "<account-num>"
}
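A log entry like this can be checked automatically. The following is a minimal sketch, not a definitive implementation, that flags asynchronous Start calls whose request parameters lack an output location or a KMS key. The function names are hypothetical, and because the exact casing of these fields in CloudTrail is an assumption here, keys are matched case-insensitively.

```python
from typing import Any, Dict, List

ASYNC_START_EVENTS = {"StartDocumentAnalysis", "StartDocumentTextDetection"}


def _has_key(params: Dict[str, Any], name: str) -> bool:
    """Case-insensitive key lookup (CloudTrail field casing is assumed)."""
    return any(key.lower() == name.lower() for key in params)


def audit_textract_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of findings for one CloudTrail event (empty if none)."""
    if event.get("eventSource") != "textract.amazonaws.com":
        return []
    if event.get("eventName") not in ASYNC_START_EVENTS:
        return []
    params = event.get("requestParameters") or {}
    findings = []
    if not _has_key(params, "OutputConfig"):
        findings.append("no customer output bucket (OutputConfig) supplied")
    if not _has_key(params, "KMSKeyId"):
        findings.append("no KMS key (KMSKeyId) supplied")
    return findings


# Trimmed version of a StartDocumentAnalysis event for illustration.
sample_event = {
    "eventSource": "textract.amazonaws.com",
    "eventName": "StartDocumentAnalysis",
    "requestParameters": {
        "documentLocation": {"s3Object": {"bucket": "b", "name": "doc.pdf"}},
        "featureTypes": ["TABLES", "FORMS"],
    },
}
for finding in audit_textract_event(sample_event):
    print(finding)
```

Applied to the sample entry above, which carries neither field, the check would report both findings.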

In addition to logging API calls, financial services customers may also want to automate the monitoring of any resource state changes or configuration changes for AWS services. AWS Config enables you to assess, audit, and evaluate the configurations of underlying services such as HAQM Textract using Custom Rules. Custom Rules are AWS Lambda functions that evaluate the underlying logic of the user-defined rule. The Lambda function can be triggered periodically by AWS Config; it validates the rule and returns the results to AWS Config.

Operational Access and Security with HAQM Textract

When processing datasets that are subject to PCI compliance, you may need to opt out of having your documents used to improve the quality of HAQM Textract. Many financial services customers choose either to contact AWS Support or to implement an organization-wide opt-out policy in AWS Organizations, attached to the organization root, that covers all applicable AI services:

{
    "services": {
        "@@operators_allowed_for_child_policies": ["@@none"],
        "default": {
            "@@operators_allowed_for_child_policies": ["@@none"],
            "opt_out_policy": {
                "@@operators_allowed_for_child_policies": ["@@none"],
                "@@assign": "optOut"
            }
        }
    }
}

Alternatively, you can restrict this to a single service such as HAQM Textract:

{
    "services": {
        "textract": {
            "opt_out_policy": {
                "@@assign": "optOut",
                "@@operators_allowed_for_child_policies": ["@@none"]
            }
        }
    }
}
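Both policy shapes can be generated from one helper. This is a sketch under stated assumptions: the function name is hypothetical, and the resulting JSON string would be supplied as the content of an AI services opt-out policy in AWS Organizations.

```python
import json
from typing import Any, Dict, Optional, Sequence


def build_opt_out_policy(services: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Build an opt-out policy for all AI services or for named ones."""
    lock = {"@@operators_allowed_for_child_policies": ["@@none"]}
    if not services:
        # Organization-wide default that child policies cannot override.
        return {
            "services": {
                **lock,
                "default": {
                    **lock,
                    "opt_out_policy": {**lock, "@@assign": "optOut"},
                },
            }
        }
    # Opt out only the named services, for example "textract".
    return {
        "services": {
            name: {"opt_out_policy": {"@@assign": "optOut", **lock}}
            for name in services
        }
    }


policy_content = json.dumps(build_opt_out_policy(["textract"]), indent=4)
print(policy_content)
# With boto3, the content could be passed to
#   boto3.client("organizations").create_policy(
#       Content=policy_content, Name=..., Description=...,
#       Type="AISERVICES_OPT_OUT_POLICY")
# and then attached to the organization root with attach_policy.
```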

Customers may also need to know who has access to what data and to limit user access as much as possible. For example, if the initial extraction of information from unstructured documents is performed by data engineers, who then hand off the results of the job to data scientists for training natural language processing (NLP) models, you may want to limit access to HAQM Textract APIs to data engineers using IAM permissions. In addition, using IAM global condition keys, you can ensure that these control plane API calls are made via your VPC endpoint. For example, the following IAM policy, which can be attached to the principal making API calls to HAQM Textract, denies any of the listed HAQM Textract API calls that do not arrive through the specified HAQM Textract VPC endpoint:

{
   "Statement":[
      {
         "Effect":"Deny",
         "Action":[
            "textract:StartDocumentTextDetection",
            "textract:AnalyzeDocument",
            "textract:DetectDocumentText",
            "textract:GetDocumentAnalysis",
            "textract:GetDocumentTextDetection",
            "textract:StartDocumentAnalysis"
         ],
         "Resource":"*",
         "Condition": {
                "StringNotEquals": {
                    "aws:sourceVpce": [
                        "vpce-111bbccc"
                    ]
                }
            }
      }
   ]
}

Service control policies (SCPs) are a type of organization policy that you can use to manage permissions in your organization. SCPs offer central control over the maximum available permissions for all accounts in your organization. SCPs help you to ensure your accounts stay within your organization’s access control guidelines.

See the following for an example SCP that restricts accounts to the synchronous HAQM Textract actions by denying the asynchronous ones.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow-Textract-Sync-Api-Only",
            "Effect": "Deny",
            "Action": [
                "textract:GetDocumentAnalysis",
                "textract:GetDocumentTextDetection",
                "textract:StartDocumentAnalysis",
                "textract:StartDocumentTextDetection"
            ],
            "Resource": "*"
        }
    ]
}

Bucket policies are resource-based policies that can be placed directly on a bucket to manage access to the data contained within it. You can restrict access to the data in your bucket to requests that arrive through your VPC endpoint or that are made by HAQM Textract on your behalf by using the global condition context key “aws:CalledViaFirst,” as shown in the following example:

{
   "Version": "2012-10-17",
   "Id": "Policy1415115909152",
   "Statement": [
     {
       "Sid": "Access-to-VPCE-and-Textract-Only",
       "Principal": "*",
       "Action": ["s3:GetObject",
                  "s3:PutObject"],
       "Effect": "Deny",
       "Resource": ["arn:aws:s3:::vpc-restricted-bucket",
                    "arn:aws:s3:::vpc-restricted-bucket/*"],
       "Condition": {
         "StringNotEquals": {
           "aws:sourceVpce": "vpce-01ad5da5",
           "aws:CalledViaFirst": "textract.amazonaws.com"
         }
       }
     }
   ]
}
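The condition block in this policy lists two keys under a single StringNotEquals operator. Multiple keys under one operator are combined with a logical AND, and a negated operator such as StringNotEquals also matches when its key is absent from the request. The small simulation below illustrates which requests the Deny statement would catch; it is only a model of the condition logic, with hypothetical function names, and not a full policy evaluator.

```python
from typing import Optional

ALLOWED_VPCE = "vpce-01ad5da5"
TEXTRACT = "textract.amazonaws.com"


def string_not_equals(actual: Optional[str], expected: str) -> bool:
    """Negated operator: true if the key is absent or the value differs."""
    return actual is None or actual != expected


def deny_applies(source_vpce: Optional[str], called_via_first: Optional[str]) -> bool:
    """True when both condition keys match, so the Deny takes effect."""
    return string_not_equals(source_vpce, ALLOWED_VPCE) and string_not_equals(
        called_via_first, TEXTRACT
    )


# Direct request through the approved VPC endpoint: not denied.
print(deny_applies("vpce-01ad5da5", None))
# Request made by HAQM Textract on the caller's behalf: not denied.
print(deny_applies(None, "textract.amazonaws.com"))
# Request from elsewhere, not via Textract: denied.
print(deny_applies(None, None))
```

A request is therefore blocked only when it neither comes through the endpoint nor is made via HAQM Textract.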

Conclusion

In this post, we reviewed HAQM Textract and highlighted key information that can help FSI customers accelerate the approval of the service within these five categories: achieving compliance, data protection, isolation of compute environments, automating audits with APIs, and operational access and security. While not a one-size-fits-all approach, the guidance can be adapted to meet your organization’s security and compliance requirements and provide a consolidated list of key areas for HAQM Textract.

In the meantime, be sure to visit our AWS Financial Services Industry blog channel and stay tuned for more financial services news and best practices.

Alvin Huang

Alvin Huang is a Capital Markets Specialist for Worldwide Financial Services Business Development at HAQM Web Services with a focus on data lakes and analytics, and artificial intelligence and machine learning. Alvin has over 19 years of experience in the financial services industry, and prior to joining AWS, he was an Executive Director at J.P. Morgan Chase & Co, where he managed the North America and Latin America trade surveillance teams and led the development of global trade surveillance. Alvin also teaches a Quantitative Risk Management course at Rutgers University and serves on the Rutgers Mathematical Finance Master’s program (MSMF) Advisory Board.

Gene Ting

Gene Ting is a principal solutions architect at HAQM Web Services. He is focused on helping enterprise customers build and operate workloads securely on AWS. In his free time, Gene enjoys teaching kids technology and sports, as well as following the latest on cybersecurity.

Stefan Natu

Stefan Natu is a Principal Machine Learning Specialist at HAQM Web Services. He is focused on helping enterprise customers build, secure and operationalize machine learning solutions on AWS. His academic background is in theoretical physics, and in the past, he worked on a number of data science problems in retail and energy verticals. In his spare time, he enjoys reading machine learning blogs, traveling, playing the guitar, and exploring the food scene in New York City.