AWS for Industries

Scalable Metadata Discovery Solution Accelerates Molecular Research

Healthcare and life sciences (HCLS) organizations are generating more data than ever as they integrate molecular data into their drug discovery, clinical development, molecular diagnostics, and population health analysis. Given this volume of data, customers are looking for well-organized, cost-effective storage that supports easy discovery and integration into their analysis ecosystem. Customers also want to aggregate the metadata about the stored files into a queryable format to power their data discovery and cohorting tools.

Amazon Web Services (AWS) HealthOmics is a managed service focused on accelerating scientific breakthroughs with fully managed biological data stores and workflows. The HealthOmics managed stores are purpose-built to organize and annotate molecular files for discoverability. They also make the files accessible through Amazon Simple Storage Service (Amazon S3) APIs to enable seamless integration into the user’s analysis ecosystem. This is paired with usage-driven tiering and compression to help drive additional cost savings. Because all these capabilities are available pre-built with HealthOmics, customers have adopted the purpose-built stores for their molecular data storage.

Customers told us they want to query across the metadata to power their discovery and cohorting tools. These tools are built to integrate with databases, so they can aggregate metadata from many sources and run queries based on any piece of metadata available. One service customers have adopted to power these tools is Amazon DynamoDB, a fully managed, serverless NoSQL database service.

We’ll present a framework that mirrors the sequence store metadata into an Amazon DynamoDB table. This allows customers to use the power of DynamoDB to dynamically define their cohorts for rapid discovery of data stored across all the sequence stores in a region.

The need for discoverable metadata

HCLS organizations are using cutting-edge, high-throughput lab techniques to observe cellular interactions with the goal of discovering new therapies and improving human health. The data generated by these techniques is only useful to the organization if it is discoverable. As part of their storage stack, many HCLS organizations have incorporated AWS HealthOmics sequence stores. Sequence stores hold the objects and their metadata as read sets. Read set metadata includes subject IDs, sample IDs, provenance, object relationships, and other information about the objects.

While HealthOmics has APIs for finding read sets of interest, some users expressed the need for richer query and filtering capabilities on the metadata. Additionally, users have asked for database tables to include the metadata, so the information can quickly be integrated with broader structured data lakes—often about samples and subjects.

These challenges can result in reduced productivity and increased operational costs, with organizations reporting up to 60 percent of researchers’ time spent on cleaning and organizing data and an average delay of 2-3 weeks in starting new research projects due to data accessibility issues. An automated metadata synchronization solution becomes crucial for organizations managing large-scale genomic data workflows.

We present a cost-effective, serverless solution that automatically synchronizes AWS HealthOmics sequence store metadata into an Amazon DynamoDB table. Having the extracted metadata in a DynamoDB table enables queries with sub-second response times, regardless of data volume. It also improves data accessibility through streamlined API calls and maintains near real-time metadata updates. This ensures researchers can instantly access critical sample information, reducing project start times from weeks to hours, while automated synchronization keeps operational overhead minimal.
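
To make the idea of a streamlined query concrete, here is a minimal sketch of the arguments one might pass to a DynamoDB query for all active read sets of a subject. The table layout is an assumption for illustration only: the attribute names `subject_id` and `status` and the index name `subject_id-index` are hypothetical and may not match the schema the sample solution actually creates.

```python
from typing import Any, Dict


def build_subject_query(subject_id: str, status: str = "ACTIVE") -> Dict[str, Any]:
    """Build keyword arguments for a boto3 DynamoDB Table.query call.

    Assumes a hypothetical global secondary index keyed on `subject_id`.
    """
    return {
        "IndexName": "subject_id-index",  # hypothetical index name
        "KeyConditionExpression": "#sid = :sid",
        "FilterExpression": "#st = :st",
        # STATUS is a DynamoDB reserved word, so it is referenced through an alias.
        "ExpressionAttributeNames": {"#sid": "subject_id", "#st": "status"},
        "ExpressionAttributeValues": {":sid": subject_id, ":st": status},
    }


# With boto3 installed, the arguments could be used like this
# (the table name is also hypothetical):
#   table = boto3.resource("dynamodb").Table("read-set-metadata")
#   items = table.query(**build_subject_query("SUBJ-001"))["Items"]
```

Keeping the query construction in a plain function makes it easy to unit test without AWS credentials. Note that a filter expression is applied after the key lookup, so a cohort tool that filters heavily on one attribute may be better served by an index on that attribute.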

Make HealthOmics read set metadata queryable with DynamoDB tables

The solution assumes that you have an AWS account with appropriate permissions, an existing AWS HealthOmics sequence store and a basic understanding of AWS services and APIs.

The key components of the solution include:

  1. Amazon EventBridge to capture read set status change events
  2. Amazon Simple Queue Service (Amazon SQS) to provide reliable event buffering and retry capabilities
  3. AWS Lambda to process events and update metadata
  4. Amazon DynamoDB to store queryable metadata

The following figure provides an overview of the solution.

The architecture shows events generated by an AWS HealthOmics sequence store flowing through Amazon EventBridge into Amazon SQS, triggering an AWS Lambda function that writes, updates, or deletes Amazon DynamoDB table entries. Events are processed based on read set status changes, ensuring near real-time metadata synchronization.

Figure 1: Solution architecture for the AWS HealthOmics sequence store metadata synchronization solution

Here is how this solution works.

1. AWS HealthOmics sequence store generates read set status update events.

2. Amazon EventBridge rules route events to an Amazon SQS FIFO queue. The FIFO queue preserves message order and provides scalable, exactly-once processing.

3. An AWS Lambda function processes events based on status:

  • ACTIVE: Refreshes the read set’s complete metadata.
  • ACTIVATING/ARCHIVED/DELETING: Updates the read set’s status and timestamp.
  • DELETED: Removes the read set’s entry.
  • All other statuses are ignored.

4. Amazon DynamoDB maintains the current state of all metadata across all sequence stores in a region at a read set level.
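
The status handling in step 3 can be sketched as a small dispatch function. This is an illustrative sketch, not the sample solution's code: a plain dictionary stands in for the DynamoDB table, `fetch_metadata` is a hypothetical callable standing in for the HealthOmics API calls, and the event field names (`time`, `detail.sequenceStoreId`, `detail.id`, `detail.status`) are assumptions about the event shape.

```python
from typing import Any, Callable, Dict, Tuple

Key = Tuple[str, str]  # (sequence store ID, read set ID)

FULL_REFRESH = {"ACTIVE"}
STATUS_ONLY = {"ACTIVATING", "ARCHIVED", "DELETING"}
REMOVE = {"DELETED"}


def plan_action(status: str) -> str:
    """Map a read set status to the action described in the list above."""
    if status in FULL_REFRESH:
        return "refresh"
    if status in STATUS_ONLY:
        return "status_only"
    if status in REMOVE:
        return "remove"
    return "ignore"


def apply_event(
    event: Dict[str, Any],
    table: Dict[Key, Dict[str, Any]],
    fetch_metadata: Callable[[str, str], Dict[str, Any]],
) -> str:
    """Apply one status change event to the in-memory stand-in table."""
    detail = event["detail"]  # assumed field names, see lead-in
    key = (detail["sequenceStoreId"], detail["id"])
    action = plan_action(detail["status"])
    if action == "refresh":
        # Replace the whole item with freshly fetched metadata.
        item = dict(fetch_metadata(*key))
        item.update(status=detail["status"], updated_at=event["time"])
        table[key] = item
    elif action == "status_only":
        # Touch only the status and timestamp, keeping other attributes.
        table.setdefault(key, {}).update(
            status=detail["status"], updated_at=event["time"]
        )
    elif action == "remove":
        table.pop(key, None)
    return action
```

Separating the decision (`plan_action`) from the side effects keeps the status rules testable on their own. In a deployed function, the dictionary operations would be replaced by DynamoDB put, update, and delete calls.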

The sample solution and additional technical details are available as an open-source CDK application. The solution also includes flexible deployment options and a utility backfill script for prepopulating the DynamoDB table with existing read set information.
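
For readers who want a feel for what a backfill involves, the following is a rough sketch and not the script shipped with the sample application. It assumes `omics` is a boto3 HealthOmics client (for example `boto3.client("omics")`) and uses its `list_sequence_stores` and `list_read_sets` paginators. The `put_item` callable and the item attribute names are hypothetical placeholders for a write to the metadata table.

```python
from typing import Any, Callable, Dict, Iterator


def iter_read_sets(omics: Any) -> Iterator[Dict[str, Any]]:
    """Yield every read set summary from every sequence store in the region."""
    stores = omics.get_paginator("list_sequence_stores")
    read_sets = omics.get_paginator("list_read_sets")
    for store_page in stores.paginate():
        for store in store_page["sequenceStores"]:
            for page in read_sets.paginate(sequenceStoreId=store["id"]):
                yield from page["readSets"]


def backfill(omics: Any, put_item: Callable[[Dict[str, Any]], None]) -> int:
    """Write one item per existing read set and return how many were written."""
    count = 0
    for read_set in iter_read_sets(omics):
        put_item(
            {
                # Attribute names below are illustrative, not the solution's schema.
                "sequence_store_id": read_set["sequenceStoreId"],
                "read_set_id": read_set["id"],
                "status": read_set["status"],
                "subject_id": read_set.get("subjectId"),
                "sample_id": read_set.get("sampleId"),
            }
        )
        count += 1
    return count
```

With a real table, `put_item` could be something like `lambda item: table.put_item(Item=item)`. The sketch copies only summary fields; the full per-read-set metadata would need the additional API calls the solution uses.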

Additional considerations for the solution

1. API Limits: The solution relies on calling the GetReadSetMetadata, ListTagsForResource, and GetSequenceStore APIs to construct the read set metadata. If you experience Lambda function throttling, you can request service limit increases through a support request.
2. Performance Optimization:

  1. Set an appropriate visibility timeout for your Amazon SQS queue. It should be longer than your Lambda timeout.
  2. Use Amazon SQS batching, with up to 10 events per Lambda invocation, to optimize compute needs.
  3. DynamoDB on-demand offers a truly serverless database experience that automatically scales to accommodate demanding read and write loads without capacity planning. Consider an efficient backfill strategy for existing data in your AWS HealthOmics sequence store.

3. Cost Considerations: Because this solution uses a serverless architecture, you pay only for what you use, with no upfront commitment. Costs are directly tied to actual usage: DynamoDB charges are based on storage consumption and read/write operations (with a choice of on-demand or provisioned capacity). Amazon SQS pricing depends on the number of API requests and message retention. AWS Lambda costs are calculated per invocation and by compute time (in milliseconds), based on the memory allocated. All services include a free tier, and additional charges may apply for data transfer, enhanced features, or cross-region operations. See the AWS pricing page for details.
4. Metadata Considerations: If the backfill script is run while a read set is archived, the file information for that read set will not be populated. The file information becomes available only once the read set is activated.
5. Design Considerations:

  1. For multi-modal data discovery across data stored in AWS HealthOmics, AWS HealthLake, AWS HealthImaging, and other Amazon storage services, consider using the Multi-Modal Data Analysis with AWS Health and machine learning (ML) Services solution.
  2. Amazon DynamoDB is a fully managed NoSQL database service. For workloads that demand SQL and relational database functionality, explore other AWS offerings such as Amazon Aurora, Amazon Relational Database Service (Amazon RDS), or alternative database services that better align with your requirements.
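
As an illustration of the batching point under Performance Optimization, here is a minimal sketch of a Lambda handler body that processes an SQS batch and reports partial failures. It assumes the event source mapping has `ReportBatchItemFailures` enabled and that each message body is the JSON event forwarded by EventBridge. `process` is a hypothetical callable that handles one event. Because the queue is FIFO, the sketch stops at the first failure and reports that message and every later one, so ordering is preserved on retry.

```python
import json
from typing import Any, Callable, Dict, List


def handle_batch(
    event: Dict[str, Any], process: Callable[[Dict[str, Any]], None]
) -> Dict[str, List[Dict[str, str]]]:
    """Process SQS records in order and return a partial batch response."""
    records = event.get("Records", [])
    failures: List[Dict[str, str]] = []
    for index, record in enumerate(records):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report the failed message and all unprocessed messages after it,
            # so the FIFO queue redelivers them in their original order.
            failures = [
                {"itemIdentifier": r["messageId"]} for r in records[index:]
            ]
            break
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list signals that the whole batch succeeded. A production handler would also log the exception; that is omitted here to keep the sketch short.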

Conclusion

We demonstrated how this serverless and scalable metadata synchronization solution delivers improved data findability and accessibility. All of this is done through streamlined API queries against an Amazon DynamoDB table, with near real-time metadata updates, for AWS HealthOmics sequence store read sets. To get started with the solution, visit the AWS Sample HealthOmics metadata sync GitHub.

Visit AWS HealthOmics or AWS for Healthcare & Life Sciences to learn more. Contact an AWS Representative to learn how we can help accelerate your business.

Anuj Patel

Anuj Patel is a Senior Solutions Architect at AWS. He has an M.S. in Computer Science and over two decades of experience in software design and development. He helps customers in the life sciences industry through their AWS journey. His passion lies in simplifying complex problems and enabling customers to unlock the full potential of AWS solutions.

Domen Jemec

Domen Jemec is a Senior Product Manager for AWS HealthOmics. He is responsible for listening to customers, defining requirements, and ensuring that Amazon Web Services (AWS) helps customers advance scientific discovery and precision medicine. Domen has over 10 years of experience delivering technical and machine learning solutions across many industries including Biotech, Life Science, Healthcare, Pharmaceutical and Medical Device. He also continues to pursue research at the junction of computational biology and machine learning.