AWS Marketplace
Masking Patient Data with DataMasque’s template for Amazon HealthLake
Healthcare organizations are moving their healthcare data to AWS in order to use the latest AWS services to improve care and provide more elegant patient and clinician experiences. However, regulations like the United States’ Health Insurance Portability and Accountability Act (HIPAA) and Europe’s General Data Protection Regulation (GDPR) require organizations to protect sensitive patient health information and disclose how health data will be used. For healthcare customers building solutions with clinical data, this means you must provide your developers, analysts, researchers, and others with high-quality, production-realistic data to perform their jobs effectively, while also ensuring data is secure at all times.
De-identifying Protected Health Information (PHI) is a process used to accomplish this. If not done properly, however, de-identification can lead to the proliferation of PHI into non-regulated environments and increase the likelihood of a privacy or data breach. You must manage PHI to meet regulatory and compliance requirements in a way that still enables the organization to innovate and solve its clinical challenges.
In this post, Brian, Snehanshu, and I show you how to mask healthcare data for regulatory compliance using Amazon HealthLake and DataMasque.
Why use Amazon HealthLake with DataMasque
Amazon HealthLake is a HIPAA-eligible AWS service for analyzing healthcare data at scale. It uses the Fast Healthcare Interoperability Resources (FHIR) standard and enables customers to run SQL queries, build dashboards, and create models on their clinical data. If you want to use Amazon HealthLake with de-identified data, DataMasque can help. DataMasque is an AWS Partner and an Amazon HealthLake partner with a proven data masking solution.
DataMasque’s FHIR masking solution helps health organizations meet HIPAA and GDPR requirements by removing PHI and PII from databases and S3 buckets and replacing it with synthetic or masked data. With DataMasque’s template for FHIR Patient resources, customers can start protecting the 18 identifiers that HIPAA defines as PHI right away, while still getting the clinical value from Amazon HealthLake.
The following architecture diagram shows unmasked PHI going through Amazon HealthLake and into DataMasque. DataMasque outputs masked production data, which can then be used in Amazon SageMaker, Amazon QuickSight, and AWS Lake Formation.
DataMasque FHIR masking solution overview
With DataMasque, customers have full control over where and what data to mask. DataMasque is deployed on a virtual machine and can run either on premises or in the AWS Cloud.
Before sending sensitive data to Amazon HealthLake, organizations can either mask the data on premises or mask the data in their AWS environment.
1. Mask on-premises
The following diagram shows an organization with sensitive data such as FHIR resources, HL7 CDA, insurance claims, and related information stored on its on-premises infrastructure. This organization can mask the data on premises. The masked data can be saved on premises in RDBMS, data warehouse, Parquet, Avro, JSON, XML, CSV, and fixed-width formats.
This organization may also want to send data to AWS to enhance the experiences of patients and clinicians. To protect the privacy and security of this data, it must be masked before it is pushed into Amazon HealthLake. In this scenario, you mask the data on premises and send only the masked data to Amazon HealthLake.
2. Mask in your AWS environment
In the following diagram, the customer loads unmasked data into Amazon HealthLake and then exports it to an Amazon S3 bucket using Amazon HealthLake’s import and export functionality. Once the data is in S3, DataMasque masks the FHIR data and stores the masked data in a “clean” Amazon S3 bucket. From there, the masked data is imported into Amazon HealthLake, where it can be used by HealthLake and all downstream AWS services.
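The export step in this scenario can also be scripted. The following is a minimal sketch, assuming the boto3 `healthlake` client and its `start_fhir_export_job` operation; the data store ID, bucket URI, KMS key, and IAM role shown in the usage comment are placeholders, not values from this post.

```python
import uuid


def build_export_request(datastore_id, output_s3_uri, kms_key_id, role_arn,
                         job_name="export-for-masking"):
    """Build the keyword arguments for a HealthLake FHIR export job."""
    return {
        "JobName": job_name,  # hypothetical job name
        "DatastoreId": datastore_id,
        "DataAccessRoleArn": role_arn,
        "OutputDataConfig": {
            "S3Configuration": {"S3Uri": output_s3_uri, "KmsKeyId": kms_key_id}
        },
        "ClientToken": str(uuid.uuid4()),  # idempotency token
    }


def start_export(healthlake_client, request):
    """Start the export and return the job ID reported by HealthLake."""
    response = healthlake_client.start_fhir_export_job(**request)
    return response["JobId"]


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   client = boto3.client("healthlake")
#   request = build_export_request(
#       "<datastore-id>", "s3://<unmasked-bucket>/export/",
#       "<kms-key-arn>", "<data-access-role-arn>")
#   job_id = start_export(client, request)
```

The exported files land in the unmasked bucket, which then serves as the DataMasque source connection.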
3. Use masked data in downstream AWS services
In both scenarios, once the masked data is loaded into Amazon HealthLake, it is automatically added to your AWS Glue Data Catalog, which allows you to use a range of downstream AWS services. You can now use this masked data in Amazon SageMaker, Amazon QuickSight, Amazon RDS, Amazon Aurora, AWS Lake Formation, and AWS Data Exchange. Refer to the following diagram.
By using DataMasque in this architecture, you can transfer FHIR data from any environment, on premises or on AWS, to a common “clean” Amazon S3 bucket. From there, you can use Amazon HealthLake on that data, as well as all of the other AWS services downstream from HealthLake. In this example, we ran DataMasque on a FHIR Patient resource, and it changed the PHI while still maintaining the clinical value of the data. We then loaded that data into Amazon HealthLake, and it automatically became part of the AWS Glue Data Catalog, enabling us to use it in downstream AWS services.
Prerequisites
To start masking your FHIR data with DataMasque, you need the following AWS resources:
- An AWS Account
- An S3 bucket with unmasked data in FHIR R4 format. (You can download sample FHIR data here).
- An empty S3 bucket with public access disabled
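The destination bucket requirement can be checked from code. The following is a minimal sketch, assuming the boto3 S3 client operations `put_public_access_block` and `list_objects_v2`; the helper name and bucket name are hypothetical.

```python
# Block every form of public access on the bucket.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def prepare_destination_bucket(s3_client, bucket):
    """Disable public access and report whether the bucket is empty."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK,
    )
    # Asking for a single key is enough to tell empty from non-empty.
    listing = s3_client.list_objects_v2(Bucket=bucket, MaxKeys=1)
    return listing.get("KeyCount", 0) == 0


# Usage (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   import boto3
#   is_empty = prepare_destination_bucket(boto3.client("s3"), "<masked-bucket>")
```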
Solution walkthrough: Masking patient data with DataMasque’s template for Amazon HealthLake
You can mask FHIR data either before or after the data is processed by Amazon HealthLake. To mask FHIR patient data using DataMasque’s built-in FHIR Patient masking template, follow these steps:
1. Deploy DataMasque in your AWS environment
- Sign in to your AWS account. In a browser, navigate to AWS Marketplace and search for DataMasque or follow this link: DataMasque PHI Masking. In the upper right, choose Continue to Subscribe and follow the subscription wizard.
- To access the deployed DataMasque instance, in a browser, do the following:
- Open the Amazon EC2 console.
- In the navigation pane, choose the EC2 instance hosting DataMasque.
- In the Details pane, copy the Public IPv4 or Private IPv4 address.
- In a web browser tab, open the following URL, replacing <instance-ip-address> with the IP address copied in step 1.2.c: http://<instance-ip-address>
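If you prefer to look up the address programmatically, the sketch below derives the DataMasque URL from an EC2 `describe_instances` response. It assumes the standard boto3 response shape (`Reservations` containing `Instances` with `PublicIpAddress` and `PrivateIpAddress`); the function name and instance ID are hypothetical.

```python
def datamasque_urls(describe_instances_response):
    """Return an http://<ip> URL for each instance in the response.

    The public address is preferred; the private address is the fallback.
    """
    urls = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ip = instance.get("PublicIpAddress") or instance.get("PrivateIpAddress")
            if ip:
                urls.append(f"http://{ip}")
    return urls


# Usage (requires boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   response = ec2.describe_instances(InstanceIds=["<datamasque-instance-id>"])
#   print(datamasque_urls(response))
```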
2. Prepare data source/destination and masking ruleset
- In the DataMasque instance you accessed in step 1.2.d, at the top navigation, select File Masking Dashboard.
- To create a source or destination connection, in the Data Sources pane or Data Destinations pane, choose the + icon. Use the following parameters:
- Connection name: a unique name for the connection on the deployed DataMasque instance.
- Connection type: select AWS S3 Bucket from the dropdown list.
- Base directory: select the target folder in the selected S3 bucket.
- Bucket name: specify the name of the target S3 bucket.
- Use as: determines whether the connection is a Data Source, a Data Destination, or both. The following options are available:
- Select Source for out-of-place masking. DataMasque reads the data to be masked from this connection. If you select this option, you must create a Destination connection separately.
- Select Source & Destination for in-place masking. DataMasque reads from this connection and writes the masked data back to it.
- Select Destination for out-of-place masking. DataMasque writes the masked data to this connection. If you choose this option, you must create a Source connection separately. Related information: File Connections User Guide.
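The relationship between the Use as options and the two masking modes can be summarized in a short sketch. This is an illustrative data model only, not DataMasque’s API: the class, enum, and function names are hypothetical, and the rules encode one reading of the options described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UseAs(Enum):
    SOURCE = "Source"
    DESTINATION = "Destination"
    SOURCE_AND_DESTINATION = "Source & Destination"


@dataclass(frozen=True)
class S3Connection:
    name: str            # unique connection name
    bucket: str          # target S3 bucket
    base_directory: str  # folder inside the bucket
    use_as: UseAs


def masking_mode(source: S3Connection, destination: Optional[S3Connection] = None) -> str:
    """Return 'in-place' or 'out-of-place', or raise ValueError if the pairing is invalid."""
    if destination is None:
        # In-place masking reads from and writes to the same connection.
        if source.use_as is not UseAs.SOURCE_AND_DESTINATION:
            raise ValueError("in-place masking needs a Source & Destination connection")
        return "in-place"
    # Out-of-place masking pairs a Source with a separately created Destination.
    if source.use_as is not UseAs.SOURCE or destination.use_as is not UseAs.DESTINATION:
        raise ValueError("out-of-place masking needs a Source and a Destination connection")
    return "out-of-place"


raw = S3Connection("raw-fhir", "unmasked-bucket", "export/", UseAs.SOURCE)
clean = S3Connection("clean-fhir", "masked-bucket", "masked/", UseAs.DESTINATION)
print(masking_mode(raw, clean))
```

Out-of-place masking keeps the unmasked originals untouched, which is why it requires two connections.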
3. Perform a masking run on your data
- Navigate back to the File Masking Dashboard. To do this, in the top navigation, choose File Masking.
- To review the built-in FHIR Patient masking ruleset, in the Rulesets section, choose the pencil icon next to the fhir_patient_resource ruleset. You can modify the built-in FHIR patient masking template to capture any additional masking requirements. To save changes you make to the ruleset, select Save or Save And Exit. If you don’t make any changes, select Back to Dashboard.
- Select the source connection you configured in step 2.2.
- In the Rulesets section, select the fhir_patient_resource ruleset. In the Data Destinations section, select the destination connection you configured in step 2.2.
- In the bottom right corner, select the Preview Run button.
- On the Confirm Run page, review the information on the Source connection, the ruleset name, and the Data Destination connection. To proceed with the masking run, choose the START RUN button. Your masking run is in progress! The masking duration depends on how many Amazon S3 objects DataMasque is masking.
- When the masking run is complete, the Status shown in the top right corner of Masking Run changes to Finished with a green background color.
You can now import your masked data into Amazon HealthLake. Find detailed instructions for creating an Amazon HealthLake data store and importing files into it in the Amazon HealthLake Developer Guide.
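As an alternative to the console, the import can be started and monitored from code. The following is a minimal sketch, assuming the boto3 `healthlake` operations `start_fhir_import_job` and `describe_fhir_import_job`; the job name, the set of pending states, and the polling intervals are assumptions, and all identifiers in the usage comment are placeholders.

```python
import time
import uuid

# Assumption: any status outside this set is treated as final.
PENDING_STATES = {"SUBMITTED", "QUEUED", "IN_PROGRESS"}


def start_import(client, datastore_id, input_s3_uri, output_s3_uri, kms_key_id, role_arn):
    """Start importing masked FHIR files from S3 and return the job ID."""
    response = client.start_fhir_import_job(
        JobName="masked-fhir-import",  # hypothetical job name
        DatastoreId=datastore_id,
        DataAccessRoleArn=role_arn,
        InputDataConfig={"S3Uri": input_s3_uri},
        JobOutputDataConfig={
            "S3Configuration": {"S3Uri": output_s3_uri, "KmsKeyId": kms_key_id}
        },
        ClientToken=str(uuid.uuid4()),  # idempotency token
    )
    return response["JobId"]


def wait_for_import(client, datastore_id, job_id,
                    poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll the job until it leaves the pending states; return the final status."""
    for _ in range(max_polls):
        props = client.describe_fhir_import_job(
            DatastoreId=datastore_id, JobId=job_id
        )["ImportJobProperties"]
        if props["JobStatus"] not in PENDING_STATES:
            return props["JobStatus"]
        sleep(poll_seconds)
    raise TimeoutError(f"import job {job_id} still pending after {max_polls} polls")


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   client = boto3.client("healthlake")
#   job_id = start_import(client, "<datastore-id>", "s3://<masked-bucket>/masked/",
#                         "s3://<log-bucket>/import-logs/", "<kms-key-arn>",
#                         "<data-access-role-arn>")
#   print(wait_for_import(client, "<datastore-id>", job_id))
```

The input URI should point at the “clean” bucket that DataMasque wrote to, so that only masked data reaches the data store.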
You can now view the masked FHIR data. The following image shows an unmasked and a masked example of FHIR PHI data. The unmasked example shows a data structure that includes Joe, Bloggs, male, a birthdate of 1-9-1964, a city of Haverhill, a state of MA, and a zip code of 10830. The masked example shows a data structure including Mr. John Doe, male, a birthdate of 11-12-1964, a city of Boston, a state of MA, and a zip code of 02108.
In this example, you can see that the given and family names changed, the date of birth changed while still keeping the age intact, and the address changed but is still a valid combination. Other PII and PHI were altered, but the patient’s allergies, medications, encounters, and other health data remained intact. This masked data, although altered, is clinically relevant and safe to use in both development and testing environments. Customers can use this masked data in Amazon SageMaker to build, test, and deploy models, such as predicting patient mortality within 90 days after ICU discharge, or in Amazon QuickSight to create a population health monitoring dashboard.
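To make the behavior described above concrete, here is a toy sketch of the same ideas applied to a FHIR Patient resource. It is not DataMasque’s ruleset or algorithm: the replacement lists, the seeding scheme, and the function name are all hypothetical, and it approximates “keeping the age intact” by keeping the birth year.

```python
import copy
import hashlib
import random
from datetime import date, timedelta

# Hypothetical replacement values; each address is a consistent city/state/ZIP set.
FAKE_NAMES = [("John", "Doe"), ("Jane", "Roe"), ("Alex", "Smith")]
FAKE_ADDRESSES = [
    {"city": "Boston", "state": "MA", "postalCode": "02108"},
    {"city": "Albany", "state": "NY", "postalCode": "12207"},
]


def mask_patient(patient, secret="demo-secret"):
    """Return a masked copy of a Patient resource; the input is left unchanged."""
    masked = copy.deepcopy(patient)
    # Seed from the resource id so the same patient always masks the same way.
    seed = hashlib.sha256(f"{secret}:{patient.get('id', '')}".encode()).hexdigest()
    rng = random.Random(seed)

    given, family = rng.choice(FAKE_NAMES)
    for name in masked.get("name", []):
        name["given"], name["family"] = [given], family
        name.pop("text", None)

    birth = masked.get("birthDate", "")
    if len(birth) == 10:  # only full YYYY-MM-DD dates are shifted
        born = date.fromisoformat(birth)
        start = date(born.year, 1, 1)
        days_in_year = (date(born.year + 1, 1, 1) - start).days
        # Pick a different day in the same year, so the age in years is preserved.
        offset = rng.randrange(days_in_year - 1)
        if offset >= (born - start).days:
            offset += 1
        masked["birthDate"] = (start + timedelta(days=offset)).isoformat()

    replacement = rng.choice(FAKE_ADDRESSES)
    for address in masked.get("address", []):
        address.update(replacement)
        address.pop("line", None)
        address.pop("text", None)
    return masked


sample = {
    "resourceType": "Patient",
    "id": "example-1",
    "name": [{"given": ["Joe"], "family": "Bloggs"}],
    "gender": "male",
    "birthDate": "1964-01-09",
    "address": [{"city": "Haverhill", "state": "MA", "postalCode": "01830"}],
}
print(mask_patient(sample))
```

Fields that the function does not touch, such as gender here or clinical references in other resources, pass through unchanged, which is what preserves the clinical value of the data.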
Cleanup
- Stop the DataMasque EC2 instance when data masking runs are not required.
- Delete the S3 buckets used for masked and unmasked data when you no longer need them.
- For out-of-place masking, you might want to delete your source bucket if it is no longer required.
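The cleanup steps can be scripted as well. The following is a minimal sketch, assuming the boto3 EC2 `stop_instances` operation and the S3 `list_objects_v2` paginator, `delete_objects`, and `delete_bucket` operations. It does not handle versioned buckets, and deletion is irreversible, so treat it as a starting point only; the function names are hypothetical.

```python
def stop_datamasque_instance(ec2_client, instance_id):
    """Stop (not terminate) the instance so it can be started again later."""
    ec2_client.stop_instances(InstanceIds=[instance_id])


def empty_and_delete_bucket(s3_client, bucket):
    """Delete every object in the bucket, then the bucket; return the object count."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:  # each page holds at most 1,000 keys, the delete_objects limit
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    s3_client.delete_bucket(Bucket=bucket)
    return deleted


# Usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   stop_datamasque_instance(boto3.client("ec2"), "<datamasque-instance-id>")
#   empty_and_delete_bucket(boto3.client("s3"), "<masked-bucket>")
```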
Conclusion
In this post, Brian, Snehanshu, and I showed you how to mask healthcare data for regulatory compliance using Amazon HealthLake and DataMasque.
Adhering to de-identification regulatory requirements while ensuring the data remains useful for data consumers requires a specialized toolset. Data masking is an integral part of a healthcare organization’s data security strategy, and high-quality, de-identified data is key to building new solutions that will improve care and healthcare delivery.
Next steps
Explore DataMasque’s solutions, available in AWS Marketplace.