AWS Big Data Blog

Bolster security with role-based access control in HAQM MWAA

HAQM Studios invests in content that drives global growth of HAQM Prime Video and IMDb TV. HAQM Studios has a number of internal-facing applications that aim to streamline end-to-end business processes and information workflows for the entire content creation lifecycle. The HAQM Studios Data Infrastructure (ASDI) is a centralized, curated, and secure data lake that stores data, both in its original form and processed for analysis and machine learning (ML). The centralized ASDI is essential to break down data silos and combine different types of analytics, thereby allowing HAQM Studios to gain valuable insights, guide better business decisions, and innovate using the latest ML concepts.

What are the primary goals for HAQM MWAA adoption?

HAQM Managed Workflows for Apache Airflow (HAQM MWAA) is a fully managed service that makes it easier to run open-source versions of Apache Airflow on AWS. Builders at HAQM.com engineer their HAQM MWAA Directed Acyclic Graphs (DAGs) with two requirements in mind: provisioning a least privilege access model for the underlying services and resources, and restricting the blast radius of any given task.

Apache Airflow connections provide a mechanism for securely accessing resources during DAG execution, but they are intended for coarse-grained access. Implementing fine-grained access requires different mechanisms and code review prior to deployment. Codifying the infrastructure and stitching multiple systems together can also introduce redundant work when implementing fine-grained access patterns in Airflow.

How did HAQM achieve this goal?

The objective is to enforce security for DAGs at the lowest possible granularity: the individual task level. The solution integrates HAQM MWAA task security with AWS Identity and Access Management (IAM) and AWS Security Token Service (AWS STS). The engineers customized the existing Airflow PythonOperator to tightly couple each task's access requirements to a separately deployed IAM role. The customized operator uses AWS STS to assume the associated IAM role, and the resulting temporary session is used within the PythonOperator to access the underlying resources required to run the task.

In this post, we discuss how to strengthen security in HAQM MWAA with role-based access control.

Prerequisites

To implement this solution, complete the following prerequisites:

  1. Create an AWS account with admin access.
  2. Create an HAQM MWAA environment, and note down the execution role ARN associated with it. The ARN is available in the Permissions section of the environment.
  3. Create two HAQM Simple Storage Service (HAQM S3) buckets:
    1. s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/
    2. s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/
  4. Create two IAM roles, one for each of the buckets:
    1. write_access_processed_bucket with the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/*"
        }
    ]
}
    2. write_access_published_bucket with the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/*"
        }
    ]
}
  5. Update the trust relationship for the preceding two roles so that they can be assumed by the HAQM MWAA execution role obtained from the HAQM MWAA environment page:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:sts::<AWS_ACCOUNT_ID>:assumed-role/<MWAA-EXECUTION_ROLE>/HAQMMWAA-airflow"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

In the preceding policies, replace AWS_ACCOUNT_ID, AWS_REGION, and MWAA-EXECUTION_ROLE with your account ID, Region, and HAQM MWAA execution role name, respectively.
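
Because the trust policy only trusts the Airflow session of the execution role, a quick way to validate it is to run an assume-role call from inside an HAQM MWAA task (for example, a temporary PythonOperator). The following is a minimal boto3 sketch:

import boto3

# Must run with the HAQM MWAA execution role's credentials;
# otherwise the trust policy rejects the call.
sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::<AWS_ACCOUNT_ID>:role/write_access_processed_bucket",
    RoleSessionName="trust-check",
)
# Succeeds only if the trust relationship is configured correctly.
print(response["AssumedRoleUser"]["Arn"])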

Run the DAG

The proposed DAG has two tasks, each of which accesses one of the buckets created earlier:

  • Process task – Mocks a transformation in the processed S3 bucket using the Python sleep() function. The last step in this task writes a control file with the current timestamp.
  • Publish task – Performs a similar mock transformation in the published S3 bucket, again using the Python sleep() function. The last step in this task writes a control file with the current timestamp.

The fine-grained access restriction is enforced by a custom implementation of the widely used Airflow PythonOperator. The custom PythonOperator calls AWS STS to assume the task's IAM role and obtain a temporary session, which is used exclusively by the task's callable to access the underlying AWS resources. The following diagram shows the sequence of events.

The source code for the preceding implementation is available in the mwaa-rbac-task GitHub repository.
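
To make the pattern concrete, the following is a condensed sketch of such an operator together with a DAG that binds each task to its own IAM role. The class name, callables, and DAG wiring are illustrative rather than the repository's actual code, and the imports assume Apache Airflow 2.x:

import json
import time
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


class IamRolePythonOperator(PythonOperator):
    """PythonOperator that runs its callable under a task-specific IAM role."""

    def __init__(self, *, role_arn, **kwargs):
        super().__init__(**kwargs)
        self.role_arn = role_arn

    def execute_callable(self):
        # Trade the MWAA execution role's credentials for a short-lived
        # session scoped to this task's IAM role.
        credentials = boto3.client("sts").assume_role(
            RoleArn=self.role_arn,
            RoleSessionName=f"mwaa-task-{self.task_id}",
        )["Credentials"]
        session = boto3.session.Session(
            aws_access_key_id=credentials["AccessKeyId"],
            aws_secret_access_key=credentials["SecretAccessKey"],
            aws_session_token=credentials["SessionToken"],
        )
        # The callable receives only this scoped session, so it can reach
        # nothing beyond what the task's IAM role allows.
        return self.python_callable(*self.op_args, session=session, **self.op_kwargs)


def process(session, **_):
    time.sleep(5)  # mock a transformation step
    body = json.dumps({"processed_dt": datetime.utcnow().strftime("%m/%d/%Y %H:%M:%S")})
    # Succeeds only because write_access_processed_bucket allows s3:PutObject here.
    session.client("s3").put_object(
        Bucket="<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed",
        Key="control_file/processed.json",
        Body=body.encode(),
    )


def publish(session, **_):
    time.sleep(5)  # mock a transformation step
    body = json.dumps({"published_dt": datetime.utcnow().strftime("%m/%d/%Y %H:%M:%S")})
    session.client("s3").put_object(
        Bucket="<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published",
        Key="control_file/published.json",
        Body=body.encode(),
    )


with DAG("mwaa_rbac_demo", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    process_task = IamRolePythonOperator(
        task_id="process",
        python_callable=process,
        role_arn="arn:aws:iam::<AWS_ACCOUNT_ID>:role/write_access_processed_bucket",
    )
    publish_task = IamRolePythonOperator(
        task_id="publish",
        python_callable=publish,
        role_arn="arn:aws:iam::<AWS_ACCOUNT_ID>:role/write_access_published_bucket",
    )
    process_task >> publish_task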

The code base is set up in the following location in HAQM S3, as seen from the HAQM MWAA environment on the HAQM MWAA console.

Run the DAG and monitor its progress, as shown in the following screenshot.

After you run the DAG, the following files are created with timestamps updated:

  • s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/control_file/processed.json 
    	{
    		"processed_dt": "03/05/2021 01:03:58"
    	}
  • s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/control_file/published.json
    	{
    		"published_dt": "03/05/2021 01:04:12"
    	}

The updated timestamps in these control files confirm that each task accessed its bucket using only the permissions granted by its task-specific IAM role.
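
As a complementary check, a cross-bucket write from a task's session should be denied. The following hypothetical snippet assumes the processed role and attempts to write to the published bucket; with the policies above, the expected outcome is an AccessDenied error. Like the earlier trust check, it must run from inside an HAQM MWAA task so the trust policy permits the assume-role call:

import boto3
from botocore.exceptions import ClientError

credentials = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::<AWS_ACCOUNT_ID>:role/write_access_processed_bucket",
    RoleSessionName="negative-test",
)["Credentials"]
session = boto3.session.Session(
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
try:
    # The processed role grants no permissions on the published bucket.
    session.client("s3").put_object(
        Bucket="<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published",
        Key="control_file/denied.json",
        Body=b"{}",
    )
except ClientError as error:
    print(error.response["Error"]["Code"])  # expected: AccessDenied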

Create custom Airflow Operators to support least privilege access

You can extend the demonstrated methodology for enabling fine-grained access using a customized PythonOperator to other Airflow operators and sensors as needed. For more information about how to customize operators, see Creating a custom Operator.
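
For instance, the same assume-role step can be embedded in a custom sensor. The following class is a hypothetical sketch (not part of the repository) that polls for an S3 key under a task-specific role; for the check to succeed, the role would also need read permissions such as s3:GetObject on the target bucket:

import boto3
from airflow.sensors.base import BaseSensorOperator
from botocore.exceptions import ClientError


class IamRoleS3KeySensor(BaseSensorOperator):
    """Sensor that polls for an S3 key under a task-specific IAM role."""

    def __init__(self, *, bucket, key, role_arn, **kwargs):
        super().__init__(**kwargs)
        self.bucket = bucket
        self.key = key
        self.role_arn = role_arn

    def poke(self, context):
        # Assume the task's role on each poke, keeping credentials short-lived.
        credentials = boto3.client("sts").assume_role(
            RoleArn=self.role_arn,
            RoleSessionName=f"mwaa-task-{self.task_id}",
        )["Credentials"]
        s3 = boto3.session.Session(
            aws_access_key_id=credentials["AccessKeyId"],
            aws_secret_access_key=credentials["SecretAccessKey"],
            aws_session_token=credentials["SessionToken"],
        ).client("s3")
        try:
            s3.head_object(Bucket=self.bucket, Key=self.key)
            return True
        except ClientError:
            return False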

Conclusion

In this post, we presented a solution to bolster security in HAQM MWAA with role-based access control. You can extend the concept to other Airflow operators to enhance workflow security at the task level. In addition, using the AWS Cloud Development Kit (AWS CDK) can make provisioning the HAQM MWAA environment and the fine-grained IAM task roles seamless. We look forward to sharing more about fine-grained access patterns for Airflow tasks in a future post.
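
As an illustration, provisioning one of the task roles with AWS CDK v2 in Python might look like the following sketch; the stack and construct names are hypothetical:

from aws_cdk import Stack
from aws_cdk import aws_iam as iam
from constructs import Construct


class MwaaTaskRolesStack(Stack):
    def __init__(self, scope: Construct, construct_id: str,
                 mwaa_execution_role_arn: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        processed_role = iam.Role(
            self, "WriteAccessProcessedBucket",
            role_name="write_access_processed_bucket",
            # Trust only the HAQM MWAA execution role, mirroring the trust policy above.
            assumed_by=iam.ArnPrincipal(mwaa_execution_role_arn),
        )
        processed_role.add_to_policy(iam.PolicyStatement(
            actions=["s3:PutObject", "s3:DeleteObject"],
            resources=[f"arn:aws:s3:::{self.account}-{self.region}-mwaa-processed/*"],
        ))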


About the Authors

Kishan Desai is a Data Engineer at HAQM Studios building a data platform to support the content creation process. He is passionate about building flexible and modular systems on AWS using serverless paradigms. Outside of work, Kishan enjoys learning new technologies, watching sports, experiencing SoCal’s great food, and spending quality time with friends and family.

Virendhar (Viru) Sivaraman is a strategic Big Data & Analytics Architect with HAQM Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking, and mountain biking.