AWS Big Data Blog

Access private code repositories for installing Python dependencies on HAQM MWAA

Customers who use HAQM Managed Workflows for Apache Airflow (HAQM MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for the public network access mode for the web server because it's easy to use and allows outbound internet requests, all while maintaining secure access. However, private code repositories may not be reachable from the internet. It's also a best practice to install Python dependencies only where they are needed. You can use HAQM MWAA startup scripts to selectively install the Python dependencies required for running code on workers, while avoiding issues caused by web server restrictions.

This post demonstrates a method to selectively install Python dependencies based on the HAQM MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).

Solution overview

This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to HAQM MWAA with AWS CodeArtifact for Python dependencies.

The HAQM MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, schedulers, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment like the one shown in the following diagram: the web servers use public network access mode, and the Apache Airflow workers and schedulers don't have a route to the internet from your VPC.

[Diagram: HAQM MWAA architecture with public web server access mode and private routing for workers and schedulers]

There are four possible networking configurations for an HAQM MWAA environment:

  • Public routing and public web server access mode
  • Private routing and public web server access mode (pictured in the preceding diagram)
  • Public routing and private web server access mode
  • Private routing and private web server access mode

We focus on one networking configuration in this post, but the fundamental concepts apply to any of them.

The solution we walk through relies on the fact that HAQM MWAA runs a startup script (startup.sh) during startup on every individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.

The following steps allow us to accomplish this:

  1. Create and install the startup script (startup.sh) in the HAQM MWAA environment. This script sets the environment variable for selectively installing dependencies.
  2. Create and install global Python dependencies (requirements.txt) in the HAQM MWAA environment. This file contains the global dependencies required by all HAQM MWAA components.
  3. Create and install component-specific Python dependencies in the HAQM MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the necessary dependencies.
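
When all three steps are complete, the S3 bucket for your environment will contain a layout roughly like the following (a sketch; the file names match the examples in this post, but your bucket name will differ):

s3://[mwaa-environment-bucket]/
  startup.sh              # sets EXTENDED_REQUIREMENTS per component type
  requirements.txt        # global dependencies; references ${EXTENDED_REQUIREMENTS}
  dags/
    webserver_reqs.txt    # web server-only dependencies
    worker_reqs.txt       # worker-only dependencies
    scheduler_reqs.txt    # scheduler-only dependencies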

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An HAQM MWAA environment deployed with public access mode for the web server
  • Versioning enabled for your HAQM MWAA environment’s HAQM Simple Storage Service (HAQM S3) bucket (see the example command after this list)
  • HAQM CloudWatch logging enabled at the INFO level for worker and web server
  • A Git repository accessible from within your VPC
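
If versioning isn’t already enabled on the bucket, the following AWS CLI command turns it on (substitute your bucket name):

aws s3api put-bucket-versioning --bucket [mwaa-environment-bucket] --versioning-configuration Status=Enabled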

Additionally, we upload a sample Python package to the Git repository:

# Clone the public Scrapy source from GitHub
git clone https://github.com/scrapy/scrapy
# Clone the private repository that is reachable from your VPC
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy scrapylocal
# Remove the GitHub Git metadata and copy the source into the private repository
rm -rf ./scrapy/.git*
cp -r ./scrapy/* ./scrapylocal
cd scrapylocal
git add --all
git commit -m "first commit"
git push

Create and install the startup script in the HAQM MWAA environment

Create the startup.sh file using the following example code:

#!/bin/sh

# HAQM MWAA sets MWAA_AIRFLOW_COMPONENT to worker, scheduler, or webserver
echo "Printing Apache Airflow component"
echo "${MWAA_AIRFLOW_COMPONENT}"

# Install OS-level packages only where they are needed (not on the web server)
if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
    sudo yum -y install libaio
fi

# Point EXTENDED_REQUIREMENTS at the component-specific requirements file;
# requirements.txt references this variable to selectively install dependencies
if [[ "${MWAA_AIRFLOW_COMPONENT}" == "webserver" ]]
then
    echo "Setting extended python requirements for webservers"
    export EXTENDED_REQUIREMENTS="webserver_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "worker" ]]
then
    echo "Setting extended python requirements for workers"
    export EXTENDED_REQUIREMENTS="worker_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "scheduler" ]]
then
    echo "Setting extended python requirements for schedulers"
    export EXTENDED_REQUIREMENTS="scheduler_reqs.txt"
fi

Upload startup.sh to the S3 bucket for your HAQM MWAA environment, then update the environment to use it:

aws s3 cp startup.sh s3://[mwaa-environment-bucket]
aws mwaa update-environment --name [mwaa-environment-name] --startup-script-s3-path startup.sh
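
The update takes several minutes to apply. You can optionally poll the environment status from the AWS CLI while it completes (substitute your environment name):

aws mwaa get-environment --name [mwaa-environment-name] --query 'Environment.Status' --output text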

Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.
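
If you prefer the CLI, you can also search the worker log group for the startup script output. The following example assumes the default airflow-[mwaa-environment-name]-Worker log group naming:

aws logs filter-log-events --log-group-name airflow-[mwaa-environment-name]-Worker --filter-pattern '"Printing Apache Airflow component"'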

[Screenshot: worker_console log showing the startup script running and setting the environment variable]

Create and install global Python dependencies in the HAQM MWAA environment

Your requirements file must include a --constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.

The following code is an example of the requirements.txt file:

--constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
-r /usr/local/airflow/dags/${EXTENDED_REQUIREMENTS}

Upload the requirements.txt file to the HAQM MWAA environment S3 bucket:

aws s3 cp requirements.txt s3://[mwaa-environment-bucket]

Create and install component-specific Python dependencies in the HAQM MWAA environment

For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install prettyprint on the web server from the public Python Package Index (PyPI). To accomplish that, we create the following files (example contents shown):

  • webserver_reqs.txt:
prettyprint
  • worker_reqs.txt:
git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy
  • scheduler_reqs.txt:
git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy
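
Optionally, the Git-based requirements above can pin a specific branch, tag, or commit by adding @[ref] before the #egg fragment, which makes installs repeatable. The ref placeholder here is hypothetical:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy@[ref]#egg=scrapy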

Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the HAQM MWAA environment:

aws s3 cp webserver_reqs.txt s3://[mwaa-environment-bucket]/dags/
aws s3 cp scheduler_reqs.txt s3://[mwaa-environment-bucket]/dags/
aws s3 cp worker_reqs.txt s3://[mwaa-environment-bucket]/dags/
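
You can confirm the three files landed in the DAGs folder with a quick listing:

aws s3 ls s3://[mwaa-environment-bucket]/dags/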

Update the environment for the new requirements file and observe the results

Get the latest object version for the requirements file:

aws s3api list-object-versions --bucket [mwaa-environment-bucket] --prefix requirements.txt
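
If you want to capture the latest version ID directly into a shell variable for the next command, a JMESPath query such as the following can help (the variable name is arbitrary):

REQS_VERSION=$(aws s3api list-object-versions --bucket [mwaa-environment-bucket] --prefix requirements.txt --query 'Versions[?IsLatest].VersionId' --output text)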

Update the HAQM MWAA environment to use the new requirements.txt file:

aws mwaa update-environment --name [mwaa-environment-name] --requirements-s3-path requirements.txt --requirements-s3-object-version [s3-object-version]

Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice that the component-specific requirements file is now installed, with the scrapy package pulled from your private Git repository.

[Screenshot: requirements_install log showing the component-specific requirements file being installed]

[Screenshot: requirements_install log showing scrapy installed from the private Git repository]

Conclusion

In this post, we demonstrated a method to selectively install Python dependencies based on the HAQM MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.

We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an HAQM MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.


About the Author

Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting on the Oklahoma State Cowboys. Go Pokes!