AWS Big Data Blog
Access private code repositories for installing Python dependencies on Amazon MWAA
Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for public network access mode for its ease of use and ability to make outbound internet requests, all while maintaining secure access. However, private code repositories may not be accessible via the internet. It’s also a best practice to install Python dependencies only where they are needed. You can use Amazon MWAA startup scripts to selectively install the Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.
This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).
Solution overview
This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.
The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, scheduler, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment is using public network access mode for the web servers, and the Apache Airflow workers and schedulers don’t have a route to the internet from your VPC.
There are four possible networking configurations for an Amazon MWAA environment:
- Public routing and public web server access mode
- Private routing and public web server access mode (pictured in the preceding diagram)
- Public routing and private web server access mode
- Private routing and private web server access mode
We focus on one networking configuration for this post, but the fundamental concepts are applicable for any networking configuration.
The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on every individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.
The following steps allow us to accomplish this:
- Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
- Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
- Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the necessary dependencies.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- An Amazon MWAA environment deployed with public access mode for the web server
- Versioning enabled for your Amazon MWAA environment’s Amazon Simple Storage Service (Amazon S3) bucket
- Amazon CloudWatch logging enabled at the INFO level for worker and web server
- A Git repository accessible from within your VPC
Additionally, we upload a sample Python package to the Git repository:
Create and install the startup script in the Amazon MWAA environment
Create the startup.sh file using the following example code:
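The original listing is not reproduced here, so the following is only a minimal sketch of what such a startup script could look like. It relies on the MWAA_AIRFLOW_COMPONENT variable that Amazon MWAA sets to scheduler, worker, or webserver. The variable name EXTENDED_REQUIREMENTS, the function name, and the fallback for an unrecognized component are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of startup.sh: choose a requirements file per Airflow component.

# Map a component name to its requirements file name.
select_requirements() {
    case "$1" in
        webserver) echo "webserver_reqs.txt" ;;
        scheduler) echo "scheduler_reqs.txt" ;;
        worker)    echo "worker_reqs.txt" ;;
        *)
            # Assumption: fall back to the web server file so the
            # variable is never empty when pip expands it.
            echo "Unknown component '$1', using webserver_reqs.txt" >&2
            echo "webserver_reqs.txt"
            ;;
    esac
}

# EXTENDED_REQUIREMENTS is a hypothetical name; requirements.txt must
# reference the same name.
EXTENDED_REQUIREMENTS="$(select_requirements "${MWAA_AIRFLOW_COMPONENT:-}")"
export EXTENDED_REQUIREMENTS

echo "Airflow component: ${MWAA_AIRFLOW_COMPONENT:-unknown}"
echo "Component requirements file: ${EXTENDED_REQUIREMENTS}"
```

The echo statements are there so the selection shows up in the component logs, which makes it easier to confirm that the script ran.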
Upload startup.sh to the S3 bucket for your Amazon MWAA environment:
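One way to do the upload with the AWS CLI is sketched below. The bucket name and environment name are placeholders, and the small run helper only prints each command unless DRY_RUN=0 is set, so you can review the commands first. The second command applies only if the environment does not already point at the script.

```shell
#!/bin/sh
# Placeholder names (assumptions); replace with your own values.
MWAA_BUCKET="${MWAA_BUCKET:-my-mwaa-bucket}"
MWAA_ENV="${MWAA_ENV:-my-mwaa-environment}"

# Print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy the startup script to the root of the environment bucket.
run aws s3 cp startup.sh "s3://${MWAA_BUCKET}/startup.sh"

# Point the environment at the script (path is relative to the bucket).
run aws mwaa update-environment --name "${MWAA_ENV}" \
    --startup-script-s3-path startup.sh
```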
Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.
Create and install global Python dependencies in the Amazon MWAA environment
Your requirements file must include a --constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.
The following code is an example of the requirements.txt file:
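The original example is not reproduced here. As a hedged sketch, the file could be generated as follows. The Apache Airflow and Python versions in the constraints URL are assumptions, as is the variable name EXTENDED_REQUIREMENTS (it must match whatever your startup.sh exports). The path /usr/local/airflow/dags is where Amazon MWAA places the contents of the DAGs folder. If your components have no route to the internet, the constraints file would need to come from a location reachable inside your VPC instead.

```shell
#!/bin/sh
# Write requirements.txt. The quoted heredoc keeps ${EXTENDED_REQUIREMENTS}
# literal, so pip expands it at install time on each component.
cat > requirements.txt <<'EOF'
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
-r /usr/local/airflow/dags/${EXTENDED_REQUIREMENTS}
EOF

cat requirements.txt
```

pip expands environment variables in requirements files only when they are written in the ${NAME} form, which is why the braces matter here.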
Upload the requirements.txt file to the Amazon MWAA environment S3 bucket:
Create and install component-specific Python dependencies in the Amazon MWAA environment
For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python package index. To accomplish that, we need to create the following files (we provide example code):
webserver_reqs.txt:
worker_reqs.txt:
scheduler_reqs.txt:
Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:
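As an illustration only, the three files could be created like this. The Git URL and branch are hypothetical stand-ins for your private repository, and the files are left unpinned for brevity.

```shell
#!/bin/sh
# Web server: pprintpp from the public Python package index.
cat > webserver_reqs.txt <<'EOF'
pprintpp
EOF

# Workers: scrapy from a private Git repository (hypothetical URL and ref).
cat > worker_reqs.txt <<'EOF'
scrapy @ git+https://git.example.com/python/scrapy.git@main
EOF

# Schedulers install the same package as workers in this example.
cp worker_reqs.txt scheduler_reqs.txt
```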
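A sketch of the upload with the AWS CLI follows. The bucket name and the dags/ prefix are assumptions; use the DAGs folder path configured for your environment. As written, the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder bucket name (assumption); replace with your own.
MWAA_BUCKET="${MWAA_BUCKET:-my-mwaa-bucket}"

# Print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Copy one requirements file into the DAGs folder (assumed prefix: dags/).
upload_to_dags() {
    run aws s3 cp "$1" "s3://${MWAA_BUCKET}/dags/$1"
}

for f in webserver_reqs.txt scheduler_reqs.txt worker_reqs.txt; do
    upload_to_dags "$f"
done
```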
Update the environment for the new requirements file and observe the results
Get the latest object version for the requirements file:
Update the Amazon MWAA environment to use the new requirements.txt file:
Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice the startup script is now running and setting the environment variable.
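These two steps can be sketched with the AWS CLI as follows. The bucket and environment names are placeholders, and requirements.txt is assumed to sit at the root of the bucket. In the default dry-run mode no AWS call is made and a placeholder version ID is used, so the update command can be reviewed before running it with DRY_RUN=0.

```shell
#!/bin/sh
# Placeholder names (assumptions); replace with your own values.
MWAA_BUCKET="${MWAA_BUCKET:-my-mwaa-bucket}"
MWAA_ENV="${MWAA_ENV:-my-mwaa-environment}"

# Print the command by default; set DRY_RUN=0 to execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Look up the newest version ID of requirements.txt in the versioned bucket.
latest_requirements_version() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "EXAMPLE_VERSION_ID"   # placeholder; no AWS call in dry-run mode
    else
        aws s3api list-object-versions \
            --bucket "${MWAA_BUCKET}" --prefix requirements.txt \
            --query 'Versions[?IsLatest].VersionId | [0]' --output text
    fi
}

VERSION="$(latest_requirements_version)"

# Tell the environment which requirements file and version to install.
run aws mwaa update-environment --name "${MWAA_ENV}" \
    --requirements-s3-path requirements.txt \
    --requirements-s3-object-version "${VERSION}"
```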
Conclusion
In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.
We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.
About the Author
Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting on the Oklahoma State Cowboys. Go Pokes!