AWS Machine Learning Blog

Speed up training on HAQM SageMaker using HAQM FSx for Lustre and HAQM EFS file systems

April 2021 – The HAQM FSx section of this post has been updated to cover changes introduced to mount point names with scratch_2 and persistent_1 deployment options.

HAQM SageMaker provides a fully managed service for data science and machine learning workflows. One of the most important capabilities of HAQM SageMaker is its ability to run fully managed training jobs to train machine learning models.

Now, you can speed up your training job runs by training machine learning models from data stored in HAQM FSx for Lustre or HAQM Elastic File System (HAQM EFS). HAQM FSx for Lustre provides a high-performance file system natively integrated with HAQM Simple Storage Service (S3) and optimized for workloads such as machine learning, analytics, and high-performance computing. HAQM EFS provides a simple, scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources.

Training machine learning models requires providing the training datasets to the training job. Until now, when using HAQM S3 as the training data source in File input mode, all training data had to be downloaded from HAQM S3 to the EBS volumes attached to the training instances at the start of the training job. A distributed file system such as HAQM FSx for Lustre or HAQM EFS can speed up machine learning training by eliminating the need for this download step.

In this blog post, we go over the benefits of training your models using a file system, provide information to help you choose a file system, and show you how to get started.

Choosing a file system for training models on HAQM SageMaker

When deciding whether to train your machine learning models from a file system, the first question to ask is: where does your training data reside now?

If your training data is already in HAQM S3 and you don't need faster training times for your training jobs, you can get started with HAQM SageMaker with no data movement required. However, if you need faster startup and training times, we recommend that you take advantage of the HAQM FSx for Lustre file system, which is natively integrated with HAQM S3.

HAQM FSx for Lustre speeds up your training jobs by serving your HAQM S3 data to HAQM SageMaker at high speeds. The first time you run a training job, HAQM FSx for Lustre automatically copies data from HAQM S3 and makes it available to HAQM SageMaker. Additionally, the same HAQM FSx file system can be used for subsequent iterations of training jobs on HAQM SageMaker, preventing repeated downloads of common HAQM S3 objects. Because of this, HAQM FSx has the most benefit to training jobs that have training sets in HAQM S3, and in workflows where training jobs must be run several times using different training algorithms or parameters to see which gives the best result.

If your training data is already in an HAQM EFS file system, we recommend choosing HAQM EFS as the file system data source. This choice has the benefit of directly launching your training jobs from the data in HAQM EFS with no data movement required, resulting in faster training start times. This is often the case in environments where data scientists have home directories in HAQM EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with which fields or labels to include. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.

Getting started with HAQM FSx for training on HAQM SageMaker

  1. Note your training data HAQM S3 bucket and path.
  2. Create an HAQM FSx file system with the desired size. Expand the Data repository integration. Choose HAQM S3 for Data repository type and specify the Import bucket and Import prefix corresponding to your HAQM S3 training data.
  3. Once the file system is created, note its file system ID and mount name.
  4. Go to the HAQM SageMaker console and open the Training jobs page, where you create the training job, associate VPC subnets and security groups, and provide the file system as the data source for training.
  5. Create your training job:
    1. Provide the ARN for the IAM role with the required access control and permissions policy. Refer to HAQMSageMakerFullAccess for details.
    2. Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow Lustre traffic over port 988 to control access to the training dataset stored in the file system. For more details, refer to Getting started with HAQM FSx.
    3. Choose file system as the data source and reference your file system ID, file system type, and directory path. Note that the directory path begins with the mount name of the file system.
  6. Launch your training job.
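Under the hood, the console configuration in step 5c corresponds to a `FileSystemDataSource` block in the input channel passed to SageMaker's `CreateTrainingJob` API. The sketch below builds that channel in Python; the file system ID, mount name, and dataset path are placeholder values for illustration.

```python
# Sketch of the input channel SageMaker's CreateTrainingJob API expects
# when an FSx for Lustre file system is the training data source.
# The file system ID, mount name, and dataset path are placeholders.

def fsx_lustre_channel(file_system_id, mount_name, dataset_path,
                       channel_name="training"):
    """Build an InputDataConfig entry for an FSx for Lustre data source.

    The directory path must begin with the file system's mount name.
    """
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",  # training jobs read the data
                # DirectoryPath starts with the mount name
                "DirectoryPath": f"/{mount_name}/{dataset_path.lstrip('/')}",
            }
        },
    }

channel = fsx_lustre_channel("fs-0123456789abcdef0", "fsx", "train-data")
print(channel["DataSource"]["FileSystemDataSource"]["DirectoryPath"])
# → /fsx/train-data
```

If you use the SageMaker Python SDK rather than the raw API, the `sagemaker.inputs.FileSystemInput` class builds an equivalent structure that you can pass to an estimator's `fit` call.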

Getting started with HAQM EFS for training on HAQM SageMaker

  1. Put your training data in its own directory in HAQM EFS.
  2. Go to the HAQM SageMaker console and open the Training jobs page, where you create the training job, associate VPC subnets and security groups, and provide the file system as the data source for training.
  3. Create your training job:
    1. Provide the ARN of the IAM role with the required access control and permissions policy.
    2. Specify a VPC that your training jobs and file system have access to. Also, verify that your security groups allow NFS traffic over port 2049 to control access to the training dataset stored in the file system.
    3. Choose file system as the data source and reference your file system ID, path, and format.
  4. Launch your training job.
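The EFS configuration in step 3c maps to the same `FileSystemDataSource` structure in the `CreateTrainingJob` API, with the file system type set to `EFS` and the directory path pointing at the dataset's directory in the file system. The sketch below uses placeholder values for the file system ID and path.

```python
# Sketch of the input channel for an HAQM EFS training data source,
# mirroring what step 3c configures in the console.
# The file system ID and directory path are placeholders.

def efs_channel(file_system_id, directory_path, channel_name="training"):
    """Build an InputDataConfig entry for an HAQM EFS data source."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "EFS",
                "FileSystemAccessMode": "ro",  # read-only access to the dataset
                # The directory in EFS that holds the training data
                "DirectoryPath": directory_path,
            }
        },
    }

channel = efs_channel("fs-abcdef0123456789a", "/training-data")
```

Because the data is read in place over NFS, no copy step is needed; this is why the security groups must allow NFS traffic on port 2049 between the training instances and the file system.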

After your training job completes, you can view the status history of the training job to observe the faster download time when using a file system data source.

Summary

With the addition of HAQM FSx for Lustre and HAQM EFS as data sources for training machine learning models in HAQM SageMaker, you now have greater flexibility to choose a data source that is suited to your use case. In this blog post, we used a file system data source to train machine learning models, resulting in faster training start times by eliminating the data download step.

To learn more, start training machine learning models yourself on HAQM SageMaker, or refer to our sample notebook to train a linear learner model using a file system data source.

About the Authors

Vidhi Kastuar is a Sr. Product Manager for HAQM SageMaker, focusing on making machine learning and artificial intelligence simple, easy to use and scalable for all users and businesses. Prior to AWS, Vidhi was Director of Product Management at Veritas Technologies. For fun outside work, Vidhi loves to sketch and paint, work as a career coach, and spend time with his family and friends.


Will Ochandarena is a Principal Product Manager on the HAQM Elastic File System team, focusing on helping customers use EFS to modernize their application architectures. Prior to AWS, Will was Senior Director of Product Management at MapR.


Tushar Saxena is a Principal Product Manager at HAQM, with the mission to grow AWS’ file storage business. Prior to HAQM, he led telecom infrastructure business units at two companies, and played a central role in launching Verizon’s fiber broadband service. He started his career as a researcher at GE R&D and BBN, working in computer vision, Internet networks, and video streaming.