Posted On: Mar 13, 2024
The Amazon S3 Connector for PyTorch now supports saving PyTorch Lightning model checkpoints directly to Amazon S3, lowering the cost and improving the performance of your machine learning training jobs. PyTorch Lightning is an open source framework that provides a high-level interface for training with PyTorch. The Amazon S3 Connector for PyTorch automatically optimizes S3 requests to improve data loading and checkpoint performance for your training workloads. Saving PyTorch Lightning model checkpoints with the Amazon S3 Connector for PyTorch is up to 40% faster than writing them to Amazon EC2 instance storage.
The Amazon S3 Connector for PyTorch delivers a new implementation of PyTorch Lightning's checkpoint primitive that you can use to save machine learning model checkpoints directly to Amazon S3. Model checkpointing typically requires pausing training jobs, so the time needed to save a checkpoint directly affects overall training time. With this integration, you can save, load, and delete checkpoints in Amazon S3 directly from your PyTorch Lightning training jobs.
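As a rough illustration of how the integration might be wired into a training job, the sketch below passes the connector's checkpoint plugin to a Lightning Trainer and points the Trainer's root directory at an S3 prefix. It assumes the `s3torchconnector` package (with its Lightning integration) and `lightning` are installed; the helper names `checkpoint_root` and `build_trainer`, the bucket name, the prefix, and the region are hypothetical placeholders, and you should confirm the exact class and arguments against the project's GitHub page.

```python
def checkpoint_root(bucket: str, prefix: str = "") -> str:
    """Build an s3:// URI (with trailing slash) for a checkpoint directory.

    Hypothetical helper for this example; it is not part of the connector.
    """
    cleaned = prefix.strip("/")
    return f"s3://{bucket}/{cleaned}/" if cleaned else f"s3://{bucket}/"


def build_trainer(region: str, root_uri: str, max_epochs: int = 1):
    """Create a Lightning Trainer that reads and writes checkpoints in S3.

    Imports are deferred so this module can be imported even when the
    optional packages are not installed.
    """
    from lightning import Trainer
    from s3torchconnector.lightning import S3LightningCheckpoint

    # The plugin replaces Lightning's default checkpoint I/O, so save,
    # load, and delete operations go to S3 instead of local storage.
    s3_checkpoint_io = S3LightningCheckpoint(region)
    return Trainer(
        plugins=[s3_checkpoint_io],
        default_root_dir=root_uri,  # checkpoints land under this S3 prefix
        max_epochs=max_epochs,
    )


if __name__ == "__main__":
    # Placeholder bucket, prefix, and region: replace with your own.
    uri = checkpoint_root("my-bucket", "lightning-checkpoints")
    print(f"Checkpoints would be written under {uri}")
    # trainer = build_trainer("us-east-1", uri)
    # trainer.fit(model, dataloader)  # model and dataloader are yours to supply
```

Running the real training call requires AWS credentials with access to the bucket. Keeping the Trainer construction in its own function lets the S3 destination be changed without touching the rest of the training script.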
The Amazon S3 Connector for PyTorch is an open source project. To get started, visit the GitHub page.