AWS HPC Blog
Scale Reinforcement Learning with AWS Batch Multi-Node Parallel Jobs
Autonomous robots are becoming common in warehouse logistics, industrial manufacturing, and assembly. Developing robots requires simulation, along with techniques such as reinforcement learning (RL). Scaling RL with AWS Batch multi-node parallel (MNP) jobs accelerates robotics development.
Autonomous robot applications are also expanding into diverse fields such as oceanography, mining, and space exploration. The complexity of these environments and tasks requires simulation, along with techniques like reinforcement learning (RL) and imitation learning, also known as learning from demonstrations (LfD), to train robots to execute sophisticated behaviors. Technologies like NVIDIA Isaac Lab enable such simulation.
However, the time and effort required to set up environments for this type of simulation and training poses a significant challenge. Moreover, as the complexity of environments and tasks increases, successful training requires more simulation and training steps for tasks such as movement, obstacle avoidance, and object manipulation.
Running NVIDIA’s simulation and training technology on AWS Batch multi-node parallel (MNP) infrastructure accelerates robotics development programs by providing a repeatable deployment, removing the barriers to fast, cost-effective robot training for complex tasks. In this blog, we will explore a few NVIDIA Isaac Lab training examples for robots, detail the architecture for executing them, and provide a link to a functioning example that you can run in your own AWS account.
NVIDIA Isaac Lab Learning Examples
In the first example (illustrated in Figure 1), we explore the task of training a cartpole. This task involves an inverted pendulum mounted on a robotic cart that can move along a wire. The objective is to keep the pole balanced in an upright position. Although it might appear unconventional, balancing tasks are widespread in bipedal and bi-wheeled robots.

Figure 1 – A video of using NVIDIA Isaac Lab to train cartpoles to balance upright with RL and simulation.
A single GPU is effective for executing this RL training. The GPU’s inherent parallelism allows for the simultaneous operation of multiple cartpole instances, updating the trained model based on the collective results from all instances at each training iteration. The training process leverages NVIDIA Isaac Lab, which in turn uses RL Games and PyTorch.
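For reference, here is a minimal sketch of launching this single-GPU cartpole run from Python. The training script path, task ID, and flags are taken from the public Isaac Lab repository and can differ between Isaac Lab releases, so treat them as assumptions to verify against your own checkout.

```python
# Minimal sketch: launch single-GPU cartpole training through Isaac Lab's
# rl_games workflow. Script path, task ID, and flags are assumptions based on
# the Isaac Lab repository and may differ between releases.
import subprocess

TRAIN_SCRIPT = "source/standalone/workflows/rl_games/train.py"  # assumed path

subprocess.run(
    [
        "./isaaclab.sh", "-p", TRAIN_SCRIPT,  # run the script in Isaac Lab's Python env
        "--task", "Isaac-Cartpole-v0",        # cartpole balancing task
        "--headless",                         # no GUI; needed for remote/batch runs
        "--num_envs", "4096",                 # many cartpole instances in parallel on one GPU
        "--max_iterations", "100",            # ~100 epochs suffices for this simple task
    ],
    check=True,
)
```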
The next example involves training a robotic arm to open a drawer. This scenario is similar to training robots for pick-and-place operations in an industrial environment. The simulation (illustrated in Figure 2) features an array of robotic arms training concurrently, leveraging the same stack and methodology as the previous example, but it requires significantly more simulation and takes longer to train.

Figure 2 – Training robot arms to open a drawer.
The final example (illustrated in Figure 3) is the most complex. It involves training a humanoid robot to learn locomotion—not merely on flat surfaces, but across varied and uneven terrain. This sophisticated task demands the most extensive training and consequently requires the longest duration on a single GPU system.

Figure 3 – Training humanoid robots to walk on rough terrain.
Improving the performance of the analyses
The cartpole example is simple and requires only 100 epochs for training. It executes within two to three minutes on a single-GPU system, such as an Amazon Elastic Compute Cloud (Amazon EC2) G6e instance. The more intricate robot arm example requires 1,500 epochs and approximately 10-15 minutes to train on the same system. The humanoid example, which is even more complex, requires tens of thousands of epochs and several hours for training. To expedite these workloads, we can employ multiple GPUs to parallelize the training process. There are two strategies for this: first, we can increase the number of GPUs within a single system; second, we can scale out horizontally by running multiple instances in parallel using AWS Batch MNP.
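As a rough sketch of the first strategy, the snippet below launches the same kind of Isaac Lab training under PyTorch's distributed launcher so that each GPU in one instance runs its own set of simulated environments. The script path, task ID, and the `--distributed` flag follow the Isaac Lab multi-GPU training documentation, but they may vary by release, so treat them as assumptions.

```python
# Sketch of strategy 1: parallelize across the GPUs of a single instance with
# torch.distributed.run. The Isaac Lab script path, task ID, and --distributed
# flag are assumptions based on the Isaac Lab docs; verify against your version.
import subprocess

NUM_GPUS = 4  # e.g. a multi-GPU G6e size (hypothetical choice)

subprocess.run(
    [
        "python", "-m", "torch.distributed.run",
        "--nnodes=1",                        # a single machine
        f"--nproc_per_node={NUM_GPUS}",      # one training process per GPU
        "source/standalone/workflows/rl_games/train.py",  # assumed path
        "--task", "Isaac-Humanoid-v0",       # example task ID (assumed)
        "--headless",
        "--distributed",                     # shard environments across ranks (assumed flag)
    ],
    check=True,
)
```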
NVIDIA Isaac Lab on AWS Batch with multi-node parallel jobs
Figure 4 illustrates the architecture for deploying NVIDIA Isaac Lab on AWS Batch. Initially, the user builds and tests a custom container with the NVIDIA Isaac Sim base image and the NVIDIA Isaac Lab repository. The user can do this on an Amazon EC2 instance with Amazon DCV for remote desktop display, for example. The user then uploads the container to Amazon Elastic Container Registry (Amazon ECR). Following this, the user initiates an AWS Batch MNP job. This job automatically provisions compute and networking resources for the desired number of nodes in the MNP cluster, and NVIDIA Isaac Lab orchestrates internode communications. Amazon Elastic File System (Amazon EFS) provides durable storage between batch runs; the main node aggregates training updates from across the MNP cluster and persists them to Amazon EFS, including checkpoints for the trained behavior models as well as logs. The user can then perform post-training analysis and evaluation of this data in another AWS Batch job or in other environments such as Amazon EC2.

Figure 4 – Architecture diagram for NVIDIA Isaac Lab on AWS Batch
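To make the orchestration concrete, here is a minimal, hypothetical sketch of a container entrypoint that reads the environment variables AWS Batch sets for MNP jobs and maps them onto PyTorch's distributed launcher, so that every node joins the training run hosted by the main node. The Isaac Lab script path, task ID, and flags remain assumptions, as before.

```python
# Sketch of an MNP-aware entrypoint: map the environment variables AWS Batch
# sets for multi-node parallel jobs onto torch.distributed.run arguments so
# all nodes join the same training job. Script path and flags are assumptions.
import os
import socket
import subprocess

node_index = os.environ["AWS_BATCH_JOB_NODE_INDEX"]       # this node's index in the MNP group
main_index = os.environ["AWS_BATCH_JOB_MAIN_NODE_INDEX"]  # index of the main node
num_nodes = os.environ["AWS_BATCH_JOB_NUM_NODES"]         # total nodes in the job
is_main = node_index == main_index

# Child nodes get the main node's private IP from AWS Batch; the main node
# resolves its own address (simplified for this sketch).
master_addr = (
    socket.gethostbyname(socket.gethostname()) if is_main
    else os.environ["AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS"]
)

subprocess.run(
    [
        "python", "-m", "torch.distributed.run",
        f"--nnodes={num_nodes}",
        f"--node_rank={node_index}",
        "--nproc_per_node=1",                # one GPU per node in this sketch
        f"--master_addr={master_addr}",
        "--master_port=29500",               # any port agreed across nodes
        "source/standalone/workflows/rl_games/train.py",  # assumed path
        "--task", "Isaac-Humanoid-v0",       # example task ID (assumed)
        "--headless", "--distributed",
    ],
    check=True,
)
```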
Steps for running NVIDIA Isaac Lab on AWS Batch
The NVIDIA Isaac Lab on AWS workshop provides detailed, step-by-step instructions for running NVIDIA Isaac Lab in an AWS Batch MNP environment to train a humanoid robot. We summarize the steps briefly below:
- Provision the cloud infrastructure – Use an AWS CloudFormation template that references a Dockerfile encapsulating your simulation and training environment in a container. This automates a repeatable environment setup you can share across teams to provide standardization and save setup time. It provides everything for a complete simulation and training environment: persistent Amazon EFS storage, a NAT gateway to provide secure Internet access to the Isaac Lab container, an EC2 build and test instance, launch templates for the AWS Batch instances, security groups, IAM roles, and VPC configuration.
- Review and validate the container – First run simulation and training on a single Amazon EC2 build and test instance to validate that the container is simulating your environment and robot correctly and that the training is executing as expected. Then push the validated container to Amazon ECR.
- Launch an AWS Batch job – Next, launch an AWS Batch job with the validated container, scaling horizontally to multiple nodes (a boto3 sketch follows this list). AWS Batch takes care of orchestration, persists checkpoints and training results to Amazon EFS, and writes logs to Amazon CloudWatch.
- Evaluate the trained model – Finally, play the trained model on the Amazon EC2 build and test instance to observe its performance and note any opportunities for improvement. You can continue to evolve the training and simulation models, using AWS Batch for rapid iteration to get sophisticated new robot behaviors to market faster.
- Cleanup – Terminate or delete the AWS resources you used.
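To illustrate the "Launch an AWS Batch job" step, the sketch below registers a two-node MNP job definition and submits a job with boto3. The job definition name, job queue name, and container image URI are placeholders; in practice you would use the resources created by the workshop's CloudFormation template.

```python
# Sketch: register and submit a multi-node parallel (MNP) job with boto3.
# Job definition, queue, and image names are placeholders (hypothetical);
# the workshop's CloudFormation template creates the real resources.
import boto3

batch = boto3.client("batch")

# A multi-node job definition: 2 nodes, node 0 is the main node, every node
# runs the same Isaac Lab container with one GPU.
job_def = batch.register_job_definition(
    jobDefinitionName="isaac-lab-mnp",  # placeholder name
    type="multinode",
    nodeProperties={
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",  # applies to all nodes in the job
                "container": {
                    "image": "<account>.dkr.ecr.<region>.amazonaws.com/isaac-lab:latest",  # placeholder
                    "resourceRequirements": [
                        {"type": "VCPU", "value": "16"},
                        {"type": "MEMORY", "value": "65536"},
                        {"type": "GPU", "value": "1"},
                    ],
                },
            }
        ],
    },
)

batch.submit_job(
    jobName="isaac-lab-humanoid-training",
    jobQueue="isaac-lab-queue",  # placeholder queue name
    jobDefinition=job_def["jobDefinitionArn"],
)
```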
Conclusion
In this post, you learned how to use AWS Batch MNP jobs to expedite training of complex robotics systems with NVIDIA Isaac Lab, achieving sophisticated behaviors in a fraction of the time it takes on a single-GPU instance. You have also seen how Docker containers greatly expedite the setup and provide reusable assets and consistent standards across large, distributed training and simulation programs. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads, helping to optimize cost while speeding new products to market. If you would like to learn more about how you can use NVIDIA tools on AWS to advance your robotics development or simulation efforts, please reach out to the authors or contact your account team.