Running GPU-based container applications with HAQM ECS Anywhere
Over the past decade, tens of thousands of customers have migrated their on-premises workloads to the cloud. However, we've also seen a number of workloads that cannot simply move to the cloud; instead, they need to remain on premises due to data residency, network latency, regulatory, or compliance considerations.
Back in May 2021, HAQM Elastic Container Service (HAQM ECS) announced the general availability of HAQM ECS Anywhere (ECS Anywhere) to solve the use cases described earlier, as a simplified way for customers to run and manage containerized applications on-premises. ECS Anywhere added a new `EXTERNAL` container launch type to HAQM ECS, in addition to the existing `EC2` and `FARGATE` launch types. With this new capability, customers can now run containers on their own compute hardware using the HAQM ECS APIs in the AWS Region, without running and operating their own container orchestrators.
GPU-based workloads now enabled with ECS Anywhere
While HAQM ECS and ECS Anywhere make it easy for customers to leverage their own hardware to solve problems using containers across their hybrid footprint, customers previously needed a solution other than ECS Anywhere to run GPU-based container workloads in their data centers. Now HAQM ECS supports GPU-based container workloads with ECS Anywhere, which lets customers run those workloads with the same experience they have with HAQM ECS in the AWS Region.
GPUs are widely used today in various areas: not only machine learning, but also 3D visualization, image processing, and big data workloads, for instance. With HAQM ECS Anywhere GPU support, you can now run and manage those workloads using containers in your data centers, without needing to transfer data to the cloud or to operate your own container orchestrators for those workloads. This is also great news for customers who have made significant investments in their on-premises GPUs, because they can now use ECS Anywhere to make use of that existing GPU investment while removing the operational overhead of current toolsets like Docker Swarm or Kubernetes.
Walk-through: ECS Anywhere with GPU support
Let's briefly walk through the new ECS Anywhere capability step by step. We're first going to 1) obtain a registration command, then 2) register a machine with a GPU device to an existing HAQM ECS cluster. Next we will 3) register a simple HAQM ECS task definition, and finally 4) run an HAQM ECS task on the external machine through the HAQM ECS APIs.
In the following steps I’m going to use my empty HAQM ECS cluster named “ecsAnywhere-gpu”, but of course you can use any existing ECS cluster to follow the steps.
1. Obtaining a registration command for an external instance
At this moment there is no registered ECS instance in this HAQM ECS cluster, as you can see in the screenshot below. To register external instances to an HAQM ECS cluster, the first thing to do is select Register External Instances. Note that only the HAQM ECS console version 1 supports ECS Anywhere today, so make sure you're using the console with the top-left checkbox New ECS Experience turned off.
Now you'll see a dialog window like the one in the following figure. The underlined instance role "ecsExternalInstanceRole" is an IAM role for external instances, which I created beforehand based on the steps in the ECS Anywhere documentation. Make sure you have created one before proceeding to the next step.
I’m going to use the default values for Activation key duration and Number of instances here, but you can change them based on your needs. See the details in the ECS Anywhere documentation.
Select Next step, and you will get a registration command in the next view, which you will execute on your external machine to register it to the HAQM ECS cluster.
Copy the registration command shown in the dialog window, and paste it into your text editor to add the `--enable-gpu` flag at the end of the command, as follows. Before proceeding to the next step, you may want to close the dialog window by selecting Done in the ECS management console.
# NOTE: I added a few line breaks to make this easier to read
curl --proto "https" \
-o "/tmp/ecs-anywhere-install.sh" "http://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh" \
&& bash /tmp/ecs-anywhere-install.sh \
--region "us-east-1" \
--cluster "ecsAnywhere-gpu" \
--activation-id "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx" \
--activation-code "XXXXxxxxXXXXxxxxXXXX" \
--enable-gpu # ADD THIS!
2. Execute the registration command on the external machine
Once you get a shell on your external machine, execute the registration command you edited in the previous step as root. It automatically sets up and registers your machine, and the machine will then show up as an external ECS instance in the ECS management console.
After a few minutes, you should see "You can check your ECS cluster here <ECS Console URL>" in the command output, as below. If the command fails, please check the network requirements and the supported operating systems in the ECS Anywhere documentation. Also, HAQM ECS and ECS Anywhere today support the NVIDIA kernel drivers and Docker GPU runtimes to schedule HAQM ECS tasks. To install and configure those required components on your external machine, please read the guide from NVIDIA (this one is for Tesla, for example).
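Before (or after) running the registration command, you can quickly confirm that the NVIDIA drivers and the Docker GPU runtime are working on the machine itself. A minimal sketch, assuming Docker 19.03 or later with the NVIDIA Container Toolkit installed:

```shell
# Check that the NVIDIA kernel driver is loaded and can talk to the GPU
nvidia-smi

# Check that Docker can hand the GPU to a container
# (the --gpus flag requires Docker 19.03+ and the NVIDIA Container Toolkit)
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

If both commands print the familiar nvidia-smi table, the machine is ready for the `--enable-gpu` registration.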
Let’s open the URL to see what’s happening on your HAQM ECS cluster now.
Boom! Now HAQM ECS is aware of your external GPU machine as capacity for the HAQM ECS cluster to run your HAQM ECS tasks 🙌
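You can verify the same thing from the AWS CLI instead of the console. A sketch, assuming the cluster name "ecsAnywhere-gpu" from this walk-through and credentials configured for the same Region:

```shell
# List the container instances registered to the cluster
aws ecs list-container-instances --cluster ecsAnywhere-gpu

# Describe one of them; for an external GPU instance you should see a
# "GPU" resource under registeredResources, and ecs.capability.external
# among the attributes. Replace the ARN with one from the list above.
aws ecs describe-container-instances \
  --cluster ecsAnywhere-gpu \
  --container-instances <container-instance-arn>
```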
3. Register a sample HAQM ECS task definition
The final step before running an HAQM ECS task is registering the sample HAQM ECS task definition below. As you can see, the HAQM ECS task you're going to deploy uses the `nvidia/cuda` container image and runs the famous `nvidia-smi` command (it simply prints some information about the GPU devices in your machine and then exits immediately) to make sure the task is using a GPU device in your external machine.
{
  "containerDefinitions": [
    {
      "memory": 200,
      "essential": true,
      "name": "cuda",
      "image": "nvidia/cuda:11.0-base",
      "resourceRequirements": [
        {
          "type": "GPU",
          "value": "1"
        }
      ],
      "command": [
        "sh", "-c", "nvidia-smi"
      ],
      "cpu": 100
    }
  ],
  "family": "example-ecs-anywhere-gpu"
}
Open the task definition window in the ECS management console, then select Create new Task Definition to register the above JSON.
Choose EXTERNAL, then select Next step.
Scroll down to the bottom of the next window and select Configure via JSON; you'll then see a dialog window to configure this task definition via JSON directly.
Paste the sample HAQM ECS task definition JSON described earlier in this section, then select Save at the right bottom.
Then select Create to register your HAQM ECS task definition, and you'll see a message like "Created Task Definition successfully" at the top of the window.
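If you prefer the AWS CLI over the console, the same registration can be done in one command. A sketch, assuming the task definition JSON above is saved locally as task-def.json (a hypothetical file name):

```shell
# Save the task definition JSON from this post as task-def.json first,
# then register it; ECS returns the new task definition revision on success
aws ecs register-task-definition --cli-input-json file://task-def.json
```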
4. Run a GPU-based HAQM ECS task in the external ECS instance
Finally, you’re now ready to run your first GPU-based HAQM ECS task in the external ECS instance. Select Clusters in the side bar in the ECS management console, then select your HAQM ECS cluster in the cluster list.
In the cluster view you opened, select the Tasks tab and then Run new task.
In the next window, ensure you choose EXTERNAL and the task definition "example-ecs-anywhere-gpu" you registered in the previous step, then select Run Task at the bottom right of the window.
It will redirect you to the HAQM ECS cluster window, and you may find there is one HAQM ECS task scheduled onto the external ECS instance as shown in the screenshot below.
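The same step also works from the AWS CLI. A sketch, assuming the cluster and task definition names used in this walk-through:

```shell
# Run one task on the external capacity; note the EXTERNAL launch type
aws ecs run-task \
  --cluster ecsAnywhere-gpu \
  --launch-type EXTERNAL \
  --task-definition example-ecs-anywhere-gpu
```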
Let's select the Task ID (d2e6e7… in this case), as highlighted in the above screenshot, to see the HAQM ECS task details. It may still show PENDING when you open the next window, but it will soon be updated to RUNNING and then STOPPED. You may also find that the container successfully ran the `nvidia-smi` command and exited with Exit Code "0", as shown below.
If you want to see the actual output from the `nvidia-smi` command, execute `docker ps -a` on the external instance to find the container ID, and then execute `docker logs <container ID>` to see the output as container logs, like below. For production workloads, you may also want to configure an HAQM ECS task execution IAM role and the `awslogs` log driver for your HAQM ECS task, for example, to collect and view your container logs in the AWS Region instead of SSHing into the external instances.
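As a sketch, an `awslogs` configuration like the following could be added to the container definition shown earlier; the log group name and stream prefix here are hypothetical examples, and the task execution role must be allowed to write to CloudWatch Logs:

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/example-ecs-anywhere-gpu",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "cuda"
  }
}
```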
Conclusion
You can extend your HAQM ECS cluster in an AWS Region to your data centers by registering your on-premises servers as capacity for the cluster with ECS Anywhere. This allows you to use the existing HAQM ECS APIs and a control plane fully managed by AWS to run containers on the external capacity. There is no need to run and operate your own container orchestrator just to run containers in your data centers, and with GPU support in ECS Anywhere this is now also true for GPU-based container workloads.
We encourage you to give it a try with the ECS Anywhere workshop as a next step. Starting from creating a virtual machine with Vagrant on your laptop, you'll learn the full lifecycle of an ECS Anywhere workload: understanding how ECS Anywhere works behind the scenes, building CI/CD pipelines for container deployments, and ensuring workload observability with various AWS services. In addition to the workshop, the following links would also be useful to dig deeper into ECS Anywhere.
- HAQM ECS Anywhere official documentation
- Introducing HAQM ECS Anywhere | Containers
- Building an HAQM ECS Anywhere home lab with HAQM VPC network connectivity | Containers (My personal favorite one. Lovely clustered Raspberry Pis!)
Lastly, as always, we are keen to hear from you about what we should build next in the Containers Roadmap GitHub repository. Feel free to create new GitHub issues or comment on existing ones, and let us know your great ideas along with your use cases!