How VMware Tanzu CloudHealth modernized container workloads from self-managed Kubernetes to HAQM Elastic Kubernetes Service
This post is co-written with Rivlin Pereira, Staff DevOps Engineer at VMware
Introduction
VMware Tanzu CloudHealth is the cloud cost management platform of choice for more than 20,000 organizations worldwide that rely on it to optimize and govern the largest and most complex multi-cloud environments. In this post, we will talk about how VMware Tanzu CloudHealth migrated their container workloads from self-managed Kubernetes on HAQM Elastic Compute Cloud (HAQM EC2) to HAQM Elastic Kubernetes Service (HAQM EKS). We will discuss lessons learned and how the migration helped achieve the eventual goal of making cluster deployments fully automated with a one-click solution, scalable, and secure, while reducing the overall operational time spent managing these clusters. This migration led them to scale their production cluster footprint from 2,400 pods running in a kOps (short for Kubernetes Operations) cluster on HAQM EC2 to over 5,200 pods on HAQM EKS. The HAQM EKS cluster footprint has also grown from a handful of clusters right after the migration to 10 clusters in total across all environments, and it continues to grow.
HAQM EKS is a managed Kubernetes service to run Kubernetes in the AWS cloud and on-premises data centers. In the cloud, HAQM EKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks.
Previous self-managed K8s clusters and related challenges
The self-managed Kubernetes clusters were deployed using kOps. These clusters required significant Kubernetes operational knowledge and maintenance time. While the clusters were quite flexible, the VMware Tanzu CloudHealth DevOps team was responsible for the inherent complexity, including custom HAQM Machine Image (AMI) creation, security updates, upgrade testing, control-plane backups, cluster upgrades, networking, and debugging. The clusters grew significantly and encountered limits that were not correctable without significant downtime, which is when the team considered moving to a managed solution.
Key drivers to move to a managed solution
The VMware Tanzu CloudHealth DevOps team had the following requirements for the HAQM EKS clusters:
- Consistently reproducible and deployed with a one-click solution for automated Infrastructure-as-Code (IaC) deployment across environments.
- Consistent between workloads.
- Deployable in multiple regions.
- Services can migrate from the old clusters to the new clusters with minimal impact.
- New clusters provide more control over roles and permissions.
- New cluster lifecycle tasks (i.e., creation, onboarding users, and cluster upgrades) reduce operational load.
Key technical prerequisite evaluation
We will discuss a couple of technical aspects that customers should evaluate in order to avoid surprises during the migration. HAQM EKS uses upstream Kubernetes; therefore, applications that run on Kubernetes should run natively on HAQM EKS without the need for modification. The following are some key technical considerations, discussed in the post Migrating from self-managed Kubernetes to HAQM EKS? Here are some key considerations, that the VMware team evaluated and for which they implemented the required changes:
- Kubernetes versions:
- The VMware team was running Kubernetes version 1.16 on kOps. For the HAQM EKS migration, the team started with Kubernetes version 1.17, and post migration they have upgraded to 1.24.
- Security:
- Authentication for HAQM EKS cluster: The kOps clusters were configured to use Google OpenID for identity and authentication. HAQM EKS supports both OpenID Connect (OIDC) identity providers and AWS Identity and Access Management (AWS IAM) as methods to authenticate users to your cluster. To take advantage of HAQM EKS support for AWS IAM for identity and authentication, VMware made user configuration and authentication workflow changes to access the new clusters. Please see Updating kubeconfig for more information; a sketch of the resulting kubeconfig entry appears after this list.
- AWS IAM roles for service accounts: VMware had configured AWS IAM roles for pods using kube2iam on the kOps self-managed clusters. With this setup, pod-level permissions were granted by IAM via a proxy agent that had to run on every node, and this setup resulted in issues at scale. HAQM EKS enables a different approach: AWS permissions are granted directly to pods via their service account through a mutating webhook on the control plane. Communication for identity, authentication, and authorization happens only with the AWS API endpoints and the Kubernetes API, eliminating any proxy agent requirement. Review Introducing fine-grained IAM roles for service accounts for more information. The migration to IAM roles for service accounts (IRSA) on HAQM EKS fixed the issues encountered with kube2iam when running at larger scales and has other benefits:
- Least privilege: By using the IAM roles for service accounts feature, they no longer need to provide extended permissions to the worker node IAM role so that pods on that node can call AWS APIs. IAM permissions can be scoped to a service account, and only pods that use that service account have access to those permissions.
- Credential isolation: A container can only retrieve credentials for the IAM role that is associated with the service account to which it belongs. A container never has access to credentials that are intended for another container that belongs to another pod.
- Auditability: Access and event logging is available through AWS CloudTrail to help ensure retrospective auditing.
- Networking: VMware had set up the kOps clusters using Calico as an overlay network. On HAQM EKS, they decided to implement the HAQM VPC CNI plugin for Kubernetes, as it assigns IPs from the VPC Classless Inter-Domain Routing (CIDR) range to each pod. This is accomplished by adding secondary IPs to the EC2 node's elastic network interfaces. Each HAQM EC2 node type has a supported number of elastic network interfaces (ENIs) and a corresponding number of secondary IPs assignable per ENI. Each EC2 instance starts with a single ENI attached and adds ENIs as required by pod assignment.
- VPC and subnet sizing: VMware created a single VPC with a /16 CIDR range in production to deploy the HAQM EKS cluster. For the development and staging environments, they created multiple HAQM EKS clusters in a single VPC with a /16 CIDR to save on IP space. For each VPC, private and public subnets were created, and the HAQM EKS clusters were created in the private subnets. A NAT gateway was configured for outbound public access, and the subnets were appropriately tagged for internal use.
- Tooling to create HAQM EKS clusters: VMware reviewed AWS recommended best practices for cluster configuration. For cluster deployment, a common practice is IaC, and there are several options, such as AWS CloudFormation, eksctl (the official CLI tool for HAQM EKS), the AWS Cloud Development Kit (AWS CDK), and third-party solutions like Terraform. They decided to automate the deployment of the HAQM EKS clusters using a combination of community Terraform modules and Terraform modules developed in-house (an illustrative cluster definition sketch follows this list). Customers can also check HAQM EKS Blueprints for cluster creation.
- HAQM EKS node groups (managed/unmanaged): HAQM EKS allows the use of both managed and self-managed node groups. Managed node groups offer significant advantages at no extra cost, including offloading of OS updates and security patching by using the HAQM EKS optimized AMI, where HAQM EKS is responsible for building patched versions of the AMI when bugs or issues are reported. HAQM EKS follows the shared responsibility model for Common Vulnerabilities and Exposures (CVE) and security patches on managed node groups: it is the customer's responsibility to deploy these patched AMI versions to their managed node groups. Other features of managed node groups include automatic node draining via the Kubernetes API during terminations or updates, respect for pod disruption budgets, and automatic labeling to enable Cluster Autoscaler. Unless there is a specific configuration that cannot be fulfilled by a managed node group, the recommendation is to use managed node groups. Please note that Cluster Autoscaler is not enabled for you by default on HAQM EKS and has to be deployed by the customer. VMware used managed node groups for the migration to HAQM EKS.
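As an illustration of the tooling and managed node group considerations above, the following is a minimal eksctl cluster definition for an HAQM EKS 1.24 cluster with a managed node group in private subnets. VMware used community and in-house Terraform modules rather than eksctl, so this is only a sketch of the eksctl option mentioned above; the cluster name, region, CIDR, and instance type are hypothetical placeholders.

```yaml
# Minimal eksctl cluster definition (illustrative only; VMware used Terraform modules).
# The cluster name, region, CIDR, and instance type are hypothetical placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: example-cluster
  region: us-east-1
  version: "1.24"

vpc:
  cidr: 10.0.0.0/16          # /16 CIDR range, as described above
  nat:
    gateway: Single          # NAT gateway for outbound access from private subnets

iam:
  withOIDC: true             # creates the OIDC provider required for IRSA

managedNodeGroups:
  - name: general-purpose
    instanceType: r5.2xlarge # example memory-optimized instance type
    minSize: 3
    maxSize: 20
    privateNetworking: true  # place worker nodes in the private subnets
```

With eksctl installed, such a file could be applied with eksctl create cluster -f cluster.yaml; community Terraform modules for HAQM EKS expose comparable settings as module inputs.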
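To show what the AWS IAM authentication change described under Security looks like in practice, this is roughly the user entry that aws eks update-kubeconfig generates: kubectl shells out to the AWS CLI for a short-lived token, which the HAQM EKS control plane validates against IAM. The cluster name and region are hypothetical.

```yaml
# Sketch of the kubeconfig user entry produced by "aws eks update-kubeconfig".
# "example-cluster" and the region are placeholders, not VMware's values.
users:
  - name: example-eks-user
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args:
          - eks
          - get-token
          - --cluster-name
          - example-cluster
          - --region
          - us-east-1
```

Access within the cluster is then mapped from the authenticated IAM principal to Kubernetes RBAC via the aws-auth ConfigMap.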
Solution overview
Migration execution
With an architecture defined, the next step was to create the AWS infrastructure and execute the migration of workloads from the self-managed Kubernetes clusters to HAQM EKS.
Using IaC, a parallel set of environments was provisioned for the HAQM EKS clusters alongside the existing kOps infrastructure. This allowed any necessary changes to be made to the Kubernetes manifests while retaining the capability to deploy changes to the existing infrastructure as needed.
Figure a. Pre Cut-over
Walkthrough
Once the infrastructure was provisioned, changes were made to the manifests to align with HAQM EKS 1.17 and the particular integrations that would be required. For example, the annotations to enable IRSA were added alongside the existing kube2iam metadata to allow the workloads to be deployed on both sets of infrastructure in parallel.
kube2iam on the kOps cluster provided AWS credentials by redirecting traffic destined for the HAQM EC2 metadata API from containers to an agent container running on each instance, which called the AWS API to retrieve temporary credentials and returned them to the caller. This function was enabled via an annotation on the pod specification.
To configure a pod to use IAM roles for service accounts, the service account is annotated instead of the pod, as shown in the sketch below.
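The following sketch contrasts the two annotation styles during the parallel deployment period. The names, namespace, image, and account ID are hypothetical placeholders, not VMware's actual configuration.

```yaml
# kube2iam (kOps cluster): the IAM role is requested via an annotation on the pod.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    iam.amazonaws.com/role: example-app-role          # kube2iam pod annotation
spec:
  serviceAccountName: example-app
  containers:
    - name: app
      image: busybox:1.36                             # placeholder image
      command: ["sleep", "3600"]
---
# IRSA (HAQM EKS): the IAM role is attached to the service account that the pod
# references, so the pod specification itself needs no IAM-related annotation.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-app
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-app-role
```

Carrying both annotations during the migration window allowed the same manifests to be deployed to the kOps and HAQM EKS clusters, with each cluster honoring only the mechanism it supports.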
After testing was performed in a pre-production environment, the workloads were promoted to the production environment where further validation testing was completed.
At this stage, it was possible to start routing traffic to the workloads running on HAQM EKS. In the case of this particular architecture, this was accomplished by re-configuring the workloads consuming the APIs to incrementally route a percentage of traffic to the new API endpoints running on HAQM EKS. This allowed the performance characteristics of the new infrastructure to be validated gradually as traffic increased, while retaining the ability to rapidly roll back the change if issues were encountered.
Figure b. Partial Cut-over
Once production traffic was entirely routed to the new infrastructure and confidence was established in the stability of the new system, the original kOps clusters could be decommissioned and the migration completed.
Figure c. Full Cut-over
Lessons learned
The following takeaways can be learned from this migration experience:
- Adequately plan for heterogeneous worker node instance types. VMware started with a memory-optimized HAQM EC2 instance family for their cluster node group, but as the workloads run on HAQM EKS diversified, along with their compute requirements, it became clear that they needed to offer other instance types. This led to dedicated node groups for specific workload profiles (e.g., for compute-heavy workloads). It has also led VMware to investigate Karpenter, an open-source, flexible, high-performance Kubernetes cluster autoscaler built by AWS. Karpenter helps improve application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load.
- Design VPCs to match the requirements of HAQM EKS networking. The initial VPC architecture implemented by VMware was adequate to allow the number of workloads on the cluster to grow, but over time the number of available IPs became constrained. This was resolved by monitoring the available IPs and configuring the VPC CNI with some optimizations for their architecture (an illustrative tuning sketch follows this list). You can review the recommendations for sizing VPCs for HAQM EKS in the best practices guide.
- As HAQM EKS clusters grow, optimizations will likely have to be made to core Kubernetes and third-party components. For example, VMware had to optimize the configuration of Cluster Autoscaler for performance and scalability as the number of nodes grew (see the second sketch after this list). Similarly, it was necessary to leverage NodeLocal DNS to reduce the pressure on CoreDNS as the number of workloads and pods increased.
- Using automation and infrastructure-as-code is recommended, especially as HAQM EKS cluster configuration becomes more complex. VMware took the approach of provisioning the HAQM EKS clusters and related infrastructure using Terraform, and ensured that HAQM EKS upgrade procedures were considered.
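As an example of the kind of VPC CNI tuning mentioned in the IP-addressing lesson above, the plugin's IP allocation behavior is controlled through environment variables on the aws-node DaemonSet. The values below are illustrative, not VMware's production settings; they could be applied as a strategic-merge patch with kubectl -n kube-system patch daemonset aws-node --patch-file cni-tuning.yaml.

```yaml
# cni-tuning.yaml: strategic-merge patch for the aws-node DaemonSet (HAQM VPC CNI).
# The values are illustrative examples, not VMware's actual settings.
spec:
  template:
    spec:
      containers:
        - name: aws-node
          env:
            - name: WARM_IP_TARGET        # keep only a small buffer of pre-allocated IPs per node
              value: "5"
            - name: MINIMUM_IP_TARGET     # pre-allocate enough IPs for the typical pod density
              value: "20"
```

Lowering the warm IP pool reduces the number of VPC addresses each node reserves up front, which helps when a subnet's free IP count becomes constrained.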
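The Cluster Autoscaler tuning called out above is typically done through flags on its deployment. The following is a generic sketch of the kinds of knobs available; the cluster name, image version, and values are placeholders rather than VMware's specific configuration.

```yaml
# Illustrative Cluster Autoscaler deployment; names and values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # assumed to carry IRSA permissions for the Auto Scaling APIs
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.24.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-cluster
            - --balance-similar-node-groups       # treat equivalent node groups as a single pool
            - --expander=least-waste              # choose the node group that wastes the least capacity
            - --scan-interval=30s                 # how often scaling decisions are re-evaluated
            - --scale-down-unneeded-time=10m      # idle time before a node becomes a scale-down candidate
```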
Conclusion
In this post, we walked you through how VMware Tanzu CloudHealth (formerly CloudHealth) migrated their container workloads from self-managed Kubernetes clusters running on kOps to AWS-managed HAQM EKS, with the eventual goal of making cluster deployments a fully automated, one-click, scalable, and secure solution that reduces the overall operational time spent managing these clusters. We walked you through important technical prerequisites to consider for a migration to HAQM EKS, some challenges that were encountered either during or after the migration, and lessons learned. We encourage you to evaluate HAQM EKS for migrating workloads from kOps to a managed offering.