AWS Open Source Blog

Modernizing Snowflake Corporate’s Kubernetes Infrastructure with Bottlerocket and Karpenter

Snowflake Corporate IT Cloud Operations reached a critical juncture in its cloud infrastructure evolution. Managing large-scale containerized workloads on Amazon Elastic Kubernetes Service (Amazon EKS) demanded a modern, secure, and efficient operating system. The existing setup, running on Amazon Linux 2 (AL2), was functional but presented several challenges. Security hardening required frequent updates and patching, increasing operational overhead. Ensuring consistent and secure updates across a large fleet of nodes proved cumbersome. Additionally, AL2 nodes booted comparatively slowly, which made scaling less efficient. After thorough evaluation, Bottlerocket, AWS’s container-optimized operating system, emerged as the ideal candidate to address these challenges.

Migration Strategy

Transitioning from AL2 to Bottlerocket was more than just a technical shift; it was a strategic decision to future-proof Snowflake Corporate’s Kubernetes infrastructure. Given the scale and complexity of workloads, the migration was designed to ensure zero downtime, minimal disruptions, and seamless scaling through automation. To accomplish this, Snowflake Corporate selected Karpenter, an open source Kubernetes cluster autoscaler, and used its NodePool and NodeClass resources to provision nodes dynamically. The migration was executed in phases to minimize risk and ensure stability.

Migration Steps

The migration began with cluster preparation. Bottlerocket AMIs were integrated into the EKS environment by modifying the NodePool and NodeClass configurations to use Bottlerocket as the default AMI family. AWS Identity and Access Management (IAM) policies were optimized to align with Bottlerocket’s security model, following the principle of least privilege.

This architectural diagram visualizes the migration strategy:

Bottlerocket EKS architecture diagram

Karpenter deployment replaced the traditional static provisioning approach, enabling just-in-time node provisioning. Workload validation followed, with the staging environment used to test workloads on Bottlerocket nodes before production rollout. Performance monitoring was implemented using Fluentd and Datadog to track real-time metrics, and security compliance tests ensured that Bottlerocket’s immutable infrastructure aligned with Snowflake Corporate’s security policies.

The rollout was then phased, starting with stateless applications. Node affinity, pod anti-affinity, and categories were used to ensure optimal workload distribution. A gradual introduction of Bottlerocket nodes ensured workloads transitioned smoothly alongside existing AL2 instances. Node cordoning and draining helped decommission AL2 instances without service interruptions.
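
Draining a node evicts its pods, so one common way to keep a service available while AL2 nodes are decommissioned is a PodDisruptionBudget. The following is a minimal sketch, not a manifest taken from Snowflake Corporate's setup: the resource name, the `app: critical-app` label, and the `minAvailable` value are illustrative assumptions.

```yaml
# Hypothetical PodDisruptionBudget; the name, label, and threshold are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  # Evictions triggered by a node drain are refused if they would leave
  # fewer than 2 matching pods running.
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
```

With a budget like this in place, a drain proceeds only as fast as replacement pods become ready on other nodes.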

Finally, enhanced monitoring and optimization were implemented. Automated scaling with Karpenter dynamically adjusted the cluster’s node pool. Performance tuning was conducted based on real-world workloads, and observability improvements provided insights into system health, allowing proactive issue resolution.

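Example of defining a NodeClass (the EC2NodeClass resource in Karpenter's v1 API):

# Assumes the Karpenter v1 API; adjust apiVersion to match the installed release.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket-nodeclass
spec:
  amiFamily: Bottlerocket
  # The v1 API requires amiSelectorTerms; this alias tracks the latest Bottlerocket AMI.
  amiSelectorTerms:
    - alias: bottlerocket@latest
  instanceProfile: "KarpenterNodeInstanceProfile"
  # Subnet selection is required; the tag and cluster name are placeholders.
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - id: "sg-0123456789"

Example of defining a NodePool that references the NodeClass:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: bottlerocket-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket-nodeclass
      # Karpenter expects a requirements list; this entry is an illustrative placeholder.
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
  limits:
    cpu: 1000
  # Replaces the older ttlSecondsAfterEmpty: remove nodes 30 seconds after they become empty.
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s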

Example of applying node affinity to schedule workloads on Bottlerocket nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bottlerocket-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: bottlerocket-app
  template:
    metadata:
      labels:
        app: bottlerocket-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: karpenter.sh/nodepool
                operator: In
                values:
                - bottlerocket-nodepool
      containers:
      - name: app
        image: my-app-image:latest

Example of using pod anti-affinity to spread workloads across different nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - critical-app
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: workload
        image: workload-image:latest
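
A required anti-affinity rule leaves a replica pending when no node without a matching pod is available. During a gradual rollout, a softer variant may be easier to operate. The sketch below is a hypothetical alternative to the Deployment above (it reuses the same labels, so it would replace that Deployment rather than run alongside it), and the weight of 100 is an assumption.

```yaml
# Hypothetical variant of the Deployment above; apply one or the other, not both.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" asks the scheduler to spread replicas across nodes,
          # but still schedules them if spreading is not possible.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: critical-app
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: workload
        image: workload-image:latest
```

The trade-off is weaker isolation: two replicas can share a node when capacity is tight, which the required form forbids.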

Challenges & How They Were Addressed

Despite the advantages of Bottlerocket, the migration process presented challenges. Some workloads initially experienced incompatibilities with Bottlerocket’s immutable filesystem. This was resolved by modifying application images to be fully container-compliant and leveraging read-only configurations where applicable. Bottlerocket required reconfiguring IAM roles to align with its security model, which was resolved by implementing fine-grained access controls and leveraging Karpenter’s IAM integration. To mitigate risks, workloads were migrated incrementally, ensuring that application performance remained stable before fully decommissioning AL2 nodes.
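
One way to read "read-only configurations" is to run containers with a read-only root filesystem and give them an explicit writable scratch volume. The sketch below shows that pattern under stated assumptions: the pod name, image, and `/tmp` mount path are hypothetical and do not come from Snowflake Corporate's workloads.

```yaml
# Hypothetical Pod; the name, image, and mount path are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: readonly-app
spec:
  containers:
  - name: app
    image: my-app-image:latest
    securityContext:
      # The container cannot write to its own root filesystem.
      readOnlyRootFilesystem: true
    volumeMounts:
    # Writable scratch space for temporary files only.
    - name: scratch
      mountPath: /tmp
  volumes:
  - name: scratch
    emptyDir: {}
```

An application that runs cleanly under these settings does not depend on writing outside its declared volumes, which fits an immutable host OS.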

Key Benefits

The migration delivered substantial improvements in security, performance, and operational efficiency. Immutable nodes enhanced security by preventing unauthorized changes and eliminating configuration drift. Removing package managers, shell access, and SSH shrank the attack surface. Automated, atomic updates ensured nodes remained securely patched without downtime.

Nodes booted and joined the cluster faster, so autoscaling could reschedule pending workloads sooner. Karpenter’s dynamic scaling improved operational efficiency by provisioning resources only when needed, which avoided over-provisioning. Bottlerocket’s lightweight OS and Karpenter’s intelligent provisioning also produced cost savings.

Performance Gains: Bottlerocket vs. AL2

Bottlerocket consistently demonstrated faster node readiness. Preliminary benchmarks showed that Bottlerocket nodes became ready approximately 5 seconds sooner than AL2 nodes. Native container image caching saved about 36 seconds more per pod on a fresh node. Taken together, these gains meant that a previously unschedulable pod started roughly 40 seconds sooner than on AL2.

Bottlerocket vs AL2 performance chart

Security Enhancements: AL2 vs. Bottlerocket

A direct comparison of security improvements highlights why Bottlerocket was the superior choice:

Feature comparison chart

Lessons Learned

The migration taught valuable lessons. Security and efficiency go hand-in-hand, with Bottlerocket’s immutable design strengthening Snowflake Corporate’s security posture. Automation simplified complexity, as Karpenter’s real-time scaling eliminated manual interventions. Incremental migration minimized risk, and phased rollouts allowed for fine-tuning configurations without production impact.

Conclusion: Broader Implications for Enterprises Running EKS at Scale

The successful migration of Snowflake Corporate’s Kubernetes infrastructure to Bottlerocket and Karpenter creates a new model for the industry to follow. The benefits of enhanced security, faster provisioning, and operational efficiency can be replicated across other enterprises managing Kubernetes at scale. Future enhancements could include AI-driven workload scheduling, deeper integration with observability tools, and exploring serverless Kubernetes with Bottlerocket. By adopting Bottlerocket and Karpenter, Snowflake Corporate not only enhanced its security posture but also achieved performance improvements through dynamic scaling, underscoring the power of modern cloud-native solutions in enabling high-performance, resilient Kubernetes environments.

Sameeksha Garg

Sameeksha is a Technical Account Manager at AWS committed to accelerating the cloud journey for AWS Global Enterprise customers. She has 7+ years of industry experience across cloud security, cloud operations, cloud infrastructure management, and customer advocacy. She is passionate about cloud security technologies and strives to help customers secure their workloads in the cloud.

Gaurav Singodia

Gaurav Singodia is a high-tech engineering leader at Snowflake with a proven track record of driving innovation and growth through an entrepreneurial mindset. He currently leads a diverse global organization encompassing SRE, Systems Engineers, Software Engineers, Data Infrastructure, Identity Platform, AI/ML, and Analytics, with a strong focus on maintaining high quality and achieving scalability across all domains.

Jagdish Pawar

Jagdish Pawar has over 18 years of leadership experience across technology startups, growth-stage companies, and public corporations. His expertise includes building and leading cross-functional teams, product management, engineering, and managing reliable, secure, and massively scalable cloud operations.

RK Sai (Ravikiran Koduri)

RK Sai (Ravikiran Koduri) is an Enterprise Support Lead at AWS. As a technical advisor, he helps Independent Software Vendors (ISVs) operationalize workloads at scale. RK Sai is an evangelist for AWS DeepRacer, AI, and Cloud Financial Management services. In his free time, he strives to concretize an abstract sense of fulfillment.

Sayan Moitra

Sayan Moitra is a Senior DevOps engineer who specializes in cloud engineering, DevOps, and SRE, with a focus on deploying infrastructure and applications. He holds multiple AWS certifications and CKAD, with recognized expertise in serverless computing. He's passionate about continuous learning and solving complex problems.