AWS Open Source Blog
Modernizing Snowflake Corporate’s Kubernetes Infrastructure with Bottlerocket and Karpenter
Snowflake Corporate IT Cloud Operations reached a critical juncture in its cloud infrastructure evolution. Managing large-scale containerized workloads on Amazon Elastic Kubernetes Service (Amazon EKS) demanded a modern, secure, and efficient operating system. The existing setup, running on Amazon Linux 2 (AL2), was functional but presented several challenges. Security hardening required frequent updates and patching, increasing operational overhead. Ensuring consistent and secure updates across a large fleet of nodes proved cumbersome. Additionally, AL2 nodes were slow to boot, which made scaling less efficient. After thorough evaluation, Bottlerocket, AWS’s container-optimized operating system, emerged as the ideal candidate to address these challenges.
Migration Strategy
Transitioning from AL2 to Bottlerocket was more than just a technical shift; it was a strategic decision to future-proof Snowflake Corporate’s Kubernetes infrastructure. Given the scale and complexity of workloads, the migration was designed to ensure zero downtime, minimal disruptions, and seamless scaling through automation. To accomplish this, Snowflake Corporate selected Karpenter, an open source Kubernetes cluster autoscaler, along with its NodePool and NodeClass resources, to facilitate dynamic node provisioning. The migration was executed in phases to minimize risks and ensure stability.
Migration Steps
The migration began with cluster preparation. Bottlerocket AMIs were integrated into the EKS environment by modifying the NodePool and NodeClass configurations to use Bottlerocket as the default AMI family. AWS Identity and Access Management (IAM) policies were optimized to align with Bottlerocket’s security model, following the principle of least privilege.
This architectural diagram visualizes the migration strategy:
Karpenter deployment replaced the traditional static provisioning approach, enabling just-in-time node provisioning. Workload validation followed, with the staging environment used to test workloads on Bottlerocket nodes before production rollout. Performance monitoring was implemented using Fluentd and Datadog to track real-time metrics, and security compliance tests ensured that Bottlerocket’s immutable infrastructure aligned with Snowflake Corporate’s security policies.
The rollout was then phased, starting with stateless applications. Node affinity, pod anti-affinity, and categories were used to ensure optimal workload distribution. A gradual introduction of Bottlerocket nodes ensured workloads transitioned smoothly alongside existing AL2 instances. Node cordoning and draining helped decommission AL2 instances without service interruptions.
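As an illustration of the placement rules described above, the following minimal sketch builds a pod-spec affinity stanza in Python and prints it as JSON. It is not Snowflake Corporate's actual configuration: the app label and the NodePool name are hypothetical, and it assumes the Bottlerocket nodes are launched by a Karpenter NodePool, so they carry the `karpenter.sh/nodepool` label. Both rules are soft preferences so that pods can still land on AL2 nodes while the two fleets run side by side.

```python
import json

# Hypothetical names, used only for illustration.
APP_LABEL = "web-frontend"
BOTTLEROCKET_NODEPOOL = "bottlerocket-general"


def build_affinity(app_label: str, nodepool: str) -> dict:
    """Return a pod-spec `affinity` stanza for a stateless workload."""
    return {
        "nodeAffinity": {
            # Soft preference for nodes launched by the Bottlerocket NodePool.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "karpenter.sh/nodepool",
                                "operator": "In",
                                "values": [nodepool],
                            }
                        ]
                    },
                }
            ]
        },
        "podAntiAffinity": {
            # Soft preference to keep replicas of the same app on different nodes.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        },
    }


affinity = build_affinity(APP_LABEL, BOTTLEROCKET_NODEPOOL)
print(json.dumps(affinity, indent=2))
```

The printed stanza would sit under `spec.template.spec.affinity` in a Deployment. Switching the node affinity to the `required...` form would pin the workload to Bottlerocket nodes once the AL2 fleet is gone.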
Finally, enhanced monitoring and optimization were implemented. Automated scaling with Karpenter dynamically adjusted the cluster’s node pool. Performance tuning was conducted based on real-world workloads, and observability improvements provided insights into system health, allowing proactive issue resolution.
Example of defining a NodeClass and associating it with a NodePool:
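One way such a pair might look is sketched below in Python, which emits the two manifests as a JSON list that `kubectl apply -f` can consume. This is a sketch under stated assumptions, not the configuration used in the migration: it assumes the Karpenter v1 API on AWS (`EC2NodeClass` in `karpenter.k8s.aws/v1`, `NodePool` in `karpenter.sh/v1`), and the cluster name, IAM role name, discovery tag, instance requirements, and CPU limit are all placeholders. Field names should be checked against the Karpenter version installed in the cluster.

```python
import json

# Placeholder values; none of these come from the original migration.
CLUSTER_NAME = "example-cluster"

# NodeClass: AWS-specific settings, including Bottlerocket as the AMI family.
node_class = {
    "apiVersion": "karpenter.k8s.aws/v1",
    "kind": "EC2NodeClass",
    "metadata": {"name": "bottlerocket"},
    "spec": {
        "amiFamily": "Bottlerocket",
        "amiSelectorTerms": [{"alias": "bottlerocket@latest"}],
        "role": f"KarpenterNodeRole-{CLUSTER_NAME}",
        "subnetSelectorTerms": [
            {"tags": {"karpenter.sh/discovery": CLUSTER_NAME}}
        ],
        "securityGroupSelectorTerms": [
            {"tags": {"karpenter.sh/discovery": CLUSTER_NAME}}
        ],
    },
}

# NodePool: scheduling constraints, tied to the NodeClass via nodeClassRef.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "bottlerocket-general"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": node_class["metadata"]["name"],
                },
                "requirements": [
                    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
                    {"key": "karpenter.sh/capacity-type", "operator": "In", "values": ["on-demand"]},
                ],
            }
        },
        # Upper bound on the total CPU this NodePool may provision.
        "limits": {"cpu": "1000"},
    },
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [node_class, node_pool]}
print(json.dumps(manifest, indent=2))
```

The association between the two resources is the `nodeClassRef` block: the NodePool decides when and what kind of capacity to launch, and the referenced NodeClass decides how the instance is built. In production, pinning the alias to a specific Bottlerocket version rather than `latest` gives more control over when nodes change.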
Challenges & How They Were Addressed
Despite the advantages of Bottlerocket, the migration process presented challenges. Some workloads initially experienced incompatibilities with Bottlerocket’s immutable filesystem. This was resolved by modifying application images to be fully container-compliant and leveraging read-only configurations where applicable. Bottlerocket required reconfiguring IAM roles to align with its security model, which was resolved by implementing fine-grained access controls and leveraging Karpenter’s IAM integration. To mitigate risks, workloads were migrated incrementally, ensuring that application performance remained stable before fully decommissioning AL2 nodes.
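The incremental decommissioning of AL2 nodes can be sketched as a small batching helper. The Python below only generates `kubectl cordon` and `kubectl drain` commands in batches; it does not run them. The node names and batch size are hypothetical, and the health check between batches is left as a manual step because the original does not describe how stability was verified.

```python
from typing import Iterator, List


def drain_commands(al2_nodes: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield one batch of kubectl commands at a time for the given AL2 nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(al2_nodes), batch_size):
        batch = al2_nodes[start:start + batch_size]
        # Cordon the whole batch first so evicted pods cannot land on its nodes.
        commands = [f"kubectl cordon {node}" for node in batch]
        # Then drain each node; DaemonSet pods are skipped, emptyDir data is dropped.
        commands += [
            f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data"
            for node in batch
        ]
        yield commands


# Hypothetical node names.
nodes = ["al2-node-a", "al2-node-b", "al2-node-c"]
batches = list(drain_commands(nodes, batch_size=2))
for number, commands in enumerate(batches, start=1):
    print(f"# batch {number}: verify workload health before continuing")
    for command in commands:
        print(command)
```

Keeping batches small limits how many pods are rescheduled at once, which gives Karpenter time to bring up replacement Bottlerocket capacity before the next batch is drained.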
Key Benefits
The migration delivered substantial improvements in security, performance, and operational efficiency. Security was enhanced with immutable nodes preventing unauthorized changes and eliminating configuration drift. The reduced attack surface, due to the removal of package managers, shell access, and SSH, reduced vulnerabilities. Automated, atomic updates ensured nodes remained securely patched without downtime.
Nodes booted faster, reducing the time required for new nodes to join the cluster, and improved autoscaling efficiency ensured workloads were rescheduled quickly. Karpenter’s dynamic scaling improved operational efficiency by provisioning resources only when needed, which avoided over-provisioning. Cost savings were realized through Bottlerocket’s lightweight OS and Karpenter’s intelligent provisioning.
Performance Gains: Bottlerocket vs. AL2
Bottlerocket consistently demonstrated faster node readiness. Preliminary benchmarks showed that Bottlerocket reduced node readiness time by approximately 5 seconds compared to AL2. Native container image caching saved about 36 more seconds per pod on a fresh node, so pods that were unschedulable until a new node joined started running approximately 40 seconds sooner than on AL2.
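The roughly 40-second figure is consistent with adding the two quoted savings together. The short Python check below makes that arithmetic explicit; it assumes the node readiness and image caching savings apply one after the other to a pod waiting on a fresh node, which the benchmarks do not state outright.

```python
# Savings versus AL2 quoted in the preliminary benchmarks, in seconds.
NODE_READINESS_SAVING_S = 5
IMAGE_CACHING_SAVING_S = 36

# Assumption: the two savings are sequential and therefore additive.
combined_saving_s = NODE_READINESS_SAVING_S + IMAGE_CACHING_SAVING_S
print(f"Estimated saving per pending pod: about {combined_saving_s} seconds")
```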
Security Enhancements: AL2 vs. Bottlerocket
A direct comparison of security improvements highlights why Bottlerocket was the superior choice:
Lessons Learned
The migration taught valuable lessons. Security and efficiency go hand-in-hand, with Bottlerocket’s immutable design strengthening Snowflake Corporate’s security posture. Automation simplified complexity, as Karpenter’s real-time scaling eliminated manual interventions. Incremental migration minimized risk, and phased rollouts allowed for fine-tuning configurations without production impact.
Conclusion: Broader Implications for Enterprises Running EKS at Scale
The successful migration of Snowflake Corporate’s Kubernetes infrastructure to Bottlerocket and Karpenter offers a model that other organizations can follow. The benefits of enhanced security, faster provisioning, and operational efficiency can be replicated across other enterprises managing Kubernetes at scale. Future enhancements could include AI-driven workload scheduling, deeper integration with observability tools, and exploring serverless Kubernetes with Bottlerocket. By adopting Bottlerocket and Karpenter, Snowflake Corporate strengthened its security posture and improved performance through dynamic scaling, underscoring the power of modern cloud-native solutions in enabling high-performance, resilient Kubernetes environments.