AWS Partner Network (APN) Blog

How AWS and Intel make LLMs more accessible and cost-effective with DeepSeek

By Anish Kumar, AI Software Engineering Manager – Intel
By Dylan Souvage, Solutions Architect – AWS
By Vishwa Gopinath Kurakundi, Solutions Architect – AWS

Enterprises are seeking efficient ways to implement Large Language Models (LLMs): they want to leverage LLMs, but they need solutions that balance performance with cost efficiency.

During the recent AWS re:Invent conference in Las Vegas, HAQM CEO Andy Jassy shared three valuable lessons learned from HAQM’s own experience building over 1,000 GenAI applications:

  • Cost efficiency at scale is crucial for GenAI applications
  • Building effective GenAI applications requires careful consideration
  • Model diversity is essential – there isn’t a one-size-fits-all solution

These insights guide how AWS approaches GenAI implementation with customers. At AWS, we recognize that flexibility and choice matter to our customers. Andy Jassy highlighted how our wide range of LLMs lets customers find the right tools for their specific needs. Thanks to deep collaboration with AWS Partners such as Intel®, AWS continuously expands its curated LLM offerings, increasing customer access.

Intel and AWS

The collaboration between AWS and Intel dates back to 2006, when we launched HAQM Elastic Compute Cloud (HAQM EC2) featuring Intel’s chips. Over 19 years, this collaboration has grown to deliver cloud services that optimize costs, simplify operations, and meet evolving business needs. Intel processors provide the foundation of many cloud computing services deployed on AWS, and HAQM EC2 instances powered by Intel® Xeon® Scalable processors offer the largest breadth, global reach, and availability of compute instances across AWS Regions. In September 2024, AWS and Intel announced a co-investment in custom chip designs under a multi-year, multi-billion-dollar framework covering products and wafers from Intel. This expansion of the two companies’ longstanding collaboration helps customers power virtually any workload and speed up the performance of artificial intelligence (AI) applications.

DeepSeek

AWS and Intel are working together to make Large Language Models (LLMs) more accessible and cost-effective for enterprises. Distilled language models, often referred to as Small Language Models (SLMs), are emerging as a practical approach: they preserve much of a larger model’s quality while requiring fewer resources and can run directly on CPUs. Running SLM training and inference on CPUs unlocks performant AI within time and cost constraints. DeepSeek’s models, including DeepSeek-R1, have rapidly gained popularity since launch for their efficiency, cost-effectiveness, and open-source licensing, which allows them to be deployed freely in applications. DeepSeek also offers distilled versions of its models, in which a smaller (student) model is trained to generate responses that match the quality of a larger (teacher) model while requiring fewer resources.
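To make the teacher-student idea concrete, the sketch below shows a classic logit-based distillation loss in PyTorch. It is only illustrative: the function name, temperature, and weighting are assumptions for this example, and DeepSeek’s distilled models are produced with their own training recipe rather than this exact loss.

```python
# Illustrative sketch of logit-based knowledge distillation (not DeepSeek's actual recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend a soft-target term (match the teacher) with standard cross-entropy on the labels."""
    # Soften both distributions with a temperature, then measure how far the student is from the teacher.
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, log_target=True, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```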

HAQM EC2 is a cost-effective platform to deploy LLMs, and it provides instances that run on Intel® Xeon® Scalable processors, suitable for deploying optimized models like distilled DeepSeek-R1. 4th Gen Intel® Xeon® Scalable processors and later feature Intel® Advanced Matrix Extensions (Intel® AMX), built-in accelerators that significantly boost LLM workload performance by accelerating the matrix multiplication operations fundamental to LLM inference. These AMX accelerators deliver throughput gains while integrating with open standards such as oneAPI, providing enterprises with cost-effective and scalable solutions for deploying Generative AI applications with faster time-to-insight and reduced total cost of ownership (TCO).
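As a rough illustration of how these optimizations are used in practice, the sketch below loads a Hugging Face checkpoint in bfloat16 and applies Intel® Extension for PyTorch so that inference can take advantage of AMX on a 4th Gen Xeon or later instance. The model ID, prompt, and generation settings are example values, not a prescribed configuration.

```python
# Illustrative sketch: enabling Intel Extension for PyTorch (IPEX) for CPU inference.
# The model ID and prompt are example values; adjust for your own deployment.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies operator fusions and selects AMX/bfloat16-friendly kernels on supported Xeons.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Explain model distillation in one sentence.", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```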

HAQM EC2 also provides flexibility and scalability, supporting various deployment configurations, including the open-source vLLM inference and serving engine, which runs in Docker and integrates seamlessly with the Hugging Face Hub. In this companion blog post, we take a step-by-step look at how you can quickly deploy a DeepSeek-R1-Distill-Llama-8B model on an HAQM EC2 m7i.2xlarge instance, which uses Intel® Xeon® Scalable processors and provides 8 vCPUs and 32 GB of memory. The companion post walks through configuring HAQM EC2, building the vLLM Docker container for CPU with Intel’s CPU optimizations, including Intel® Extension for PyTorch, which ensures LLM inference is optimized for 4th Gen Intel® Xeon® processors and later, and wraps up with testing inference once the model is deployed.
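For context on what that final test step looks like, the sketch below queries a running vLLM server through its OpenAI-compatible API using the openai Python client. The endpoint address, API key placeholder, and model name are assumptions for this example; the companion post covers the actual container build and launch.

```python
# Illustrative sketch: querying a vLLM server's OpenAI-compatible endpoint from Python.
# Endpoint, key, and model name are example values for a locally running container.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Why can distilled models run well on CPUs?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```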

Conclusion

Businesses can deploy custom and open-source LLMs, including distilled DeepSeek-R1 models, on AWS using managed services such as HAQM Bedrock and HAQM SageMaker, or directly on HAQM EC2, adapting the approach to their specific business requirements. The collaboration between AWS and Intel advances the Generative AI landscape, combining Intel’s semiconductor technology with AWS’s cloud infrastructure to deliver accessible, cost-effective AI solutions.

For more about AWS in the Generative AI space, visit our Machine Learning blog.

Intel – AWS Partner Spotlight

Intel and HAQM Web Services (AWS) have collaborated for over 19 years to develop flexible technologies and software optimizations tailored for critical enterprise workloads. This collaboration allows our AWS Partners to help their customers migrate and modernize their applications and infrastructure to manage cost and complexity, accelerate business outcomes, and scale to meet current and future computing requirements.
Contact Intel | Partner Overview | AWS Marketplace