- March 11, 2025
- 9 min read
The Complete Guide to Friendli Container AWS EKS Add-On

Maximize Your Gen AI Inference with Friendli Container on AWS EKS
Are you an enterprise using AWS and looking to optimize your generative AI inference at scale? Look no further than the Friendli Container AWS EKS Add-On. Installing this add-on integrates Friendli Container directly into your Amazon EKS workflow, with the convenience of consolidated AWS billing, and unlocks reduced inference costs, faster scaling, and improved throughput for your workloads.
Read on to discover how easy it is to set up and unleash the full potential of this powerful microservice for your enterprise.
Friendli Container: The Ultimate Inference Supercharger
The Friendli Container is a Docker image designed to bring our cutting-edge Friendli Inference solution into your environment. It provides a microservice-based container that incorporates key optimizations from our managed service, allowing you to leverage the fastest AI inference engine on the market, tailored to work seamlessly within your setup. While it doesn’t include every optimization from our managed service, it brings key performance-enhancing features for high-performance inference to your infrastructure.
Optimized to reduce latency, minimize GPU usage, and maximize cost-efficiency, the Friendli Container provides scalable, isolated environments for AI model deployment, helping you achieve superior performance.
- Over 50% reduction in GPU usage
- Over 2x lower latency
- Over 2x higher throughput
While Friendli Container unlocks unprecedented power, truly harnessing its full capabilities requires supporting infrastructure that efficiently manages GPU resources and orchestrates operations.
Amazon EKS: Simplifying Kubernetes Operations
Kubernetes (K8s) is the de facto industry standard for managing containerized applications, enabling businesses to deploy, scale, and manage workloads across environments. With powerful features like automated scaling, load balancing, and self-healing, Kubernetes simplifies the management of complex applications at scale. However, managing Kubernetes efficiently requires deep understanding and expertise. This is where Amazon EKS steps in.
Amazon EKS is a fully managed service that simplifies the deployment, management, and scaling of containerized applications using Kubernetes on AWS. EKS eliminates the complexity of Kubernetes cluster management, offering a secure, scalable, and highly available platform for running containerized workloads. Moreover, it integrates seamlessly with other AWS services, providing a comprehensive solution for orchestrating containers in the cloud.
Thus, many organizations have adopted Amazon EKS for scalable generative AI inference, including:
- Adobe, a leading digital creativity SaaS company, built its generative AI solution, Adobe Firefly, using Amazon EKS.
- Mobileye, an autonomous driving technology company, leverages Amazon EKS for computer vision and AI applications.
- Omi, a startup providing AI-powered 3D rendering solutions, utilizes Amazon EKS to fuel its generative AI models.
Key Benefits of Amazon EKS:
- Fully Managed Kubernetes: AWS takes care of the Kubernetes control plane, removing the need for manual setup and maintenance. This allows teams to focus on applications rather than infrastructure.
- Seamless AWS Integration: EKS integrates smoothly with AWS services like EC2, IAM, S3, and CloudWatch, enabling you to easily enhance your applications with the full range of AWS features.
- Scalability and Flexibility: EKS automatically scales your cluster and workloads based on demand. It supports running applications across multiple AWS Availability Zones, ensuring high availability and resilience.
- Enhanced Security: EKS benefits from AWS's security infrastructure, offering built-in encryption, IAM roles, and network policies to control access and protect your applications.
Figure 1: A Kubernetes cluster in action. Reference: Amazon EKS. [Online] Available: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-concepts.html [Accessed Feb. 26, 2025]
In short, AWS EKS simplifies Kubernetes management, letting you focus on what matters most — building great applications.
Why Deploy Friendli Container on AWS EKS?
If you’re looking to slash costs and boost performance immediately, deploying Friendli Container as an EKS add-on lets you do just that—right within your existing EKS workflow. Here's how:
- Instant Cost Savings: Friendli Container leverages proprietary technologies to reduce inference GPU costs by over 50%, maximizing ROI and delivering exceptional performance.
- Streamlined Billing: Simplify your accounting with consolidated billing. All AWS services, including the Friendli Container add-on, are grouped into a single invoice for easy tracking and budgeting.
- Effortless Subscription: AWS handles subscriptions for you, ensuring minimal administrative overhead.
- Automated Updates: Regular updates to the Friendli Container add-on are automatically applied, keeping your system secure and optimized without manual intervention.
By deploying Friendli Container on AWS EKS, you can quickly and easily enhance your Generative AI workflows with a secure, scalable, and cost-efficient platform that ensures immediate cost savings.
How to Use Friendli Container on AWS EKS
We will walk you through setting up an EKS cluster and deploying Friendli Container, providing the expected output for each step. By the end, you will have a working inference service successfully deployed on your EKS cluster.
1. Prerequisite: Add GPU Node Group to your EKS Cluster
Before proceeding, ensure you have an active AWS EKS cluster. If you haven't created one yet, please follow the AWS EKS documentation to set up your EKS cluster.
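If you prefer the command line, here is a minimal sketch using eksctl; the cluster name and region are placeholder assumptions, so adjust them to your environment.

```shell
# Create a basic EKS cluster ("my-cluster" and the region are placeholders).
eksctl create cluster --name my-cluster --region us-east-1
```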
If you have already added a GPU Node Group to your EKS cluster, you can skip this part.
When selecting the AWS region for your new EKS cluster, the availability of GPU instances is one of the key factors to consider. As of February 2025, Friendli Container supports NVIDIA H100, A100, A10G, and L4 devices. You can check the instance availability here; a CLI check is also sketched after the table below.
| NVIDIA Device | AWS EC2 Instance Types |
| --- | --- |
| H100 | p5.48xlarge |
| A100 | p4d.24xlarge |
| A10G | g5.xlarge, g5.2xlarge, g5.4xlarge, g5.8xlarge, g5.12xlarge, g5.16xlarge, g5.24xlarge, g5.48xlarge |
| L4 | g6.xlarge, g6.2xlarge, g6.4xlarge, g6.8xlarge, g6.12xlarge, g6.16xlarge, g6.24xlarge, g6.48xlarge, gr6.4xlarge, gr6.8xlarge |
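To confirm that a given instance type is actually offered in your target region, you can query the EC2 API. A minimal sketch, assuming g6.2xlarge and us-east-1 as placeholders:

```shell
# List the Availability Zones in us-east-1 that offer g6.2xlarge.
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g6.2xlarge \
  --region us-east-1
```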
If you’re going to use multi-GPU VM instance types, installing the NVIDIA GPU Operator is highly recommended for proper resource management. You can consult the NVIDIA GPU Operator documentation, and an example of installing the GPU Operator using Helm can be found here.
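As a rough sketch, a Helm-based installation typically looks like the following; consult the NVIDIA documentation for current chart versions and options.

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
# into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```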
Now let’s add a GPU node group to your EKS cluster using the console (an equivalent eksctl command is sketched after these steps).
- Open Amazon EKS console and choose the cluster that you want to create a node group in.
- Select the “Compute” tab and click “Add node group”.
- Configure the new node group by entering the name, Node IAM role, and other information. You can click “Create recommended role” to create the IAM role. Click “Next”.
- On the next page, select “Amazon Linux 2023 (x86_64) Nvidia” for AMI type.
- Select the appropriate instance type for the GPU device of your choice.
- Configure the disk size. It should be large enough to hold the model you want to deploy. (For the example in this guide, a disk size of 60 GB is recommended.)
- Configure the desired node group size.
- Go through the rest of the steps, review the changes and click “Create”.
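If you’d rather script this step, here is a minimal eksctl sketch; the cluster name, node group name, region, and sizes are placeholder assumptions.

```shell
# Create a GPU node group named "gpu-l4" backed by g6.2xlarge instances,
# with a 60 GB root volume for model downloads.
eksctl create nodegroup \
  --cluster my-cluster \
  --region us-east-1 \
  --name gpu-l4 \
  --node-type g6.2xlarge \
  --nodes 1 \
  --node-volume-size 60
```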
2. Configure Friendli Container EKS add-on
- Open Amazon EKS console and choose the cluster that you want to configure.
- Select the “Add-ons” tab and click “Get more add-ons”.
- Scroll down and under the section “AWS Marketplace add-ons”, search and check “Friendli Container”, and click “Next”.
- Now you’ll need an active subscription to Friendli Container. The number of license units you need to purchase is determined by the number of GPU devices you want to use for running Friendli Container.
- Click “Next”, review your settings, and click “Create”.
For pricing details, check Friendli Container on AWS Marketplace. For trials, custom offers, and other inquiries, please see here for contact information.
Now you need to allow Kubernetes ServiceAccounts to contact AWS License Manager, so that your Friendli Inference Deployments can be activated properly.
Before you continue, please make sure “Amazon EKS Pod Identity Agent” EKS add-on is installed in your cluster. You can click “Get more add-ons” and enable “Amazon EKS Pod Identity Agent” under the “AWS add-ons” section.
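Alternatively, the agent can be installed from the CLI; a minimal sketch with a placeholder cluster name:

```shell
# Install the EKS Pod Identity Agent add-on on the cluster.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name eks-pod-identity-agent
```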
- Open Amazon EKS console and choose the cluster that you want to configure.
- Select the “Access” tab.
- Under the “Pod Identity associations” section, click “Create”.
The “Create Pod Identity association” page will appear. Now let’s configure the IAM role, Kubernetes namespace, and Kubernetes service account.
- IAM Role
- Click “Create recommended role”.
- In step 1 (Select trusted entity), “EKS - Pod Identity” should be selected for the use case. Leave it as is and click “Next”.
- In step 2 (Add permissions), search for “AWSLicenseManagerConsumptionPolicy” and enable it. Click “Next”.
- In step 3 (Name, review, and create), give the appropriate Role name and click “Create”.
- Go back to the “Create Pod Identity association” page and select the IAM role you just created.
- Kubernetes namespace.
- This is the Kubernetes namespace where you want to create Friendli Inference Deployments. When in doubt, you can use “default”.
- Later on, if you are going to create Friendli Inference Deployments in another namespace, you should create the Pod Identity association for that namespace.
- Kubernetes service account.
- For most cases, this should be “default”.
- Later on, if you are going to configure Friendli Inference Deployments to use custom service accounts, you should create the Pod Identity association for that service account.
Click “Create”, then under the “Pod Identity associations” section, you should be able to see the association you just created.
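The same association can also be created from the CLI. A minimal sketch assuming the role created above; the cluster name, account ID, and role name are placeholders:

```shell
# Associate the IAM role with the "default" service account
# in the "default" namespace (the role ARN is a placeholder).
aws eks create-pod-identity-association \
  --cluster-name my-cluster \
  --namespace default \
  --service-account default \
  --role-arn arn:aws:iam::123456789012:role/FriendliLicenseRole
```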
3. Create Friendli Deployment
You need to be able to use the “kubectl” CLI tool to access your EKS cluster. Consult this guide from AWS for more details.
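In most cases, this comes down to updating your kubeconfig for the cluster; a minimal sketch with placeholder region and cluster name:

```shell
# Point kubectl at your EKS cluster, then verify connectivity.
aws eks update-kubeconfig --region us-east-1 --name my-cluster
kubectl get nodes
```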
To deploy a private or gated model from the HuggingFace model hub, you need to create a HuggingFace access token with “read” permission, then store it in a Kubernetes secret:
```shell
kubectl create secret generic hf-secret --from-literal token=YOUR_TOKEN_HERE
```
FriendliDeployment is a Kubernetes custom resource that lets you easily create Friendli Inference Deployments without configuring low-level Kubernetes resources like pods, services, and deployments.
Below is a sample FriendliDeployment that deploys Meta Llama 3.1 8B on one g6.2xlarge instance.
```yaml
apiVersion: friendli.ai/v1alpha1
kind: FriendliDeployment
metadata:
  namespace: default
  name: friendlideployment-sample
spec:
  model:
    huggingFace:
      repository: meta-llama/Llama-3.1-8B-Instruct
      # "token:" section is not needed if the model is
      # a public one.
      token:
        name: hf-secret
        key: token
  resources:
    nodeSelector:
      # Use the name of the node group you want to use.
      eks.amazonaws.com/nodegroup: gpu-l4
    numGPUs: 1
    requests:
      cpu: '6'
      ephemeral-storage: 30Gi
      memory: 25Gi
    limits:
      cpu: '6'
      ephemeral-storage: 30Gi
      memory: 25Gi
  deploymentStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  service:
    inferencePort: 6000
```
You can modify this YAML file for your use case.
- The “token:” section under spec.model.huggingFace refers to the Kubernetes secret you created for storing the HuggingFace access token. If accessing your model does not require an access token, you can omit the “token:” section entirely.
- In the example above, nodeSelector is “eks.amazonaws.com/nodegroup: gpu-l4”. This assumes that the name of the GPU node group is “gpu-l4”. You need to edit the node selector to match the name of your node group.
- CPU and memory resource requirements are tuned for the g6.2xlarge instance; you may need to edit these values if you use a different instance type.
If your cluster has the NVIDIA GPU Operator installed, you need to put the “nvidia.com/gpu” resource in the “requests:” and “limits:” sections, as GPU nodes will advertise the “nvidia.com/gpu” resource alongside ordinary resources like “cpu” and “memory”. You can then omit “numGPUs” from your FriendliDeployment. Below is the equivalent resources section for a GPU Operator-enabled cluster.
```yaml
resources:
  nodeSelector:
    eks.amazonaws.com/nodegroup: gpu-l4
  requests:
    cpu: '6'
    ephemeral-storage: 30Gi
    memory: 25Gi
    nvidia.com/gpu: '1'
  limits:
    cpu: '6'
    ephemeral-storage: 30Gi
    memory: 25Gi
    nvidia.com/gpu: '1'
```
Save your YAML file as “friendlideployment.yaml”, and execute “kubectl apply -f friendlideployment.yaml”.
```shell
$ kubectl apply -f friendlideployment.yaml
friendlideployment.friendli.ai/friendlideployment-sample created
$ kubectl get pods
NAME                                         READY   STATUS    RESTARTS   AGE
friendlideployment-sample-7d7b877c77-zjgqq   2/2     Running   0          3m18s
$ kubectl get services
NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
friendlideployment-sample   ClusterIP   172.20.95.224   <none>        6000/TCP   18m
kubernetes                  ClusterIP   172.20.0.1      <none>        443/TCP    28h
```
Now you can port-forward to the service to access it from your local machine.
```shell
$ kubectl port-forward svc/friendlideployment-sample 6000
Forwarding from 127.0.0.1:6000 -> 6000
Forwarding from [::1]:6000 -> 6000
```
In another terminal, use the curl tool to send an inference request.
```shell
$ curl http://localhost:6000/v1/completions \
    -H 'Content-Type: application/json' \
    --data-raw '{"prompt": "Hi!", "max_tokens": 10, "stream": false}'
{"choices":[{"finish_reason":"length","index":0,"seed":15349211611234757311,"text":" I'm Alex, and I'm excited to share","tokens":[358,2846,8683,11,323,358,2846,12304,311,4430]}],"id":"cmpl-b2e4b4cba711448c847ab89d763588da","object":"text_completion","usage":{"completion_tokens":10,"prompt_tokens":3,"total_tokens":13}}
```
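Since the request body accepts a “stream” flag, you can presumably request a streamed response the same way; a sketch under that assumption:

```shell
# Request a streamed completion; tokens should arrive incrementally
# rather than in a single JSON response.
curl http://localhost:6000/v1/completions \
  -H 'Content-Type: application/json' \
  --data-raw '{"prompt": "Hi!", "max_tokens": 10, "stream": true}'
```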
For more information about Friendli Container usage, check our documentation and contact us for inquiries.
Cleanup
You can remove the FriendliDeployment using the kubectl CLI tool.
```shell
$ kubectl delete friendlideployment friendlideployment-sample
friendlideployment.friendli.ai "friendlideployment-sample" deleted
```
You may also want to scale down or delete your GPU node group to avoid being charged for unused GPU instances.
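A minimal sketch for scaling the example node group down to zero with eksctl; the cluster and node group names are placeholders.

```shell
# Scale the GPU node group to zero nodes to stop GPU instance charges.
eksctl scale nodegroup \
  --cluster my-cluster \
  --name gpu-l4 \
  --nodes 0 \
  --nodes-min 0
```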
That’s it! You’ve now learned how to effectively utilize Friendli Container on AWS EKS to optimize your LLM inference workflows. If you’d like to explore more, feel free to refer to the detailed guide here. This will help you dive deeper into the deployment process and take full advantage of the benefits AWS EKS has to offer.
Conclusion
The Friendli Container AWS EKS Add-On delivers a high-performance, scalable, and cost-effective solution for deploying AI models in production environments. By leveraging AWS EKS and Friendli Container’s powerful optimizations, you can dramatically reduce inference costs and improve throughput for AI inference workloads.
If you're looking for a completely automated, further optimized solution that goes beyond the Friendli Container Amazon EKS Add-On and handles everything for you, consider exploring Friendli Dedicated Endpoints.
If you have any questions or need support, don't hesitate to reach out to us or consult our documentation.
Written by
FriendliAI Tech & Research