- April 10, 2025
- 3 min read
Unleash Llama 4 on Friendli Dedicated Endpoints

We are excited to announce that Meta’s groundbreaking multimodal AI models—Llama 4 Scout and Llama 4 Maverick—are now integrated into FriendliAI’s Dedicated Endpoints. This powerful combination provides developers and businesses with access to state-of-the-art AI models, paired with the scalability and superior performance of FriendliAI's cutting-edge generative AI infrastructure.
Llama 4 Overview
The Llama 4 series introduces three advanced models: Scout, Maverick, and Behemoth. These auto-regressive language models leverage a Mixture of Experts (MoE) architecture, offering native multimodality through early fusion. With Llama 4, Meta has significantly pushed the boundaries of AI, creating models that seamlessly combine language processing, image understanding, and extended context lengths.
Model | Parameters | Context Length | Languages | Modality | Competing Models |
---|---|---|---|---|---|
Llama 4 Scout | 17B active, 16 experts, 109B total | 10M tokens | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese | Image-Text-to-Text | Gemma 3, Gemini 2.0 Flash-Lite, Mistral 3.1 |
Llama 4 Maverick | 17B active, 128 experts, 400B total | 1M tokens | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese | Image-Text-to-Text | GPT-4o, Gemini 2.0 Flash, DeepSeek v3 |
Llama 4 Behemoth | 288B active, 16 experts, 2T total | N/A | N/A | N/A | GPT-4.5, Claude Sonnet 3.7, Gemini 2.0 Pro |
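The "active vs. total parameters" split in the table above comes from the MoE design: a gating network routes each token to a small subset of experts, so only a fraction of the total parameters runs per token. The following is a minimal top-2 gating sketch in NumPy; the expert count, hidden size, and gating scheme are simplified assumptions for illustration, not Llama 4's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, NUM_EXPERTS, TOP_K = 8, 4, 2  # toy sizes, far smaller than Llama 4's

# Each expert is a small feed-forward weight matrix; the gate scores experts per token.
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(NUM_EXPERTS)]
gate_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts; only those experts' weights are used."""
    logits = x @ gate_w                            # (tokens, experts) gating scores
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]  # indices of each token's top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                   # softmax over the selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((3, HIDDEN))
y = moe_layer(tokens)
print(y.shape)  # (3, 8)
```

With TOP_K = 2 of 4 experts, only half the expert parameters participate in any single forward pass, which is the same reason Scout activates 17B of its 109B total parameters.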
Llama 4 models are post-trained with lightweight Supervised Fine-Tuning (SFT), online Reinforcement Learning (RL), and lightweight Direct Preference Optimization (DPO), enhancing their capabilities across a wide variety of tasks. One of the most noteworthy advancements is the interleaved Rotary Position Embedding (iRoPE) architecture, which alternates between standard RoPE layers and layers with No Positional Encoding (NoPE), combined with chunked attention processing to reduce the quadratic cost of the attention mechanism. This approach could potentially enable unbounded context lengths in the future, allowing Llama 4 to handle even more complex tasks over large inputs—it already gives Llama 4 Scout its unprecedented 10M-token context length.
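The interleaving idea can be sketched in a few lines of NumPy. In the sketch below, the interleave ratio (`nope_every`), head dimension, and frequency base are illustrative assumptions, not Llama 4's published configuration:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention_scores(q, k, layer_idx, nope_every=4):
    """Every `nope_every`-th layer is a NoPE layer: q/k are used without
    positional encoding, so its attention is position-agnostic."""
    if layer_idx % nope_every != nope_every - 1:   # RoPE layer
        pos = np.arange(q.shape[0], dtype=float)
        q, k = rope(q, pos), rope(k, pos)
    return q @ k.T / np.sqrt(q.shape[1])

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
k = rng.standard_normal((5, 8))
s_rope = attention_scores(q, k, layer_idx=0)  # RoPE applied
s_nope = attention_scores(q, k, layer_idx=3)  # NoPE layer, raw q/k
print(s_rope.shape, s_nope.shape)  # (5, 5) (5, 5)
```

Because RoPE is a pure rotation, it preserves vector norms and encodes only relative position into the attention scores; the NoPE layers are what lift the hard dependence on absolute position that limits extrapolation to very long contexts.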
Meta has also made significant strides in reducing bias in Llama 4 models by refining training techniques that promote neutrality and fairness, providing more accurate and less harmful responses.
However, due to the immense parameter sizes of these models, deploying and scaling them can be challenging. This is where FriendliAI comes in.
Friendli Dedicated Endpoints
FriendliAI offers flexible, powerful environments for deploying and scaling AI models. Our Dedicated Endpoints provide efficient, cost-effective ways to leverage cutting-edge AI technologies, optimized for both performance and accuracy. Powered by Friendli Inference technology, these endpoints are designed to support demanding AI workloads while maintaining stability.
For businesses with specific performance requirements, Dedicated Endpoints offer exclusive access to GPU instances. This isolation ensures high-performance model inference, providing superior cost-efficiency even amid fluctuating demand.
Key Benefits of Using Llama 4 on FriendliAI:
- **Cost Efficiency**: Run Llama 4 Scout on just 2 H100 GPUs while maintaining its accuracy. Soon, you'll be able to run Llama 4 Scout on a single H100 GPU and Llama 4 Maverick on only 4 H100 GPUs.
- **Superior Performance**: Experience low Time to First Token (TTFT), fast Time Per Output Token (TPOT), and high throughput, ensuring that your models perform at their best, even under heavy loads.
- **Stability**: FriendliAI provides a robust, consistent, and reliable environment, minimizing downtime and maintaining model performance even with heavy workloads.
- **One-Click Deployment from Hugging Face**: Thanks to seamless integration with Hugging Face, you can deploy custom models from Hugging Face directly to Friendli Dedicated Endpoints with a single click, streamlining deployment and significantly reducing setup time.
Getting Started
Deploying to Friendli Dedicated Endpoints from Hugging Face
- Select “Friendli Endpoints” from the “Deploy” tab on the Hugging Face model page.
- Click “Deploy now” to deploy models to Friendli Dedicated Endpoints and start using Llama 4.
That’s it: your endpoint is live and ready to serve Llama 4.
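Once the endpoint is live, you can query it over an OpenAI-compatible chat completions API. The sketch below is a hypothetical example using only the Python standard library: the base URL, endpoint ID, and token variable are placeholders; substitute the actual values shown in your Friendli dashboard.

```python
import json
import os
import urllib.request

# Placeholders: take these values from your Friendli dashboard.
BASE_URL = "https://api.friendli.ai/dedicated/v1"  # assumed OpenAI-compatible base URL
ENDPOINT_ID = "YOUR_ENDPOINT_ID"                   # ID of your deployed Llama 4 endpoint
TOKEN = os.environ.get("FRIENDLI_TOKEN")           # your personal access token

payload = {
    "model": ENDPOINT_ID,
    "messages": [{"role": "user", "content": "Summarize Llama 4 in one sentence."}],
    "max_tokens": 128,
}

if TOKEN:  # only send the request when a token is configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request body follows the familiar chat completions shape, existing OpenAI-compatible client code should work by pointing it at the endpoint's base URL and model ID.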
With Llama 4 now available on FriendliAI’s Dedicated Endpoints, businesses and developers can tap into the full potential of these state-of-the-art AI models. Whether you're building the next breakthrough application or enhancing existing systems, Llama 4 provides the performance and flexibility you need.
Experience the future of AI with FriendliAI and unleash the true potential of Llama 4 today. Get started now to accelerate your generative AI.
Written by
FriendliAI Tech & Research