- April 10, 2025
- 3 min read
Unleash Llama 4 on Friendli Dedicated Endpoints

We are excited to announce that Meta’s groundbreaking multimodal AI models—Llama 4 Scout and Llama 4 Maverick—are now integrated into FriendliAI’s Dedicated Endpoints. This powerful combination provides developers and businesses with access to state-of-the-art AI models, paired with the scalability and superior performance of FriendliAI's cutting-edge generative AI infrastructure.
Llama 4 Overview
The Llama 4 series introduces three advanced models: Scout, Maverick, and Behemoth. These auto-regressive language models leverage a Mixture of Experts (MoE) architecture, offering native multimodality through early fusion. With Llama 4, Meta has significantly pushed the boundaries of AI, creating models that seamlessly combine language processing, image understanding, and extended context lengths.
Model | Parameters | Context Length | Languages | Modality | Competing Models |
---|---|---|---|---|---|
Llama 4 Scout | 17B active, 16 experts, 109B total | 10M tokens | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese | Image-Text-to-Text | Gemma 3, Gemini 2.0 Flash-Lite, Mistral 3.1 |
Llama 4 Maverick | 17B active, 128 experts, 400B total | 1M tokens | Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese | Image-Text-to-Text | GPT-4o, Gemini 2.0 Flash, DeepSeek v3 |
Llama 4 Behemoth | 288B active, 16 experts, 2T total | N/A | N/A | N/A | GPT-4.5, Claude Sonnet 3.7, Gemini 2.0 Pro |
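The "active vs. total parameters" split in the table above comes from the MoE design: a gating network routes each token to a small subset of experts, so only a fraction of the total parameters runs per token. The following is a minimal top-2 gating sketch in NumPy; the expert count, hidden size, and gating scheme are simplified assumptions for illustration, not Llama 4's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN, NUM_EXPERTS, TOP_K = 8, 4, 2  # toy sizes, far smaller than Llama 4's

# Each expert is a small feed-forward weight matrix; the gate scores experts per token.
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(NUM_EXPERTS)]
gate_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.1

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts; only those experts' weights are used."""
    logits = x @ gate_w                            # (tokens, experts) gating scores
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]  # indices of each token's top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        weights = np.exp(chosen - chosen.max())
        weights /= weights.sum()                   # softmax over the selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((3, HIDDEN))
y = moe_layer(tokens)
print(y.shape)  # (3, 8)
```

With TOP_K = 2 of 4 experts, only half the expert parameters participate in any single forward pass, which is the same reason Scout activates 17B of its 109B total parameters.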
Llama 4 models are post-trained with lightweight Supervised Fine-Tuning (SFT), online Reinforcement Learning (RL), and lightweight Direct Preference Optimization (DPO), enhancing their capabilities across a wide variety of tasks. One of the most noteworthy advancements is the interleaved Rotary Position Embedding (iRoPE) architecture, which alternates between standard RoPE layers and layers with No Positional Encoding (NoPE), combined with chunked attention processing to reduce the quadratic cost of the attention mechanism. This approach could potentially enable unbounded context lengths in the future, allowing Llama 4 to handle even more complex tasks over large inputs—it already gives Llama 4 Scout its unprecedented 10M-token context length.
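The interleaving idea can be sketched in a few lines of NumPy. In the sketch below, the interleave ratio (`nope_every`), head dimension, and frequency base are illustrative assumptions, not Llama 4's published configuration:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention_scores(q, k, layer_idx, nope_every=4):
    """Every `nope_every`-th layer is a NoPE layer: q/k are used without
    positional encoding, so its attention is position-agnostic."""
    if layer_idx % nope_every != nope_every - 1:   # RoPE layer
        pos = np.arange(q.shape[0], dtype=float)
        q, k = rope(q, pos), rope(k, pos)
    return q @ k.T / np.sqrt(q.shape[1])

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
k = rng.standard_normal((5, 8))
s_rope = attention_scores(q, k, layer_idx=0)  # RoPE applied
s_nope = attention_scores(q, k, layer_idx=3)  # NoPE layer, raw q/k
print(s_rope.shape, s_nope.shape)  # (5, 5) (5, 5)
```

Because RoPE is a pure rotation, it preserves vector norms and encodes only relative position into the attention scores; the NoPE layers are what lift the hard dependence on absolute position that limits extrapolation to very long contexts.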
Meta has also made significant strides in reducing bias in Llama 4 models by refining training techniques that promote neutrality and fairness, providing more accurate and less harmful responses.
However, due to the immense parameter sizes of these models, deploying and scaling them can be challenging. This is where FriendliAI comes in.
Friendli Dedicated Endpoints
FriendliAI offers flexible, powerful environments for deploying and scaling AI models. Our Dedicated Endpoints provide efficient, cost-effective ways to leverage cutting-edge AI technologies, optimized for both performance and accuracy. Powered by Friendli Inference technology, these endpoints are designed to support demanding AI workloads while maintaining stability.
For businesses with specific performance requirements, Dedicated Endpoints offer exclusive access to GPU instances. This isolation ensures high-performance model inference, providing superior cost-efficiency even amid fluctuating demand.
Key Benefits of Using Llama 4 on FriendliAI:
- **Cost Efficiency**: Run Llama 4 Scout on just 2 H100 GPUs while maintaining its accuracy. Soon, you'll be able to run Llama 4 Scout on a single H100 GPU and Llama 4 Maverick on only 4 H100 GPUs.
- **Superior Performance**: Experience low Time to First Token (TTFT), fast Time Per Output Token (TPOT), and high throughput, ensuring that your models perform at their best, even under heavy loads.
- **Stability**: FriendliAI provides a robust, consistent, and reliable environment, minimizing downtime and maintaining model performance even with heavy workloads.
- **One-Click Deployment from Hugging Face**: Thanks to seamless integration with Hugging Face, you can deploy custom models from Hugging Face directly to Friendli Dedicated Endpoints with a single click, streamlining deployment and significantly reducing setup time.
Getting Started
Deploying to Friendli Dedicated Endpoints from Hugging Face
- Select “Friendli Endpoints” from the “Deploy” tab on the Hugging Face model page.
- Click “Deploy now” to deploy models to Friendli Dedicated Endpoints and start using Llama 4.
That’s it: your endpoint is live and ready to serve Llama 4.
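Once the endpoint is live, you can query it over an OpenAI-compatible chat completions API. The sketch below is a hypothetical example using only the Python standard library: the base URL, endpoint ID, and token variable are placeholders; substitute the actual values shown in your Friendli dashboard.

```python
import json
import os
import urllib.request

# Placeholders: take these values from your Friendli dashboard.
BASE_URL = "https://api.friendli.ai/dedicated/v1"  # assumed OpenAI-compatible base URL
ENDPOINT_ID = "YOUR_ENDPOINT_ID"                   # ID of your deployed Llama 4 endpoint
TOKEN = os.environ.get("FRIENDLI_TOKEN")           # your personal access token

payload = {
    "model": ENDPOINT_ID,
    "messages": [{"role": "user", "content": "Summarize Llama 4 in one sentence."}],
    "max_tokens": 128,
}

if TOKEN:  # only send the request when a token is configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the request body follows the familiar chat completions shape, existing OpenAI-compatible client code should work by pointing it at the endpoint's base URL and model ID.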
With Llama 4 now available on FriendliAI’s Dedicated Endpoints, businesses and developers can tap into the full potential of these state-of-the-art AI models. Whether you're building the next breakthrough application or enhancing existing systems, Llama 4 provides the performance and flexibility you need.
Experience the future of AI with FriendliAI and unleash the true potential of Llama 4 today. Get started now to accelerate your generative AI.
Written by
FriendliAI Tech & Research