- August 22, 2024
- 5 min read
Experience Meta Llama 3.1’s Outstanding Performance on Friendli

We are pleased to share that Meta’s Llama 3.1 large language models (LLMs) are available on the FriendliAI platform. Our platform streamlines access to these open-source models, enabling users to efficiently leverage advanced generative AI.
On Friendli Suite, users can now enjoy high inference performance with all Llama 3.1 models: 8B, 70B, and 405B. In fact, you can generate over 100 tokens per second for Llama 3.1 70B on Friendli Serverless Endpoints!
Llama 3.1 represents a significant leap in open-source LLM performance, rivaling state-of-the-art closed-source models such as OpenAI’s GPT-4 and GPT-4o, and Anthropic’s Claude 3.5 Sonnet. This next-generation model family showcases improved tool use, complex reasoning, multilingual capabilities, and increased context lengths.
The 8 billion and 70 billion parameter versions of Llama 3.1 are available through Friendli Serverless Endpoints. They can also be inferenced at scale or fine-tuned via Friendli Dedicated Endpoints. These models open new frontiers in agentic systems, distillation, synthetic data generation, and beyond!
Key advantages of Llama 3.1 include:
Competing with the Best Closed-Source Models
Llama 3.1 models excel on performance benchmarks, matching or surpassing the leading closed-source models. Below are popular benchmarks for measuring their performance, with a short comparison of each Llama 3.1 model (8B, 70B, and 405B) against its competitors.
- MMLU Benchmark (0-shot, CoT): Measures general understanding and multitask capabilities.
- Llama 3.1 8B: 73.0, rivaling Gemma 2 9B IT (72.3, 5-shot, non-CoT)
- Llama 3.1 70B: 86.0, surpassing Mixtral 8x22B Instruct (79.9) and GPT 3.5 Turbo (69.8)
- Llama 3.1 405B: 88.6, on par with GPT-4 (85.4), Claude 3.5 Sonnet (88.3), and GPT-4o (88.7)
- Math GSM8K Benchmark (8-shot, CoT): Evaluates math problem-solving skills.
- Llama 3.1 8B: 84.5, surpassing Gemma 2 9B IT (76.7)
- Llama 3.1 70B: 95.1, outperforming Mixtral 8x22B Instruct (88.2) and GPT 3.5 Turbo (81.6)
- Llama 3.1 405B: 96.8, on par with GPT-4o (96.1) and Claude 3.5 Sonnet (96.4, 0-shot)
State-of-the-Art Tool Use Including Multi-Step Reasoning
Llama 3.1 showcases accurate tool use with multi-step reasoning, outperforming many competitors. Below is the list of popular benchmarks, following the same format as above.
- BFCL Benchmark: Assesses parallel multiple tool calling.
- Llama 3.1 8B: 76.1, ahead of Mistral 7B Instruct (60.4)
- Llama 3.1 70B: 84.8, comparable to GPT 3.5 Turbo (85.9)
- Llama 3.1 405B: 88.5, on par with GPT-4 (88.3) and Claude 3.5 Sonnet (90.2)
- Nexus Benchmark: Evaluates nested tool calling.
- Llama 3.1 8B: 38.5, outperforming Gemma 2 9B IT (30.0) and Mistral 7B Instruct (24.7)
- Llama 3.1 70B: 56.7, exceeding Mixtral 8x22B Instruct (48.5) and GPT 3.5 Turbo (37.2)
- Llama 3.1 405B: 58.7, beating GPT-4o (56.1)
Expanded Context Length to 128K and Support Across Eight Languages
Llama 3.1 extends its context length to 128K tokens, facilitating extensive document processing, and supports eight languages.
- Multilingual MGSM Benchmark: Measures multilingual capabilities.
- Llama 3.1 8B: 68.9, surpassing Gemma 2 9B IT (53.2) and Mistral 7B Instruct (29.9)
- Llama 3.1 70B: 86.9, outperforming Mixtral 8x22B Instruct (71.1) and GPT 3.5 Turbo (51.4)
- Llama 3.1 405B: 91.6, ahead of GPT-4o (90.5) and on par with Claude 3.5 Sonnet (91.6)
Overall Stronger Reasoning Capabilities
Llama 3.1 demonstrates superior reasoning abilities, excelling in various reasoning benchmarks.
- ARC Challenge Benchmark (0-shot): Assesses advanced reasoning capabilities.
- Llama 3.1 8B: 83.4, comparable to Gemma 2 9B IT (87.6)
- Llama 3.1 70B: 94.8, surpassing Mixtral 8x22B Instruct (88.7) and GPT 3.5 Turbo (83.7)
- Llama 3.1 405B: 96.9, on par with GPT-4o (96.7) and Claude 3.5 Sonnet (96.7)
Supporting Advanced Use Cases
Llama 3.1 supports a wide array of advanced applications including long-form text summarization, multilingual conversational agents, and coding assistants.
- Code HumanEval Benchmark (0-shot): Evaluates code generation and understanding.
- Llama 3.1 8B: 72.6, outperforming Gemma 2 9B IT (54.3)
- Llama 3.1 70B: 80.5, surpassing Mixtral 8x22B Instruct (75.6) and GPT 3.5 Turbo (68.0)
- Llama 3.1 405B: 89.0, comparable to GPT-4o (90.2) and Claude 3.5 Sonnet (92.0) and exceeding GPT-4 (86.6)
All of the evaluation numbers are referenced from the evaluation tables in Meta’s blog post.
Source: Model evaluations from “Introducing Llama 3.1: Our most capable models to date”
How to Get Started with Friendli Endpoints
Whether you’re a researcher, a developer, or building innovative AI agent projects, Llama 3.1 offers new foundations to build on. To help you realize its potential, Friendli Dedicated Endpoints provide fine-tuning for open-source models like these in addition to deployment: you can swiftly fine-tune a model and instantly serve it for efficient inference at scale. In this blog post, we introduce the quickest way to try the official Llama 3.1 models, using the ready-to-run environment of Friendli Serverless Endpoints on Friendli Suite. To learn more about deploying models on Friendli Dedicated Endpoints, refer to our documentation.
- Sign up to access Friendli Serverless Endpoints on Friendli Suite: Sign up
- Go to Personal Settings > Tokens and create a personal access token by clicking ‘Create new token’.
- Save your created token value.
- Install the `friendli-client` Python package to interact with the Serverless Endpoint for Llama through the Python SDK, by running `pip install friendli-client`.
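The install and token setup above can be done in two shell commands. Note that the `FRIENDLI_TOKEN` environment variable name is an assumption for this sketch; passing the token explicitly to the client (as shown in the Python example below) always works.

```shell
# Install the Friendli Python SDK
pip install friendli-client

# Export the personal access token you created in the Tokens page
export FRIENDLI_TOKEN="<your personal access token>"
```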
- Now initialize the Python client instance and create a response from Llama 3.1.
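A minimal sketch of the client initialization and a chat completion call follows. The model ID `meta-llama-3.1-70b-instruct` is an assumption for illustration; check the Friendli Serverless Endpoints page for the exact identifiers of the available Llama 3.1 models.

```python
import os

# Hypothetical model ID -- verify the exact name in Friendli Suite.
MODEL = "meta-llama-3.1-70b-instruct"

# A standard chat-style request payload.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what makes Llama 3.1 notable, in one sentence."},
]

token = os.environ.get("FRIENDLI_TOKEN")
if token:
    from friendli import Friendli  # installed via `pip install friendli-client`

    # Initialize the client with your personal access token.
    client = Friendli(token=token)

    # Request a completion from the Llama 3.1 serverless endpoint.
    completion = client.chat.completions.create(model=MODEL, messages=messages)

    # Print the generated text; the exact output varies per run.
    print(completion.choices[0].message.content)
```

The printed result is the model’s generated answer and will differ between runs; streaming and other request parameters are covered in the Friendli documentation.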
Three ways to use Llama 3.1 with Friendli Suite:
Friendli Suite offers three ways to leverage the power of the Friendli Inference. Whether you want to run your LLMs on the cloud or on-premises, Friendli’s got you covered.
- Friendli Dedicated Endpoints: Fine-tune and run your generative AI models on dedicated GPUs, conveniently on autopilot.
- Friendli Container: Deploy and serve your models in your GPU environment, whether in the cloud or on-premises, for complete control.
- Friendli Serverless Endpoints: Start instantly with open-source models through our user-friendly API, which has the lowest costs in the market.
We’re excited to put this exceptional AI technology into the hands of our community and can’t wait to see what you create. The future of using generative AI for agentic applications is here. Start building today on Friendli!
Check out our YouTube channel to see more model performances with FriendliAI!
Written by
FriendliAI Tech & Research
Share
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales; our experts (not a bot) will reply within one business day.