- August 22, 2024
- 7 min read
Experience Meta Llama 3.1’s Outstanding Performance on Friendli
We are pleased to share that Meta’s Llama 3.1 large language models (LLMs) are available on the FriendliAI platform. Our platform streamlines access to these open-source models, enabling users to efficiently leverage advanced generative AI.
On Friendli Suite, users can now enjoy high inference performance with all Llama 3.1 models: 8B, 70B, and 405B. In fact, you can generate over 100 tokens per second for Llama 3.1 70B on Friendli Serverless Endpoints!
Llama 3.1 represents a significant leap in open-source LLM performance, rivaling state-of-the-art closed-source models such as OpenAI’s GPT-4 and GPT-4o and Anthropic’s Claude 3.5 Sonnet. This next-generation model family showcases improved tool use, complex reasoning, multilingual capabilities, and increased context lengths.
The 8 billion and 70 billion parameter versions of Llama 3.1 are available through Friendli Serverless Endpoints. They can also be served for inference at scale or fine-tuned via Friendli Dedicated Endpoints. These models open new frontiers in agentic systems, distillation, synthetic data generation, and beyond!
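As a quick illustration, the sketch below calls a Llama 3.1 model on Friendli Serverless Endpoints through an OpenAI-compatible client. The base URL, model identifier, and token variable name are illustrative assumptions; refer to your Friendli Suite account and documentation for the exact values.

```python
# Minimal sketch: querying Llama 3.1 on Friendli Serverless Endpoints.
# The base URL, model name, and env var below are illustrative assumptions;
# check the Friendli Suite docs and create a personal access token first.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key=os.environ["FRIENDLI_TOKEN"],               # assumed token variable
)

response = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key improvements in Llama 3.1."},
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)
```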
Key advantages of Llama 3.1 include:
Competing with the Best Closed-Source Models
Llama 3.1 models excel in performance benchmarks, matching or surpassing the leading closed-source models. Below are popular benchmarks used to measure their performance, along with how each of the Llama 3.1 8B, 70B, and 405B models compares to competing models.
- MMLU Benchmark (0-shot, CoT): Measures general understanding and multitask capabilities.
- Llama 3.1 8B: 73.0, rivaling Gemma 2 9B IT (72.3, 5-shot non-CoT)
- Llama 3.1 70B: 86.0, surpassing Mixtral 8x22B Instruct (79.9) and GPT 3.5 Turbo (69.8)
- Llama 3.1 405B: 88.6, on par with GPT-4 (85.4), Claude 3.5 Sonnet (88.3), and GPT-4o (88.7)
- Math GSM8K Benchmark (8-shot, CoT): Evaluates math problem-solving skills.
- Llama 3.1 8B: 84.5, surpassing Gemma 2 9B IT (76.7)
- Llama 3.1 70B: 95.1, outperforming Mixtral 8x22B Instruct (88.2) and GPT 3.5 Turbo (81.6)
- Llama 3.1 405B: 96.8, on par with GPT-4o (96.1) and Claude 3.5 Sonnet (96.4, 0-shot)
State-of-the-Art Tool Use Including Multi-Step Reasoning
Llama 3.1 showcases accurate tool use with multi-step reasoning, outperforming many competitors. Below are the relevant benchmarks, in the same format as above; a tool-calling sketch follows the list.
- BFCL Benchmark: Assesses parallel multiple tool calling.
- Llama 3.1 8B: 76.1, ahead of Mistral 7B Instruct (60.4)
- Llama 3.1 70B: 84.8, comparable to GPT 3.5 Turbo (85.9)
- Llama 3.1 405B: 88.5, on par with GPT-4 (88.3) and Claude 3.5 Sonnet (90.2)
- Nexus Benchmark: Evaluates nested tool calling.
- Llama 3.1 8B: 38.5, outperforming Gemma 2 9B IT (30.0) and Mistral 7B Instruct (24.7)
- Llama 3.1 70B: 56.7, exceeding Mixtral 8x22B Instruct (48.5) and GPT 3.5 Turbo (37.2)
- Llama 3.1 405B: 58.7, beating GPT-4o (56.1)
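To give a concrete sense of how this tool-calling ability can be used, the sketch below defines a single hypothetical `get_weather` tool through an OpenAI-compatible client and lets the model decide whether to call it. The endpoint URL, model identifier, and the tool itself are assumptions for illustration only.

```python
# Minimal tool-calling sketch against an OpenAI-compatible endpoint.
# Endpoint URL, model name, and the get_weather tool are illustrative assumptions.
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key=os.environ["FRIENDLI_TOKEN"],               # assumed token variable
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Seoul right now?"}],
    tools=tools,
)

# If the model decides to call the tool, inspect the structured call it produced.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```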
Expanded Context Length to 128K and Support Across Eight Languages
Llama 3.1 extends its context length to 128K tokens, enabling extensive document processing, and supports eight languages.
- Multilingual MGSM Benchmark: Measures multilingual capabilities.
- Llama 3.1 8B: 68.9, surpassing Gemma 2 9B IT (53.2) and Mistral 7B Instruct (29.9)
- Llama 3.1 70B: 86.9, outperforming Mixtral 8x22B Instruct (71.1) and GPT 3.5 Turbo (51.4)
- Llama 3.1 405B: 91.6, ahead of GPT-4o (90.5) and on par with Claude 3.5 Sonnet (91.6)
Overall Stronger Reasoning Capabilities
Llama 3.1 demonstrates superior reasoning abilities, excelling in various reasoning benchmarks.
- ARC Challenge Benchmark (0-shot): Assesses advanced reasoning capabilities.
- Llama 3.1 8B: 83.4, close behind Gemma 2 9B IT (87.6)
- Llama 3.1 70B: 94.8, surpassing Mixtral 8x22B Instruct (88.7) and GPT 3.5 Turbo (83.7)
- Llama 3.1 405B: 96.9, on par with GPT-4o (96.7) and Claude 3.5 Sonnet (96.7)
Supporting Advanced Use Cases
Llama 3.1 supports a wide array of advanced applications, including long-form text summarization, multilingual conversational agents, and coding assistants; a streaming usage sketch follows the benchmark below.
- Code HumanEval Benchmark (0-shot): Evaluates code generation and understanding.
- Llama 3.1 8B: 72.6, outperforming Gemma 2 9B IT (54.3)
- Llama 3.1 70B: 80.5, surpassing Mixtral 8x22B Instruct (75.6) and GPT 3.5 Turbo (68.0)
- Llama 3.1 405B: 89.0, comparable to GPT-4o (90.2) and Claude 3.5 Sonnet (92.0), and exceeding GPT-4 (86.6)
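For interactive workloads such as a coding assistant, responses can be streamed token by token. The sketch below shows minimal streaming through an OpenAI-compatible client; as before, the endpoint URL and model identifier are illustrative assumptions.

```python
# Minimal streaming sketch for an assistant-style workload (e.g. a coding helper).
# Endpoint URL and model name are illustrative assumptions.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key=os.environ["FRIENDLI_TOKEN"],               # assumed token variable
)

stream = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    stream=True,  # stream tokens as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```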
All of the evaluation numbers above are referenced from the model evaluation tables in Meta’s blog post, “Introducing Llama 3.1: Our most capable models to date”.