- August 22, 2024
- 7 min read
Experience Meta Llama 3.1’s Outstanding Performance on Friendli
We are pleased to share that Meta’s Llama 3.1 large language models (LLMs) are available on the FriendliAI platform. Our platform streamlines access to these open-source models, enabling users to efficiently leverage advanced generative AI.
On Friendli Suite, users can now enjoy high inference performance with all Llama 3.1 models: 8B, 70B, and 405B. In fact, you can generate over 100 tokens per second for Llama 3.1 70B on Friendli Serverless Endpoints!
Llama 3.1 represents a significant leap in open-source LLM performance, rivaling state-of-the-art closed-source models such as OpenAI’s GPT-4 and GPT-4o and Anthropic’s Claude 3.5 Sonnet. This next-generation model family showcases improved tool use, complex reasoning, multilingual capabilities, and increased context lengths.
The 8 billion and 70 billion parameter versions of Llama 3.1 are available through Friendli Serverless Endpoints. They can also be served for inference at scale or fine-tuned via Friendli Dedicated Endpoints. These models open new frontiers in agentic systems, distillation, synthetic data generation, and beyond!
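As a quick illustration, the sketch below calls a Llama 3.1 model on Friendli Serverless Endpoints through an OpenAI-compatible client. The base URL, model identifier, and token variable name are illustrative assumptions; refer to your Friendli Suite account and documentation for the exact values.

```python
# Minimal sketch: querying Llama 3.1 on Friendli Serverless Endpoints.
# The base URL, model name, and env var below are illustrative assumptions;
# check the Friendli Suite docs and create a personal access token first.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key=os.environ["FRIENDLI_TOKEN"],               # assumed token variable
)

response = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key improvements in Llama 3.1."},
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)
```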
Key advantages of Llama 3.1 include:
Competing with the Best Closed-Source Models
Llama 3.1 models excel in performance benchmarks, matching or surpassing the leading closed-source models. Below are popular benchmarks used to measure their performance, along with how each of the Llama 3.1 8B, 70B, and 405B models compares to competing models.
- MMLU Benchmark (0-shot, CoT): Measures general understanding and multitask capabilities.
- Llama 3.1 8B: 73.0, rivaling Gemma 2 9B IT (72.3, 5-shot non-CoT)
- Llama 3.1 70B: 86.0, surpassing Mixtral 8x22B Instruct (79.9) and GPT 3.5 Turbo (69.8)
- Llama 3.1 405B: 88.6, on par with GPT-4 (85.4), Claude 3.5 Sonnet (88.3), and GPT-4o (88.7)
- Math GSM8K Benchmark (8-shot, CoT): Evaluates math problem-solving skills.
- Llama 3.1 8B: 84.5, surpassing Gemma 2 9B IT (76.7)
- Llama 3.1 70B: 95.1, outperforming Mixtral 8x22B Instruct (88.2) and GPT 3.5 Turbo (81.6)
- Llama 3.1 405B: 96.8, on par with GPT-4o (96.1) and Claude 3.5 Sonnet (96.4, 0-shot)
State-of-the-Art Tool Use Including Multi-Step Reasoning
Llama 3.1 showcases accurate tool use with multi-step reasoning, outperforming many competitors. Below are the relevant benchmarks, in the same format as above; a tool-calling sketch follows the list.
- BFCL Benchmark: Assesses parallel multiple tool calling.
- Llama 3.1 8B: 76.1, ahead of Mistral 7B Instruct (60.4)
- Llama 3.1 70B: 84.8, comparable to GPT 3.5 Turbo (85.9)
- Llama 3.1 405B: 88.5, on par with GPT-4 (88.3) and Claude 3.5 Sonnet (90.2)
- Nexus Benchmark: Evaluates nested tool calling.
- Llama 3.1 8B: 38.5, outperforming Gemma 2 9B IT (30.0) and Mistral 7B Instruct (24.7)
- Llama 3.1 70B: 56.7, exceeding Mixtral 8x22B Instruct (48.5) and GPT 3.5 Turbo (37.2)
- Llama 3.1 405B: 58.7, beating GPT-4o (56.1)
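To give a concrete sense of how this tool-calling ability can be used, the sketch below defines a single hypothetical `get_weather` tool through an OpenAI-compatible client and lets the model decide whether to call it. The endpoint URL, model identifier, and the tool itself are assumptions for illustration only.

```python
# Minimal tool-calling sketch against an OpenAI-compatible endpoint.
# Endpoint URL, model name, and the get_weather tool are illustrative assumptions.
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key=os.environ["FRIENDLI_TOKEN"],               # assumed token variable
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Seoul right now?"}],
    tools=tools,
)

# If the model decides to call the tool, inspect the structured call it produced.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```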
Expanded Context Length to 128K and Support Across Eight Languages
Llama 3.1 extends its context length to 128K tokens, enabling extensive document processing, and supports eight languages.
- Multilingual MGSM Benchmark: Measures multilingual capabilities.
- Llama 3.1 8B: 68.9, surpassing Gemma 2 9B IT (53.2) and Mistral 7B Instruct (29.9)
- Llama 3.1 70B: 86.9, outperforming Mixtral 8x22B Instruct (71.1) and GPT 3.5 Turbo (51.4)
- Llama 3.1 405B: 91.6, ahead of GPT-4o (90.5) and on par with Claude 3.5 Sonnet (91.6)
Overall Stronger Reasoning Capabilities
Llama 3.1 demonstrates superior reasoning abilities, excelling in various reasoning benchmarks.
- ARC Challenge Benchmark (0-shot): Assesses advanced reasoning capabilities.
- Llama 3.1 8B: 83.4, close behind Gemma 2 9B IT (87.6)
- Llama 3.1 70B: 94.8, surpassing Mixtral 8x22B Instruct (88.7) and GPT 3.5 Turbo (83.7)
- Llama 3.1 405B: 96.9, on par with GPT-4o (96.7) and Claude 3.5 Sonnet (96.7)
Supporting Advanced Use Cases
Llama 3.1 supports a wide array of advanced applications, including long-form text summarization, multilingual conversational agents, and coding assistants; a streaming usage sketch follows the benchmark below.
- Code HumanEval Benchmark (0-shot): Evaluates code generation and understanding.
- Llama 3.1 8B: 72.6, outperforming Gemma 2 9B IT (54.3)
- Llama 3.1 70B: 80.5, surpassing Mixtral 8x22B Instruct (75.6) and GPT 3.5 Turbo (68.0)
- Llama 3.1 405B: 89.0, comparable to GPT-4o (90.2) and Claude 3.5 Sonnet (92.0), and exceeding GPT-4 (86.6)
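For interactive workloads such as a coding assistant, responses can be streamed token by token. The sketch below shows minimal streaming through an OpenAI-compatible client; as before, the endpoint URL and model identifier are illustrative assumptions.

```python
# Minimal streaming sketch for an assistant-style workload (e.g. a coding helper).
# Endpoint URL and model name are illustrative assumptions.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key=os.environ["FRIENDLI_TOKEN"],               # assumed token variable
)

stream = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    stream=True,  # stream tokens as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```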
All of the evaluation numbers above are referenced from the model evaluation tables in Meta’s blog post, “Introducing Llama 3.1: Our most capable models to date”.