  • August 22, 2024
  • 5 min read

Experience Meta Llama 3.1’s Outstanding Performance on Friendli

We are pleased to share that Meta’s Llama 3.1 large language models (LLMs) are available on the FriendliAI platform. Our platform streamlines access to these open-source models, enabling users to efficiently leverage advanced generative AI.

On Friendli Suite, users can now enjoy high inference performance with all Llama 3.1 models: 8B, 70B, and 405B. In fact, you can generate over 100 tokens per second for Llama 3.1 70B on Friendli Serverless Endpoints!

Llama 3.1 represents a significant leap in open-source LLM performance, rivaling state-of-the-art closed-source models such as OpenAI’s GPT-4 and GPT-4o and Anthropic’s Claude 3.5 Sonnet. This next-generation model family showcases improved tool use, complex reasoning, multilingual capabilities, and increased context lengths.

The 8 billion and 70 billion parameter versions of Llama 3.1 are available through Friendli Serverless Endpoints. They can also be inferenced at scale or fine-tuned via Friendli Dedicated Endpoints. These models open new frontiers in agentic systems, distillation, synthetic data generation, and beyond!

Key advantages of Llama 3.1 include:

Competing with the Best Closed-Source Models

Llama 3.1 models excel in performance benchmarks, matching or surpassing the leading closed-source models. Below are the popular benchmarks used to measure their performance, along with how each of the Llama 3.1 8B, 70B, and 405B models compares to competing models.

  • MMLU Benchmark (0-shot, CoT): Measures general understanding and multitask capabilities.

    • Llama 3.1 8B: 73.0, rivaling Gemma 2 9B IT (72.3, 5-shot, non-CoT)
    • Llama 3.1 70B: 86.0, surpassing Mixtral 8x22B Instruct (79.9) and GPT 3.5 Turbo (69.8)
    • Llama 3.1 405B: 88.6, on par with GPT-4 (85.4), Claude 3.5 Sonnet (88.3), and GPT-4o (88.7)
  • Math GSM8K Benchmark (8-shot, CoT): Evaluates math problem-solving skills.

    • Llama 3.1 8B: 84.5, surpassing Gemma 2 9B IT (76.7)
    • Llama 3.1 70B: 95.1, outperforming Mixtral 8x22B Instruct (88.2) and GPT 3.5 Turbo (81.6)
    • Llama 3.1 405B: 96.8, on par with GPT-4o (96.1) and Claude 3.5 Sonnet (96.4, 0-shot)

State-of-the-Art Tool Use Including Multi-Step Reasoning

Llama 3.1 showcases accurate tool use with multi-step reasoning, outperforming many competitors. Below are the relevant benchmarks, in the same format as above.

  • BFCL Benchmark: Assesses parallel multiple tool calling.

    • Llama 3.1 8B: 76.1, ahead of Mistral 7B Instruct (60.4)
    • Llama 3.1 70B: 84.8, comparable to GPT 3.5 Turbo (85.9)
    • Llama 3.1 405B: 88.5, on par with GPT-4 (88.3) and Claude 3.5 Sonnet (90.2)
  • Nexus Benchmark: Evaluates nested tool calling.

    • Llama 3.1 8B: 38.5, outperforming Gemma 2 9B IT (30.0) and Mistral 7B Instruct (24.7)
    • Llama 3.1 70B: 56.7, exceeding Mixtral 8x22B Instruct (48.5) and GPT 3.5 Turbo (37.2)
    • Llama 3.1 405B: 58.7, beating GPT-4o (56.1)

Expanded Context Length to 128K and Support Across Eight Languages

Llama 3.1 extends its context length to 128K tokens, enabling extensive document processing, and supports eight languages.

  • Multilingual MGSM Benchmark: Measures multilingual capabilities.
    • Llama 3.1 8B: 68.9, surpassing Gemma 2 9B IT (53.2) and Mistral 7B Instruct (29.9)
    • Llama 3.1 70B: 86.9, outperforming Mixtral 8x22B Instruct (71.1) and GPT 3.5 Turbo (51.4)
    • Llama 3.1 405B: 91.6, ahead of GPT-4o (90.5) and on par with Claude 3.5 Sonnet (91.6)

Overall Stronger Reasoning Capabilities

Llama 3.1 demonstrates superior reasoning abilities, excelling in various reasoning benchmarks.

  • ARC Challenge Benchmark (0-shot): Assesses advanced reasoning capabilities.
    • Llama 3.1 8B: 83.4, comparable to Gemma 2 9B IT (87.6)
    • Llama 3.1 70B: 94.8, surpassing Mixtral 8x22B Instruct (88.7) and GPT 3.5 Turbo (83.7)
    • Llama 3.1 405B: 96.9, on par with GPT-4o (96.7) and Claude 3.5 Sonnet (96.7)

Supporting Advanced Use Cases

Llama 3.1 supports a wide array of advanced applications including long-form text summarization, multilingual conversational agents, and coding assistants.

  • Code HumanEval Benchmark (0-shot): Evaluates code generation and understanding.
    • Llama 3.1 8B: 72.6, outperforming Gemma 2 9B IT (54.3)
    • Llama 3.1 70B: 80.5, surpassing Mixtral 8x22B Instruct (75.6) and GPT 3.5 Turbo (68.0)
    • Llama 3.1 405B: 89.0, comparable to GPT-4o (90.2) and Claude 3.5 Sonnet (92.0), and exceeding GPT-4 (86.6)

All of the evaluation numbers above are referenced from Meta’s Llama 3.1 announcement blog.

How to Get Started with Friendli Endpoints

Whether you’re a researcher, a developer, or working on innovative AI agent projects, Llama 3.1 offers new foundations to build on. To realize its potential, we provide fine-tuning for open-source models like these, in addition to model deployment, through Friendli Dedicated Endpoints. With Friendli Dedicated Endpoints, you can swiftly fine-tune models and instantly serve them for efficient inference at scale. In this blog post, we introduce a way to try the official Llama 3.1 models right away through Friendli Serverless Endpoints on Friendli Suite, an environment that is already set up for immediate execution. To learn more about deploying models on Friendli Dedicated Endpoints, refer to our documentation.

  1. Sign up to access Friendli Serverless Endpoints on Friendli Suite.
  2. Go to User Settings > Tokens and create a personal access token by clicking ‘Create new token’.
  3. Save your created token value.
  4. Install the friendli-client Python package to interact with the Serverless Endpoint for Llama through the Python SDK by running `pip install friendli-client`.
  5. Now initialize the Python client instance as follows:
```python
from friendli import Friendli

client = Friendli(token="YOUR PERSONAL ACCESS TOKEN")
```
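
If you prefer not to hard-code the token, you can read it from an environment variable instead. A minimal sketch, assuming you have exported the token beforehand (the variable name FRIENDLI_TOKEN here is our choice for illustration):

```python
import os

from friendli import Friendli

# Assumes the token was exported in your shell beforehand, e.g.
#   export FRIENDLI_TOKEN="YOUR PERSONAL ACCESS TOKEN"
client = Friendli(token=os.environ["FRIENDLI_TOKEN"])
```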
  6. You can create a response from Llama 3.1 as follows:
```python
chat_completion = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me how to make a delicious pancake"
        }
    ],
    stream=False,
)
print(chat_completion.choices[0].message.content)
```

Example Result:

```
Making delicious pancakes is a straightforward process that requires just a few ingredients and some basic cooking skills. Here's a simple recipe to get you started:

Ingredients:

*   1 cup all-purpose flour
*   2 tablespoons sugar
*   2 teaspoons baking powder
*   1/4 teaspoon salt
*   1 cup milk
*   1 large egg
*   2 tablespoons unsalted butter, melted
*   Butter or oil for greasing the pan

Instructions:

1.  In a large bowl, whisk together the flour, sugar, baking powder, and salt.
2.  In a separate bowl, whisk together the milk, egg, and melted butter.
3.  Add the wet ingredients to the dry ingredients and stir until just combined. The batter should still be slightly lumpy.
4.  Heat a non-stick skillet or griddle over medium heat. Grease the pan with butter or oil.
5.  Using a 1/4 cup measuring cup, scoop the batter onto the pan.
```
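
The example above returns the whole completion at once (stream=False). Since Friendli Serverless Endpoints can generate over 100 tokens per second on Llama 3.1 70B, you may prefer to stream tokens as they are generated. Below is a minimal streaming sketch, assuming the SDK exposes the familiar OpenAI-style streaming chunks where incremental text arrives in `choices[0].delta.content` (treat that attribute layout as an assumption):

```python
# Request a streamed completion instead of waiting for the full response.
stream = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "Tell me how to make a delicious pancake"}
    ],
    stream=True,
)

# Print each incremental piece of the answer as it arrives.
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
```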

Three ways to use Llama 3.1 with Friendli Suite

Friendli Suite offers three ways to leverage the power of the Friendli Engine. Whether you want to run your LLMs in the cloud or on-premises, Friendli’s got you covered.

  • Friendli Dedicated Endpoints: Fine-tune and run your generative AI models on dedicated GPUs, conveniently on autopilot.
  • Friendli Container: Deploy and serve your models in your GPU environment, whether in the cloud or on-premises, for complete control.
  • Friendli Serverless Endpoints: Start instantly with open-source models through our user-friendly API, which has the lowest costs in the market.

We’re excited to put this exceptional AI technology into the hands of our community and can’t wait to see what you create. The future of using generative AI for agentic applications is here. Start building today on Friendli!

Check out our YouTube channel to see more model performance showcases from FriendliAI!


Written by

FriendliAI Tech & Research

