Meta Llama 3 now available on Friendli


At FriendliAI, we’re on a mission to democratize access to cutting-edge generative AI models. That’s why we’re thrilled to announce the integration of Meta’s latest Llama 3 large language models (LLMs) into our platform.

Llama 3 represents a breakthrough in open-source LLM performance. This next-generation model family demonstrates state-of-the-art capabilities across a wide range of benchmarks, showcasing improved reasoning, multi-tasking, and few-shot learning abilities.

In line with our commitment to openness, we’ve made the 8 billion and 70 billion parameter versions of Llama 3 available through Friendli. These models unlock new frontiers in language understanding, generation, analysis, and more.

Some key advantages of Llama 3 include:

  • Cutting-edge performance rivaling models an order of magnitude larger
  • Stronger logical reasoning and multi-step problem-solving skills
  • Improved few-shot learning from limited examples
  • Robust handling of long contexts and document understanding

Whether you’re a researcher, a developer, or building innovative AI applications, Llama 3 offers a robust new foundation to build on. On Friendli, you can quickly fine-tune the models or leverage them for inference at scale. FriendliAI is trusted by major players in LLMs like Upstage, ScatterLab, TUNiB, and many more. If you want to be a part of it, sign up for our service → Friendli Dedicated Endpoints | Friendli Container.

  1. To use Llama 3 instantly, sign up to access Friendli Serverless Endpoints: Sign up
  2. Go to User Settings > Tokens and create a personal access token by clicking ‘Create new token’.
  3. Save your created token value.
  4. Install the `friendli-client` Python package to use the Python SDK for interacting with the Serverless Endpoints for Llama 3: run `pip install friendli-client`.
  5. Now initialize the Python client instance as follows:
```python
from friendli import Friendli

client = Friendli(token="YOUR PERSONAL ACCESS TOKEN")
```
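To avoid hard-coding credentials, you can read the token from an environment variable instead. A minimal sketch, assuming you have exported your token under the name `FRIENDLI_TOKEN` (a name used here for illustration):

```python
import os

from friendli import Friendli

# Read the personal access token from the environment instead of source code.
# Assumes you have run: export FRIENDLI_TOKEN="YOUR PERSONAL ACCESS TOKEN"
client = Friendli(token=os.environ["FRIENDLI_TOKEN"])
```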
  6. You can create a response from Llama 3 as follows:
```python
chat_completion = client.chat.completions.create(
    model="meta-llama-3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me how to make a delicious pancake"
        }
    ],
    stream=False,
)
print(chat_completion.choices[0].message.content)
```
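If you want tokens to appear as they are generated rather than waiting for the full reply, the same call supports streaming. A brief sketch, assuming the SDK follows the OpenAI-style streaming interface where each chunk exposes `choices[0].delta.content`:

```python
# Request a streamed response instead of a single completed message.
stream = client.chat.completions.create(
    model="meta-llama-3-70b-instruct",
    messages=[
        {"role": "user", "content": "Tell me how to make a delicious pancake"}
    ],
    stream=True,
)

# Print each incremental piece of the reply as it arrives.
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
```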

How Llama 3 performs on Friendli:

Here is a snapshot of Llama 3 70B, quantized to FP8 by FriendliAI, which is now available to run on Friendli.

[Figure: p90 latency of FP8 Llama 3 70B, Friendli Engine vs. vLLM]

You can also download the FP8 Meta Llama 3 checkpoints from the Hugging Face Hub: https://huggingface.co/FriendliAI
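As an illustration, one way to pull a checkpoint locally is with the `huggingface_hub` library. The repository id below is a placeholder; browse the FriendliAI organization page for the actual FP8 model names:

```python
from huggingface_hub import snapshot_download

# Download an FP8 Llama 3 checkpoint from the FriendliAI organization.
# NOTE: the repo_id below is hypothetical; see https://huggingface.co/FriendliAI
# for the published model names.
local_dir = snapshot_download(repo_id="FriendliAI/Meta-Llama-3-70B-Instruct-fp8")
print(f"Checkpoint files saved to {local_dir}")
```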

Friendli Engine:

Our LLM inference serving engine is the fastest on the market. It is built for serving LLMs, and more broadly generative AI models, with low latency and high throughput. Friendli Engine delivers this performance through optimizations that cover a wide range of use cases.

3 ways to use Llama 3 with Friendli Suite:

Friendli Suite offers three ways to leverage the power of Friendli Engine. Whether you want to run your LLMs on the cloud or on-premises, Friendli has got you covered.

  • Friendli Dedicated Endpoints: Run your generative AI models on dedicated GPUs, conveniently on autopilot.
  • Friendli Container: Deploy and serve your models in your GPU environment, whether in the cloud or on-premises, for complete control.
  • Friendli Serverless Endpoints: Start instantly with open-source models through our user-friendly API, which has the lowest costs in the market.

We’re excited to put this exceptional AI technology into the hands of our community and can’t wait to see what you create. The future of open and capable generative AI is here. Start building today on Friendli!

Check out our YouTube channel to see more model performance showcases from FriendliAI!


