January 24, 2024
3 min read

Faster and Cheaper Mixtral 8×7B on Friendli Serverless Endpoints

The world of large language models (LLMs) is teeming with innovation, and the latest offering from Mistral AI, the Mixtral model, is gaining focus due to its groundbreaking approach. But what exactly makes Mixtral stand out in the ever-growing crowd of AI models? Let's delve into its core principles and impressive performance metrics to understand why it deserves your attention, and on how to actually try out the model for yourself.

MoE: The Secret Sauce Behind Mixtral

At its heart, Mixtral leverages a powerful technique called Mixture of Experts (MoE). Unlike traditional monolithic models, MoE divides its processing power into 8 distinct groups of parameters, called "experts". At each layer, for each token, Mixtral employs a router network that intelligently selects two of these experts to handle the token's information. Their outputs are then combined additively, resulting in a richer and more nuanced understanding of the input.

This clever approach delivers two significant advantages:

Boosted Parameter Power: The total parameter count of Mixtral reaches 46.7 billion. However, the MoE architecture ensures that only a fraction of these parameters are actually used per token, making it computationally efficient. Concretely, Mixtral utilizes just 12.9 billion parameters per token, resulting in processing speeds and costs equivalent to a 12.9 billion model. In essence, you get the benefits of a massive model without the associated hefty hardware demands.
Cost-Effectiveness and Latency Reduction: By employing only a fraction of its parameters per token, Mixtral delivers astoundingly fast inference speeds. Compared to the widely-used Llama 2 70B model, Mixtral boasts 6x faster inference while consistently matching or surpassing its performance on various benchmarks.

Mixtral's Performance

The success of any model hinges on its actual capabilities, and Mixtral shows good performances. Here’s a glimpse of its impressive achievements, taken from its official website.

Comparison of Mixtral to the Llama 2 family and the GPT-3.5 base model-FriendliAI

Fig. 1: Comparison of Mixtral to the Llama 2 family and the GPT-3.5 base model, presented at the official blog post.

Benchmark Results: Mixtral shines across numerous benchmarks, matching or outperforming both Llama 2 70B and GPT-3.5 on standard language tasks. This dominance extends to gracefully handling large contexts of 32k tokens, demonstrating its fluency and comprehension abilities.
Multilingualism: Mixtral isn't limited to English; it also operates in French, Italian, German, and Spanish, making it a versatile tool for diverse language requirements.
Code Generation: Mixtral exhibits outstanding performance in code generation tasks, offering developers a powerful ally in their creative endeavors.
Instruction Tasks: When fine-tuned for instruction-following tasks, Mixtral excels, achieving an impressive score of 8.3 on MT-Bench, highlighting its ability to accurately interpret and execute commands.

The quality versus inference budget tradeoff. Mistral 7B and Mixtral 8x7B belong to a family of highly efficient models compared to Llama 2 models-FriendliAI

Fig. 2: The quality versus inference budget tradeoff. Mistral 7B and Mixtral 8x7B belong to a family of highly efficient models compared to Llama 2 models, presented at the official blog post.

Unleashing Mixtral's Potential with Friendli

Mixtral's exceptional performance is further amplified by its exceptionally affordable pricing through the Friendli Inference. At just $0.4 per 1 million tokens, Friendli Serverless Endpoints stands as the cheapest option in the market for serving Mixtral. Any of its similar counterparts incur significantly higher costs, with GPT-4 reaching up to more than 250 times the price for the same number of tokens.

Experience Mixtral's Power Today!

Mixtral is readily available and accessible on Friendli Serverless Endpoints. Embrace the cutting-edge technology of MoE and witness the power of Mixtral at the cheapest price and fastest performance in the market. Visit Friendli Serverless Endpoints and unlock the full potential of this remarkable language model!

Cheapest Mixtral in the market: Pay at least 25% less than other providers and more than 250x less than GPT-4 for the same number of tokens.
Immediate access: Experience Mixtral's power right now through Friendli Serverless Endpoints.

Ready to revolutionize your AI endeavors? Try out Mixtral today on Friendli Serverless Endpoints!

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, contact@friendli.ai or click Contact Sales — our experts (not a bot) will reply within one business day.

February 2, 2024
4 min read

Grouped Query Attention (GQA) vs. Multi Head Attention (MHA): LLM Inference Serving Acceleration

Attention

LLM Inference

Benchmarks

January 12, 2024
3 min read

LLM Serving Engine Comparative Analysis: Friendli Inference vs. vLLM vs. TensorRT-LLM

Llama