Faster and Cheaper Mixtral 8×7B on Friendli Serverless Endpoints

Blog post thumbnail

The world of large language models (LLMs) is teeming with innovation, and the latest offering from Mistral AI, the Mixtral model, is gaining focus due to its groundbreaking approach. But what exactly makes Mixtral stand out in the ever-growing crowd of AI models? Let's delve into its core principles and impressive performance metrics to understand why it deserves your attention, and on how to actually try out the model for yourself.

MoE: The Secret Sauce Behind Mixtral

At its heart, Mixtral leverages a powerful technique called Mixture of Experts (MoE). Unlike traditional monolithic models, MoE divides its processing power into 8 distinct groups of parameters, called "experts". At each layer, for each token, Mixtral employs a router network that intelligently selects two of these experts to handle the token's information. Their outputs are then combined additively, resulting in a richer and more nuanced understanding of the input.

This clever approach delivers two significant advantages:

  • Boosted Parameter Power: The total parameter count of Mixtral reaches 46.7 billion. However, the MoE architecture ensures that only a fraction of these parameters are actually used per token, making it computationally efficient. Concretely, Mixtral utilizes just 12.9 billion parameters per token, resulting in processing speeds and costs equivalent to a 12.9 billion model. In essence, you get the benefits of a massive model without the associated hefty hardware demands.

  • Cost-Effectiveness and Latency Reduction: By employing only a fraction of its parameters per token, Mixtral delivers astoundingly fast inference speeds. Compared to the widely-used Llama 2 70B model, Mixtral boasts 6x faster inference while consistently matching or surpassing its performance on various benchmarks.

Mixtral's Performance

The success of any model hinges on its actual capabilities, and Mixtral shows good performances. Here’s a glimpse of its impressive achievements, taken from its official website.

Fig. 1: Comparison of Mixtral to the Llama 2 family and the GPT-3.5 base model, presented at the official blog post.

  • Benchmark Results: Mixtral shines across numerous benchmarks, matching or outperforming both Llama 2 70B and GPT-3.5 on standard language tasks. This dominance extends to gracefully handling large contexts of 32k tokens, demonstrating its fluency and comprehension abilities.

  • Multilingualism: Mixtral isn't limited to English; it also operates in French, Italian, German, and Spanish, making it a versatile tool for diverse language requirements.

  • Code Generation: Mixtral exhibits outstanding performance in code generation tasks, offering developers a powerful ally in their creative endeavors.

  • Instruction Tasks: When fine-tuned for instruction-following tasks, Mixtral excels, achieving an impressive score of 8.3 on MT-Bench, highlighting its ability to accurately interpret and execute commands.

Fig. 2: The quality versus inference budget tradeoff. Mistral 7B and Mixtral 8x7B belong to a family of highly efficient models compared to Llama 2 models, presented at the official blog post.

Unleashing Mixtral's Potential with Friendli

Mixtral's exceptional performance is further amplified by its exceptionally affordable pricing through the Friendli Engine. At just $0.4 per 1 million tokens, Friendli Serverless Endpoints stands as the cheapest option in the market for serving Mixtral. Any of its similar counterparts incur significantly higher costs, with GPT-4 reaching up to more than 250 times the price for the same number of tokens.

Experience Mixtral's Power Today!

Mixtral is readily available and accessible on Friendli Serverless Endpoints. Embrace the cutting-edge technology of MoE and witness the power of Mixtral at the cheapest price and fastest performance in the market. Visit Friendli Serverless Endpoints and unlock the full potential of this remarkable language model!

  • Cheapest Mixtral in the market: Pay at least 25% less than other providers and more than 250x less than GPT-4 for the same number of tokens.

  • Immediate access: Experience Mixtral's power right now through Friendli Serverless Endpoints.

Ready to revolutionize your AI endeavors? Try out Mixtral today on Friendli Serverless Endpoints!



Share

Related Posts

thumbnail
  • February 2, 2024
  • 4 min read

Grouped Query Attention (GQA) vs. Multi Head Attention (MHA): Optimizing LLM Inference Serving

GQA
MHA
MQA
thumbnail
  • January 12, 2024
  • 3 min read

The LLM Serving Engine Showdown: Friendli Engine Outshines

LLM
Serving Engine
See all from blog