  • February 20, 2024
  • 3 min read

Serving Performances of Mixtral 8x7B, a Mixture of Experts (MoE) Model


Mixtral, an innovative language model trained by Mistral AI, takes efficiency to a whole new level. Built upon the foundation of Mistral 7B, it incorporates a technique called Mixture of Experts (MoE) to pack the power of 8 "expert" models into one. We featured the Mixtral model in a previous blog article; in this article, we evaluate its actual serving performance in terms of latency and throughput.

How Mixture of Experts (MoE) Works

MoE replaces some of the feed-forward layers with sparse MoE layers. Instead of a single dense feed-forward module, the model holds many expert modules plus a "router" that assigns each piece of information (token) to the experts best suited to handle it. In Mixtral, two of the eight experts are activated for each token, allowing the model to process language roughly as quickly as a 12B-parameter dense model, despite having about 4x as many total parameters! A minimal code sketch of such a layer follows the figure below.

Figure: MoE layer, from the Switch Transformers paper
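
To make the routing concrete, here is a minimal, illustrative sketch of a top-2 MoE feed-forward layer in PyTorch. The names (Expert, Top2MoELayer, d_model, d_ff) are ours for illustration, and the expert is a plain two-layer MLP rather than Mixtral's actual SwiGLU feed-forward block; treat this as a sketch of the idea, not Mixtral's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    # A plain two-layer MLP standing in for the real feed-forward block.
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class Top2MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # the "router" (gate)
        self.top_k = top_k

    def forward(self, x):                            # x: (num_tokens, d_model)
        logits = self.router(x)                      # (num_tokens, num_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e      # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Quick check: 10 tokens of width 32 go in and come out with the same shape,
# but each token only passed through 2 of the 8 experts.
layer = Top2MoELayer(d_model=32, d_ff=128)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 32])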

But what's MoE really about?

  • Efficient Pre-training: MoE can be trained faster and more efficiently compared to traditional dense models, allowing for building larger and more powerful models within the same budget.
  • Faster Inference: Despite having many parameters, MoE only uses a portion of its parameters during inference, making it significantly faster than similar-sized dense models.

Challenges of MoE

  • Training: MoE models can be tricky to generalize across tasks and are prone to overfitting during fine-tuning. Striking a balance between efficient pre-training and robust fine-tuning is key.
  • Memory Requirements: Although MoE activates only a fraction of its parameters during inference, all of them must still be stored in GPU memory, which demands a large amount of GPU memory. This can be a limitation for models with a large number of parameters.

The Model Size of the Mixtral Model and Memory Requirements

Mixtral-8x7B might sound like an ensemble of eight 7B-parameter models, but it's not! Only specific layers (i.e., the feed-forward blocks) branch off into multiple experts, while the rest are shared, so the model has about 45B parameters in total, not 56B.
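
To see why the memory requirement remains demanding even though only two experts run per token, here is a quick back-of-the-envelope estimate using the figures above (45B total parameters, 2 bytes per parameter in half precision). The exact footprint depends on the serving engine, so treat this as a rough sketch rather than a measured number.

# Rough memory math with the numbers quoted above (an estimate, not a measurement).
TOTAL_PARAMS = 45e9     # every expert must sit in GPU memory, even if only 2 run per token
BYTES_PER_PARAM = 2     # fp16 / bf16 weights

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~90 GB, more than a single 80GB A100
# The KV cache, activations, and runtime buffers come on top of this, which is
# why the benchmark below runs on 4x NVIDIA A100 80GB GPUs.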

Beyond Parameters: Serving Performance Matters

Just like any model, Mixtral's real-world performance depends on how it's served. While MoE itself improves efficiency, ensuring smooth service level objectives (SLOs) through efficient serving engines is crucial for optimal user experience and cost-effectiveness.

Evaluation Results for Serving the Mixtral Model

This section evaluates the serving performance of the Mixtral 8x7B instruct model using the Databricks Dolly dataset on 4 NVIDIA A100 80GB GPUs. We focus on two key metrics:

  • 90th-percentile (p90) Latency: The latency under which 90% of the requests sent to the serving engine complete. Lower latency means faster responses.
  • Achieved Throughput: The number of requests the serving engine can handle given a specific latency requirement (i.e., the SLO). Higher throughput means it can handle more requests. A small sketch of how both metrics can be computed from a load test follows this list.
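
As referenced above, here is a minimal sketch of how these two metrics can be computed from a load test. The function names and the load-test harness are hypothetical and only mirror the metric definitions; they are not the exact benchmark setup used here.

import statistics

def p90_latency(latencies):
    # 90th-percentile latency: 90% of requests completed at least this fast.
    return statistics.quantiles(latencies, n=10)[-1]

def achieved_throughput(latencies, duration_s, slo_s):
    # Requests served per second while the p90 latency stays within the SLO;
    # a load level that violates the SLO does not count as achieved throughput.
    return len(latencies) / duration_s if p90_latency(latencies) <= slo_s else 0.0

# Example usage with a hypothetical load generator:
# lats = run_load_test(requests_per_sec=10, duration_s=60)  # hypothetical helper
# print(p90_latency(lats), achieved_throughput(lats, duration_s=60, slo_s=2.0))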

We compare the performance of Mixtral served on the Friendli Engine with a baseline vLLM system.

Figure: p90 latency comparison, Mixtral-8x7B-Instruct-v0.1 on NVIDIA A100 80GB GPUs, Databricks Dolly dataset

Latency Highlights:

  • Up to 31.6x lower (i.e., faster) latency under high load (10N): Mixtral with Friendli Engine significantly outperforms the baseline under heavy workloads.
  • Consistent performance: Friendli Engine delivers faster response times across all tested load levels, making better use of each GPU.

Figure: Achieved throughput comparison, Mixtral-8x7B-Instruct-v0.1 on NVIDIA A100 80GB GPUs, Databricks Dolly dataset

Throughput Highlights:

  • Up to 7.6x higher throughput: As with p90 latency, Friendli Engine delivers significant throughput improvements, consistently handling more tokens than the baseline across all latency requirements.

In summary, the Mixtral 8x7B instruct model served on the Friendli Engine demonstrates significantly faster response times and token generation throughput compared to the baseline vLLM system, especially under high load conditions. This makes Mixtral and the Friendli Engine a compelling option for applications requiring real-time text generation.

In essence, Mixtral served on the Friendli Engine represents a significant step forward in language modeling, offering both speed and efficiency through the innovative MoE approach. Understanding the benefits and challenges of serving the model paves the way for further advancements in this exciting field.

Ready to Unleash the Power of Your LLM? Experience Friendli Engine's performance! We offer three options to suit your preferences:

Visit https://friendli.ai/try-friendli/ to begin your journey into the world of high-performance LLM serving with the Friendli Engine!


Written by


FriendliAI Tech & Research

