  • February 20, 2024
  • 3 min read

Serving Performances of Mixtral 8x7B, a Mixture of Experts (MoE) Model


Mixtral, an innovative language model trained by Mistral AI, takes efficiency to a whole new level. Built upon the foundation of Mistral 7B, it incorporates a technique called Mixture of Experts (MoE) to pack the power of 8 "expert" models into one. We featured the Mixtral model in a previous blog article; in this article, we evaluate its actual serving performance in terms of latency and throughput.

How Mixture of Experts (MoE) Works

MoE replaces some of the feed-forward layers with sparse MoE layers. Instead of a single dense feed-forward module, the model holds many expert modules plus a "router" that assigns each piece of information (token) to the experts best suited to handle it. In Mixtral, two of the eight experts are activated for each token, allowing the model to process language roughly as quickly as a 12B-parameter dense model, despite having about 4x as many total parameters! A minimal code sketch of such a layer follows the figure below.

Figure: MoE layer, from the Switch Transformers paper
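
To make the routing concrete, here is a minimal, illustrative sketch of a top-2 MoE feed-forward layer in PyTorch. The names (Expert, Top2MoELayer, d_model, d_ff) are ours for illustration, and the expert is a plain two-layer MLP rather than Mixtral's actual SwiGLU feed-forward block; treat this as a sketch of the idea, not Mixtral's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    # A plain two-layer MLP standing in for the real feed-forward block.
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class Top2MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # the "router" (gate)
        self.top_k = top_k

    def forward(self, x):                            # x: (num_tokens, d_model)
        logits = self.router(x)                      # (num_tokens, num_experts)
        weights, expert_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # renormalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e      # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Quick check: 10 tokens of width 32 go in and come out with the same shape,
# but each token only passed through 2 of the 8 experts.
layer = Top2MoELayer(d_model=32, d_ff=128)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 32])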

But what's MoE really about?

  • Efficient Pre-training: MoE can be trained faster and more efficiently compared to traditional dense models, allowing for building larger and more powerful models within the same budget.
  • Faster Inference: Despite having many parameters, MoE only uses a portion of its parameters during inference, making it significantly faster than similar-sized dense models.

Challenges of MoE

  • Training: MoE models can be tricky to generalize across tasks and are prone to overfitting during fine-tuning. Striking a balance between efficient pre-training and robust fine-tuning is key.
  • Memory Requirements: Although MoE activates only a fraction of its parameters during inference, all of them must still be stored in GPU memory, which demands a large amount of GPU memory. This can be a limitation for models with a large number of parameters.

The Model Size of the Mixtral Model and Memory Requirements

Mixtral-8x7B might sound like an ensemble of eight 7B-parameter models, but it's not! Only specific layers (i.e., the feed-forward blocks) branch off into multiple experts, while the rest are shared, so the model has about 45B parameters in total, not 56B.
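
To see why the memory requirement remains demanding even though only two experts run per token, here is a quick back-of-the-envelope estimate using the figures above (45B total parameters, 2 bytes per parameter in half precision). The exact footprint depends on the serving engine, so treat this as a rough sketch rather than a measured number.

# Rough memory math with the numbers quoted above (an estimate, not a measurement).
TOTAL_PARAMS = 45e9     # every expert must sit in GPU memory, even if only 2 run per token
BYTES_PER_PARAM = 2     # fp16 / bf16 weights

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~90 GB, more than a single 80GB A100
# The KV cache, activations, and runtime buffers come on top of this, which is
# why the benchmark below runs on 4x NVIDIA A100 80GB GPUs.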

Beyond Parameters: Serving Performance Matters

Just like any model, Mixtral's real-world performance depends on how it's served. While MoE itself improves efficiency, ensuring smooth service level objectives (SLOs) through efficient serving engines is crucial for optimal user experience and cost-effectiveness.

Evaluation Results for Serving the Mixtral Model

This section evaluates the serving performance of the Mixtral 8x7B instruct model using the Databricks Dolly dataset on 4 NVIDIA A100 80GB GPUs. We focus on two key metrics:

  • 90th-percentile (p90) Latency: The latency under which 90% of the requests sent to the serving engine complete. Lower latency means faster responses.
  • Achieved Throughput: The number of requests the serving engine can handle given a specific latency requirement (i.e., the SLO). Higher throughput means it can handle more requests. A small sketch of how both metrics can be computed from a load test follows this list.
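
As referenced above, here is a minimal sketch of how these two metrics can be computed from a load test. The function names and the load-test harness are hypothetical and only mirror the metric definitions; they are not the exact benchmark setup used here.

import statistics

def p90_latency(latencies):
    # 90th-percentile latency: 90% of requests completed at least this fast.
    return statistics.quantiles(latencies, n=10)[-1]

def achieved_throughput(latencies, duration_s, slo_s):
    # Requests served per second while the p90 latency stays within the SLO;
    # a load level that violates the SLO does not count as achieved throughput.
    return len(latencies) / duration_s if p90_latency(latencies) <= slo_s else 0.0

# Example usage with a hypothetical load generator:
# lats = run_load_test(requests_per_sec=10, duration_s=60)  # hypothetical helper
# print(p90_latency(lats), achieved_throughput(lats, duration_s=60, slo_s=2.0))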

We compare the performance of Mixtral served on the Friendli Engine with a baseline vLLM system.

Figure: p90 latency comparison, Mixtral-8x7B-Instruct-v0.1 on NVIDIA A100 80GB GPUs, Databricks Dolly dataset

Latency Highlights:

  • Up to 31.6x lower (i.e., faster) latency under high load (10N): Mixtral with Friendli Engine significantly outperforms the baseline under heavy workloads.
  • Consistent performance: Friendli Engine delivers faster response times across all tested load levels, making better use of each GPU.

Figure: Achieved throughput comparison, Mixtral-8x7B-Instruct-v0.1 on NVIDIA A100 80GB GPUs, Databricks Dolly dataset

Throughput Highlights:

  • Up to 7.6x higher throughput: As with p90 latency, Friendli Engine delivers significant throughput improvements, consistently handling more tokens than the baseline across all latency requirements.

In summary, the Mixtral 8x7B instruct model served on the Friendli Engine demonstrates significantly faster response times and token generation throughput compared to the baseline vLLM system, especially under high load conditions. This makes Mixtral and the Friendli Engine a compelling option for applications requiring real-time text generation.

In essence, Mixtral served on the Friendli Engine represents a significant step forward in language modeling, offering both speed and efficiency through the innovative MoE approach. Understanding the benefits and challenges of serving the model paves the way for further advancements in this exciting field.

Ready to Unleash the Power of Your LLM? Experience Friendli Engine's performance! We offer three options to suit your preferences:

Visit https://friendli.ai/try-friendli/ to begin your journey into the world of high-performance LLM serving with the Friendli Engine!


Written by


FriendliAI Tech & Research

