  • February 28, 2024
  • 3 min read

Running Quantized Mixtral 8x7B on a Single GPU


Building on our previous article, let's revisit the power of Mixture of Experts (MoE). Mixtral, an MoE model from Mistral AI, processes language as quickly as a 12-billion-parameter model despite having roughly four times as many parameters in total. This efficiency comes from its design: a "router" assigns each token to the most suitable "expert" sub-models, so only a fraction of the parameters is used during inference. As a result, the MoE model is significantly faster than dense models of similar size.
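To make the routing step concrete, here is a toy PyTorch sketch of top-k expert routing in the spirit of Mixtral's eight-experts-pick-two design. The layer sizes are illustrative placeholders, and this is not Mistral AI's implementation, only a minimal sketch of the mechanism.

```python
# Minimal sketch of MoE top-k routing: every token is scored against all
# experts, but only the top_k experts actually run for that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                          # score every expert per token
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)
print(TinyMoELayer()(tokens).shape)  # torch.Size([4, 64]); only 2 of 8 experts ran per token
```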

While MoE uses fewer parameters during inference, it still needs to keep all of them in GPU memory, which becomes a limitation for models with a large number of parameters, especially when accessibility is a concern. This is where quantization comes into play. Quantization reduces the numerical precision of the model's weights, shrinking its memory footprint and letting it run on less powerful hardware, which makes large language models (LLMs) more accessible. Among the different quantization methods, we have chosen Activation-Aware Weight Quantization (AWQ) for its optimal balance between speed and precision (see our previous blog article comparing quantization methods).
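As a rough illustration of the idea behind AWQ (not the actual algorithm, which searches for the scaling factors), the sketch below quantizes a weight matrix to 4 bits group-wise and shows how scaling up the weight columns that meet large activations can reduce the output error. All shapes, group sizes, and data are synthetic placeholders.

```python
# Simplified sketch of activation-aware weight quantization: protect the
# "salient" input channels (those multiplied by large activations) by scaling
# them up before rounding to 4-bit integers, then folding the scale back.
import numpy as np

def fake_int4(w, group_size=64):
    """Group-wise symmetric 4-bit quantize/dequantize of a weight matrix."""
    shape = w.shape
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0 + 1e-8  # int4 range [-7, 7]
    q = np.clip(np.round(groups / scale), -7, 7)
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)   # weight matrix (out, in)
x = rng.normal(size=(1024, 256)).astype(np.float32)  # sample activations (tokens, in)
x[:, :8] *= 30.0                                     # a few outlier channels, as seen in real LLMs

s = np.abs(x).mean(axis=0) ** 0.5                    # per-input-channel scale from activations
w_awq = fake_int4(w * s) / s                         # quantize scaled weights, fold the scale back

err_plain = np.abs(x @ fake_int4(w).T - x @ w.T).mean()
err_awq = np.abs(x @ w_awq.T - x @ w.T).mean()
print(f"plain int4 error: {err_plain:.4f}  activation-aware: {err_awq:.4f}")
```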

To demonstrate this in practice, we ran the AWQ-ed version of the Mixtral-8x7B-Instruct v0.1 model on a single NVIDIA A100 80GB GPU and evaluated its performance on the Databricks Dolly dataset. We compared the performance of Mixtral served on the Friendli Engine with a baseline vLLM system. Here, we'll delve into two key performance metrics (a sketch of how they can be measured follows the list):

  • Time-to-First-Token (TTFT): This metric highlights the speed of the initial response after a user submits their query. It's crucial for services requiring real-time interactions, such as chatbots or virtual assistants.
  • Time Per Output Token (TPOT): This metric reflects the speed at which subsequent tokens in the response are generated for each user. Ideally, TPOT should be faster than human reading speed for a seamless user experience.
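For readers who want to see how these two metrics come out of a streaming response, here is a minimal sketch that times TTFT and TPOT against an OpenAI-compatible streaming endpoint. The base URL, API key, and model name are placeholders, and this is not the exact benchmarking harness behind the numbers reported below.

```python
# Time-to-First-Token (TTFT) and Time Per Output Token (TPOT) from a
# streaming chat completion. Endpoint and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure(prompt: str, model: str = "mixtral-8x7b-instruct-awq"):
    start = time.perf_counter()
    first_token_at = None
    num_tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first generated token arrives
            num_tokens += 1                           # roughly one token per chunk
    end = time.perf_counter()
    ttft = first_token_at - start
    tpot = (end - first_token_at) / max(num_tokens - 1, 1)
    return ttft, tpot

ttft, tpot = measure("Explain mixture-of-experts in one paragraph.")
print(f"TTFT: {ttft:.3f}s  TPOT: {tpot * 1000:.1f}ms/token")
```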

The results are compelling:

[Figure: TTFT comparison of AWQ-ed Mixtral-8x7B-Instruct v0.1 on an NVIDIA A100 80GB GPU, Databricks Dolly dataset]

TTFT highlights:

  • Friendli Engine is at least 4.1x faster than vLLM.
  • This performance gap widens significantly with increasing input load, demonstrating Friendli Engine's ability to consistently handle higher volumes efficiently.

[Figure: TPOT comparison of AWQ-ed Mixtral-8x7B-Instruct v0.1 on an NVIDIA A100 80GB GPU, Databricks Dolly dataset]

TPOT highlights:

  • Friendli Engine is 3.8x to 23.8x faster than vLLM across all tested input levels.
  • Similar to TTFT, the performance gap widens under heavier input loads, showcasing Friendli Engine's amazing scalability.

[Figure: Throughput comparison of AWQ-ed Mixtral-8x7B-Instruct v0.1 on an NVIDIA A100 80GB GPU, Databricks Dolly dataset]

Throughput highlights:

We imposed the same latency constraint on both engines, chosen so that queueing delay stays reasonable and output keeps pace with human reading speed, and measured the throughput each engine achieved under it. Although the input loads differed because of the performance gap between the two serving engines, Friendli Engine delivered a remarkable 43.8x higher throughput than vLLM under the same latency requirement.
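One way to frame this kind of measurement is sketched below: keep only load-test runs whose latency stays within the constraint and report the aggregate output tokens per second. The SLO values, record fields, and numbers are illustrative placeholders, not the configuration used for the results above.

```python
# Throughput under a latency SLO: a run only counts if every request in it
# meets the TTFT and TPOT constraints; throughput is output tokens per second.
from dataclasses import dataclass

@dataclass
class RequestRecord:
    ttft: float          # seconds until the first output token
    tpot: float          # average seconds per subsequent output token
    output_tokens: int

def run_meets_slo(records, max_ttft=2.0, max_tpot=0.05):
    """True if the whole run satisfies the TTFT/TPOT constraints."""
    return all(r.ttft <= max_ttft and r.tpot <= max_tpot for r in records)

def throughput(records, wall_clock_seconds):
    """Aggregate output tokens per second over the run."""
    return sum(r.output_tokens for r in records) / wall_clock_seconds

# Usage: sweep the request rate upward and report the highest throughput
# whose run still meets the SLO.
records = [RequestRecord(ttft=0.8, tpot=0.03, output_tokens=200) for _ in range(64)]
if run_meets_slo(records):
    print(f"{throughput(records, wall_clock_seconds=30.0):.1f} tokens/s under the SLO")
```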

In conclusion, Friendli Engine, powered by Mixtral with AWQ, offers significant speed advantages while maintaining accuracy and reducing costs. This opens doors to wider accessibility and real-time interactions with LLMs, paving the way for a more efficient and cost-effective future for AI applications.

Ready to Unleash the Power of Your LLM? Experience Friendli Engine's performance for yourself! We offer three options to suit your preferences.

Visit https://friendli.ai/try-friendli/ to begin your journey into the world of high-performance LLM serving with the Friendli Engine!


Written by

FriendliAI Tech & Research

