• February 28, 2024
  • 3 min read

Running Quantized Mixtral 8x7B on a Single GPU


Building on our previous article, let's revisit the power of Mixture of Experts (MoE). Mixtral, an MoE model from Mistral AI, processes language about as quickly as a 12-billion-parameter dense model despite holding roughly four times that many parameters in total. This efficiency comes from its design: a "router" assigns each token to the most suitable "expert" sub-models, so only a fraction of the parameters is active during inference. As a result, an MoE model runs significantly faster than a dense model of similar total size.
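To make the routing idea concrete, here is a minimal, self-contained sketch of a top-2 gated MoE layer in PyTorch. The class name, dimensions, and expert architecture are illustrative assumptions for this post, not Mixtral's actual implementation.

```python
# Minimal sketch of top-2 expert routing in a Mixture-of-Experts layer.
# Dimensions and module names are illustrative, not Mixtral's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "router" is a small linear layer that scores every expert per token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each "expert" is an independent feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so compute scales with
        # the active parameters rather than the total parameter count.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)
print(TinyMoELayer()(tokens).shape)            # torch.Size([4, 64])
```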

While an MoE model activates fewer parameters per token at inference time, it still has to keep all of them in GPU memory, which becomes a limitation for models with a large number of parameters, especially when accessibility matters. This is where quantization comes into play. Quantization lowers the numerical precision of the model's weights, shrinking its memory footprint and allowing it to run on less powerful hardware, which makes large language models (LLMs) more accessible. Among the different quantization methods, we chose Activation-aware Weight Quantization (AWQ) for its balance between speed and accuracy (see our previous blog article comparing quantization methods).
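As a rough illustration of what serving an AWQ checkpoint looks like on the open-source side, the snippet below loads a community AWQ build of Mixtral with vLLM, the baseline engine in this comparison. The checkpoint name and settings are assumptions chosen for illustration; Friendli Inference is used through its own serving stack rather than this API.

```python
# Hedged sketch: loading an AWQ-quantized Mixtral checkpoint with vLLM.
# The repo name below is a community AWQ build used purely as an example.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # assumed AWQ checkpoint
    quantization="awq",           # tell the engine the weights are AWQ-quantized
    dtype="half",                 # activations stay in FP16; only weights are low-bit
    gpu_memory_utilization=0.90,  # leave some headroom on the single A100 80GB
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain Mixture of Experts in one sentence."], params)
print(outputs[0].outputs[0].text)
```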

To demonstrate this in practice, we ran the AWQ version of the Mixtral-8x7B-Instruct v0.1 model on a single NVIDIA A100 80GB GPU and evaluated its performance on the Databricks Dolly dataset, comparing Mixtral served with Friendli Inference against a vLLM baseline. Here, we'll delve into two key performance metrics (a small measurement sketch follows the list):

  • Time-to-First-Token (TTFT): This metric highlights the speed of the initial response after a user submits their query. It's crucial for services requiring real-time interactions, such as chatbots or virtual assistants.
  • Time Per Output Token (TPOT): This metric reflects the speed at which subsequent tokens in the response are generated for each user. Ideally, TPOT should be faster than human reading speed for a seamless user experience.
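Both engines can expose an OpenAI-compatible streaming API, so TTFT and TPOT can be measured with a few timers around a streamed response. The sketch below shows the idea only; it is not our benchmark harness, and the base URL, model id, and prompt are placeholders.

```python
# Hedged sketch: measuring TTFT and TPOT against an OpenAI-compatible
# streaming endpoint. Endpoint, model id, and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_tokens = 0

stream = client.chat.completions.create(
    model="mixtral-8x7b-instruct-awq",          # placeholder model id
    messages=[{"role": "user", "content": "Summarize the Dolly dataset."}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1                            # counts stream chunks as a token proxy
end = time.perf_counter()

ttft = first_token_at - start                    # time to first token
tpot = (end - first_token_at) / max(n_tokens - 1, 1)  # time per output token
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```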

The results are compelling:

[Figure: TTFT comparison of AWQ Mixtral-8x7B-Instruct v0.1 on a single NVIDIA A100 80GB GPU, Databricks Dolly dataset]

TTFT highlights:

  • Friendli Inference is at least 4.1x faster than vLLM.
  • This performance gap widens significantly with increasing input load, demonstrating Friendli Inference's ability to consistently handle higher volumes efficiently.

[Figure: TPOT comparison of AWQ Mixtral-8x7B-Instruct v0.1 on a single NVIDIA A100 80GB GPU, Databricks Dolly dataset]

TPOT highlights:

  • Friendli Inference is 3.8x to 23.8x faster than vLLM across all tested input loads.
  • Similar to TTFT, the performance gap widens under heavier input loads, showcasing Friendli Inference's amazing scalability.

[Figure: Throughput comparison of AWQ Mixtral-8x7B-Instruct v0.1 on a single NVIDIA A100 80GB GPU, Databricks Dolly dataset]

Throughput highlights:

We applied the same latency constraint to both engines, chosen so that queueing delay stays reasonable and token generation keeps pace with human reading speed, and measured the throughput each engine achieved under that constraint. While the sustainable input loads differed because of the performance gap between the two serving engines, Friendli Inference achieved a remarkable 43.8x improvement in throughput over vLLM under the same latency requirements.
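For readers who want to run this style of comparison themselves, here is a hedged sketch of the throughput-under-SLO methodology: sweep the offered load, keep only the load levels whose latency percentiles meet the constraint, and report the best throughput among them. The SLO values, field names, and numbers are illustrative assumptions, not our actual benchmark data.

```python
# Hedged sketch of "throughput under a latency constraint".
# All numbers and field names are illustrative, not measured results.
from dataclasses import dataclass

@dataclass
class RunResult:
    request_rate: float        # requests/second offered to the engine
    p90_ttft_ms: float         # 90th-percentile time to first token
    p90_tpot_ms: float         # 90th-percentile time per output token
    output_tokens_per_s: float # achieved generation throughput

SLO_TTFT_MS = 2000.0           # example queueing-delay budget
SLO_TPOT_MS = 50.0             # example budget, roughly human reading speed

def max_throughput_under_slo(runs):
    # Keep only load levels that satisfy both latency budgets,
    # then report the best throughput achieved among them.
    ok = [r for r in runs if r.p90_ttft_ms <= SLO_TTFT_MS and r.p90_tpot_ms <= SLO_TPOT_MS]
    return max((r.output_tokens_per_s for r in ok), default=0.0)

runs = [
    RunResult(0.5, 350.0, 22.0, 180.0),
    RunResult(2.0, 900.0, 38.0, 640.0),
    RunResult(4.0, 4200.0, 95.0, 900.0),   # violates both budgets, excluded
]
print(max_throughput_under_slo(runs))      # 640.0
```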

In conclusion, serving AWQ-quantized Mixtral with Friendli Inference delivers significant speed advantages while maintaining accuracy and reducing costs. This opens doors to wider accessibility and real-time interactions with LLMs, paving the way for a more efficient and cost-effective future for AI applications.

Ready to Unleash the Power of Your LLM? Experience Friendli Inference's performance! We offer three options to suit your preferences:

Visit https://friendli.ai/try-friendli to begin your journey into the world of high-performance LLM serving with Friendli Inference!


Written by

FriendliAI Tech & Research




General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production without owning or managing GPU infrastructure. We deliver three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that truly matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you to our model deployment page in one click. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Contact Sales; our experts (not a bot) will reply within one business day.