- February 28, 2024
Running Quantized Mixtral 8x7B on a Single GPU
Building on our previous article, let's revisit the power of Mixture of Experts (MoE). Mixtral, an MoE model from Mistral AI, processes language about as quickly as a 12-billion-parameter dense model despite having roughly 4 times as many total parameters. This efficiency comes from its design: a "router" assigns each token to the most suitable "expert" sub-models, so only a portion of the parameters is used for each token during inference. As a result, an MoE model is significantly faster than a dense model of similar size.
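The routing idea can be illustrated with a minimal sketch. This is a toy under stated assumptions, not Mixtral's actual implementation: the function names, the 8 toy experts, and the router logits are hypothetical, and the choice of 2 experts per token reflects Mixtral's published design rather than anything measured in this post.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(x, router_logits, experts, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, top_k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: expert i just scales its input by (i + 1).
toy_experts = [lambda v, s=i + 1: [s * e for e in v] for i in range(8)]
# Hypothetical router scores for a single token.
toy_logits = [0.1, 2.0, -0.5, 1.5, 0.0, -1.0, 0.3, 0.7]

print(route_token(toy_logits))
print(moe_layer([1.0, 2.0], toy_logits, toy_experts))
```

Only 2 of the 8 toy experts run for the token, which is why compute per token tracks the active parameters rather than the total.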
While an MoE model uses fewer parameters per token during inference, it still needs to keep all of them in GPU memory, which limits accessibility for models with a large number of parameters. This is where quantization comes into play. Quantization stores the model at lower numerical precision, which shrinks its memory footprint and lets it run on less powerful hardware, making large language models (LLMs) more accessible. Among the different quantization methods, we chose Activation-aware Weight Quantization (AWQ) for its balance between speed and accuracy (refer to our previous blog article comparing quantization methods).
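A rough back-of-the-envelope calculation shows why quantization matters for fitting on one GPU. The numbers below are assumptions, not figures from this post: roughly 46.7 billion total parameters for Mixtral 8x7B, 16-bit weights for the unquantized model, and 4-bit weights for AWQ (a common AWQ setting). The estimate covers weights only and ignores quantization scales, the KV cache, and activations.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Assumption: approximate total parameter count of Mixtral 8x7B.
TOTAL_PARAMS = 46.7e9
# The single A100 used in this post has 80 GB of memory.
GPU_MEMORY_GB = 80

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # assumed unquantized precision
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)   # assumed AWQ bit width

for label, size in [("16-bit", fp16_gb), ("4-bit", int4_gb)]:
    verdict = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{label} weights: ~{size:.1f} GB, {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions the 16-bit weights alone exceed 80 GB, while the 4-bit weights leave most of the GPU's memory free for the KV cache and activations.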
To demonstrate this in practice, we ran the AWQ-quantized version of the Mixtral-8x7B-Instruct v0.1 model on a single NVIDIA A100 80GB GPU and evaluated its performance on the Databricks Dolly dataset. We compared Mixtral served on Friendli Inference against a baseline vLLM system, focusing on two key performance metrics:
- Time-to-First-Token (TTFT): This metric highlights the speed of the initial response after a user submits their query. It's crucial for services requiring real-time interactions, such as chatbots or virtual assistants.
- Time Per Output Token (TPOT): This metric reflects how quickly each subsequent token in the response is generated for a user. Ideally, tokens should arrive faster than a person can read them for a seamless user experience.
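Both metrics can be computed from per-token arrival timestamps. The sketch below is one common way to define them, assuming TPOT is averaged over the tokens after the first; the function names and the sample timestamps are hypothetical and are not measurements from this benchmark.

```python
def time_to_first_token(request_sent, token_times):
    """Seconds between sending the request and receiving the first token."""
    if not token_times:
        raise ValueError("no tokens were generated")
    return token_times[0] - request_sent

def time_per_output_token(token_times):
    """Average seconds between consecutive tokens after the first one."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

# Hypothetical timestamps (in seconds) for one streamed response.
sent_at = 0.0
arrivals = [0.25, 0.30, 0.35, 0.40, 0.45]

print(f"TTFT: {time_to_first_token(sent_at, arrivals):.3f} s")
print(f"TPOT: {time_per_output_token(arrivals):.3f} s/token")
```

Separating the two matters because TTFT includes queueing and prompt processing, while TPOT isolates the steady-state generation speed the user sees while reading.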
The results are compelling:
TTFT highlights:
- Friendli Inference is at least 4.1x faster than vLLM.
- This performance gap widens significantly with increasing input load, demonstrating Friendli Inference's ability to consistently handle higher volumes efficiently.
TPOT highlights:
- Friendli Inference is 3.8x to 23.8x faster than vLLM across all tested input levels.
- As with TTFT, the performance gap widens under heavier input loads, showcasing Friendli Inference's scalability.
Throughput highlights:
We applied the same latency constraint to both engines, chosen to keep queueing delay reasonable and token generation at least as fast as human reading speed, and measured the throughput each engine achieved under that constraint. Because the engines perform differently, the input load each could sustain differed. Under identical latency requirements, Friendli Inference achieved 43.8x higher throughput than vLLM.
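The idea of measuring throughput under a latency constraint can be sketched as follows: sweep the input load, then keep the highest load whose latencies still meet the limits. This is an illustrative procedure under assumptions, not the exact methodology used here; the function name, field names, limits, and every number in the sample sweep are made-up placeholders.

```python
def best_point_within_limits(measurements, ttft_limit, tpot_limit):
    """Return the highest-throughput measurement that meets both latency limits.

    Each measurement is a dict with keys "throughput" (requests/s),
    "ttft" (seconds), and "tpot" (seconds per token).
    Returns None if no measurement satisfies the constraint.
    """
    within = [m for m in measurements
              if m["ttft"] <= ttft_limit and m["tpot"] <= tpot_limit]
    if not within:
        return None
    return max(within, key=lambda m: m["throughput"])

# Placeholder load sweep for one hypothetical engine (not benchmark data).
sweep = [
    {"throughput": 1.0, "ttft": 0.2, "tpot": 0.03},
    {"throughput": 2.0, "ttft": 0.4, "tpot": 0.05},
    {"throughput": 4.0, "ttft": 3.0, "tpot": 0.20},  # violates both limits
]

# Placeholder limits standing in for "reasonable queueing delay"
# and "human reading speed".
print(best_point_within_limits(sweep, ttft_limit=1.0, tpot_limit=0.1))
```

Running the same selection for each engine and comparing the selected throughputs gives a like-for-like ratio, even though the engines reach their limits at different input loads.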
In conclusion, serving AWQ-quantized Mixtral on Friendli Inference offers significant speed advantages while maintaining accuracy and reducing costs. This opens the door to wider accessibility and real-time interactions with LLMs, paving the way for a more efficient and cost-effective future for AI applications.
Ready to Unleash the Power of Your LLM? Experience Friendli Inference's performance! We offer three options to suit your preferences:
- Friendli Container: Deploy the engine on your own infrastructure for ultimate control.
- Friendli Dedicated Endpoints: Run any custom generative AI model on dedicated GPU instances on autopilot.
- Friendli Serverless Endpoints: No setup required, simply call our APIs and let us handle the rest.
Visit https://friendli.ai/try-friendli/ to begin your journey into the world of high-performance LLM serving with Friendli Inference!
Written by
FriendliAI Tech & Research