- December 1, 2025
- 6 min read
FriendliAI Achieves 3× Faster Qwen3 235B Inference Compared to vLLM Infrastructure

Infrastructure Meets Next-Gen Models
As foundation models grow larger, the question is no longer just “how smart is the model?” but “how well can you run it in production?”
When you’re deploying a model like Qwen3 235B, a 235-billion-parameter mixture-of-experts (MoE) model, every GPU cycle, every memory access, every expert-routing decision matters.
That’s why infrastructure is the differentiator. At FriendliAI, our platform is purpose-built to optimize large-scale MoE inference, which is exactly why we decided to benchmark Qwen3 235B and measure how we stack up against a baseline vLLM deployment on standard inference infrastructure.
In this post, you’ll learn:
- What makes Qwen3 235B a breakthrough in model architecture
- How Qwen3 235B can be utilized
- Why large MoE models present unique deployment challenges
- How FriendliAI addresses those challenges and how that shows in benchmark results
- How you can run Qwen3 235B on FriendliAI, quickly and cost-efficiently
Want to skip ahead to the performance benchmark results? Jump to the benchmark section below.
Qwen3 235B: The Next Frontier in MoE Models
Qwen3-235B-A22B-Thinking-2507 is a Mixture-of-Experts (MoE) reasoning model optimized for deep, multi-step problem solving. With 235B total parameters and ~22B active parameters per token, it delivers strong reasoning performance while maintaining efficient inference through sparse expert activation across 128 experts.
The Thinking-2507 variant introduces refined expert routing and enhanced reasoning stability, improving structured deliberation and consistency in long-form analysis.
Benchmark Results
Across GPQA, AIME25, LiveCodeBench v6, HLE, and Arena-Hard v2, Qwen3-235B-A22B-Thinking-2507 demonstrates clear performance gains over its base variant and competitive positioning against Gemini 2.5 Pro, OpenAI o4-mini, and DeepSeek-R1-0528.
Key highlights:
- GPQA: 81.1 – +10 point improvement vs base model
- AIME25: 92.3 – Highest among all compared models
- LiveCodeBench v6: 74.1 – Strong real-world coding reasoning
- Arena-Hard v2: 79.7 – Top win rate in human-aligned evaluations
- HLE: 18.2 – Improved nuanced reasoning over base variant
The most significant gains appear in mathematical and scientific reasoning, reflecting optimized expert routing and internal reasoning calibration.

Figure 1: Qwen3-235B-A22B Benchmark.
Reference: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507.
Why MoE matters
The MoE (Mixture-of-Experts) architecture enables conditional computation: for each input token, only a subset of experts is activated. This means far fewer compute resources are used per request, significantly improving compute and memory efficiency while preserving, and sometimes even enhancing, accuracy.
With Qwen3 235B, we see:
- 128 total experts, with 8 activated per token
- No shared experts (unlike its predecessors), enabling sharper specialization
- Sophisticated routing and GQA (Grouped Query Attention) built into the architecture
In short, you're getting the capacity of a 235B-parameter model, but only the active parts are used for each token. While this is a powerful approach, it still has some challenges for production usage, which we will discuss later in this blog.
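To make the conditional-computation idea concrete, here is a minimal, illustrative top-k routing sketch in PyTorch. It is not Qwen3’s actual implementation (the layer sizes and router are toy placeholders); it only shows how a router selects 8 of 128 experts per token so the remaining parameters stay idle.

```python
# Illustrative top-k MoE routing sketch (toy sizes, not Qwen3's real code).
import torch

NUM_EXPERTS = 128   # total experts in the MoE layer
TOP_K = 8           # experts activated per token
HIDDEN = 64         # toy hidden size for the example

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, HIDDEN). Each token is processed by only TOP_K experts."""
    gate = router(x).softmax(dim=-1)                    # (tokens, NUM_EXPERTS)
    weights, chosen = torch.topk(gate, TOP_K, dim=-1)   # pick 8 experts per token
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                         # naive per-token loop for clarity
        for w, e in zip(weights[t], chosen[t]):
            out[t] += w * experts[e](x[t])              # only the chosen experts run
    return out

tokens = torch.randn(4, HIDDEN)
print(moe_forward(tokens).shape)                        # torch.Size([4, 64])
```

In a production system, the per-token loop above is replaced by batched, expert-parallel kernels spread across GPUs, which is exactly where the deployment challenges discussed later in this post come from.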
Use Cases
Coding
Qwen3’s MoE architecture expands model capacity while keeping computation efficient by activating only a small set of experts per token. This additional capacity helps the model handle complex tasks, including coding, more effectively.
This makes it well-suited for complex software engineering environments where consistency, scalability, and speed are critical.
- Large-scale code generation and refactoring
- Multi-file dependency analysis and structural reasoning
- Automated debugging and root-cause identification
- Performance optimization and bottleneck detection
- Incremental codebase maintenance and modernization
By distributing coding responsibilities across domain-specialized experts, Qwen3 maintains high accuracy, stable output quality, and efficient resource usage, even when operating on extensive repositories or multi-language systems.
Multilingual Intelligence
Qwen3-235B’s MoE architecture is well suited for multilingual workloads, increasing capacity while keeping computation efficient by activating only a subset of experts per token. In practice, the router often learns to favor different experts for different linguistic patterns, helping the model capture language-specific syntax and nuance. This dynamic routing improves overall accuracy and efficiency for multilingual tasks.
- High-precision translation with stronger contextual fidelity
- Cross-lingual search and multilingual summarization
- Stable handling of mixed-language (code-switching) inputs
- Culturally aware responses for global customer applications
By activating only the relevant language experts, Qwen3 delivers higher linguistic accuracy with lower computational overhead, making it ideal for scalable global AI systems.
Why Deployment is Difficult
Deploying a Mixture of Experts (MoE) model at scale is inherently complex, due to several key challenges:
- Expert Routing Overhead & Uneven GPU Loads: Efficiently directing inputs to the right expert while maintaining performance can cause significant overhead, leading to uneven GPU resource utilization.
- Latency Spikes: Speed of expert switching heavily depends on inter-GPU communication, often resulting in unpredictable latency spikes that hinder real-time performance.
- Quantization & Mixed-Precision Issues: Scaling large MoE models requires careful handling of quantization and mixed-precision techniques to ensure both accuracy and efficiency, which can be tricky to manage effectively.
While many machine learning frameworks are optimized for dense models, MoE introduces an additional layer of complexity. That's where FriendliAI steps in. By providing tailored infrastructure that addresses these unique challenges, FriendliAI becomes the key enabler for deploying MoE models at scale with efficiency and reliability.
Why FriendliAI: Performance, Reliability, Cost, and Convenience
We benchmarked Qwen3 235B on FriendliAI’s platform and compared it against a platform built on vLLM. The results speak for themselves.

Figure 2: Qwen3-235B Benchmark. Reference: FriendliAI
| Scenario (Input / Output Tokens) | vLLM FP8 | FriendliAI 8-bit | FriendliAI 4-bit |
|---|---|---|---|
| 4000 / 500 | 1.00× | 1.33× | 1.98× |
| 500 / 4000 | 1.00× | 2.11× | 3.26× |
Table 1: Qwen3-235B speedup relative to the vLLM FP8 baseline (1.00×)
These results illustrate that FriendliAI delivers superior throughput and efficiency on large-scale MoE models like Qwen3 235B. In particular, long-output scenarios (such as creative writing and document generation) benefit significantly.
FriendliAI’s performance is driven by:
- MoE-aware routing and execution – Our kernel scheduling optimizes expert activations, reducing per-token overhead.
- Optimized quantization pipeline – Our Online Quantization technique applies 8-bit and 4-bit precision on the fly with minimal impact on quality, enabling higher throughput without compromising results.
- End-to-end infrastructure efficiency – From memory management to GPU utilization, we’ve tuned every layer for large-model performance at scale.
- Cost-efficient token throughput – Higher effective throughput means lower cost per generated token, a tangible ROI for customers.
Furthermore, FriendliAI offers:
- Enterprise-grade reliability – With a 99.99% uptime SLA, we deliver stable production performance, which is critical when running high-value workloads.
- Global, compliance-ready deployment – Geo-distributed infrastructure, data-locality controls, and governance features for regulated industries.
- Intelligent autoscaling – Our autoscaling system automatically adjusts computational resources based on your traffic patterns, helping you optimize both performance and costs.
How to Use Qwen3 235B on FriendliAI
Running Qwen3 235B on FriendliAI is straightforward:
- Log into the FriendliAI Suite.
- Navigate to the model catalog and select Qwen3 235B.
- Choose your desired precision: 8-bit or 4-bit.
- Select your deployment mode: Dedicated Endpoint (high-performance) or Serverless Endpoint (cost-efficient).
- Connect via API and begin your inference. Your routing, quantization, scaling, and GPU scheduling are handled by FriendliAI.
You can send requests using the Friendli Python SDK or any OpenAI-compatible client.
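As a minimal sketch, the request below uses the standard OpenAI Python client pointed at an OpenAI-compatible FriendliAI endpoint. The base URL and model identifier are illustrative assumptions; copy the exact values shown for your endpoint in the FriendliAI console.

```python
# Minimal sketch: OpenAI Python client against an OpenAI-compatible endpoint.
# The base_url and model identifier are illustrative; use the values from
# your FriendliAI console.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],                # your FriendliAI token
    base_url="https://api.friendli.ai/serverless/v1",    # example endpoint URL
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",          # example model identifier
    messages=[
        {"role": "user", "content": "Explain mixture-of-experts inference in two sentences."},
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```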
Within minutes you’re running at peak performance with minimal operational overhead. FriendliAI handles all backend optimization, scaling, and performance tuning, allowing teams to focus purely on building AI applications.
Get Started Today
Ready to take advantage of Qwen3 235B on FriendliAI?
- Deploy instantly via the FriendliAI console.
- Ramp up enterprise-grade workloads with minimal effort.
- Achieve faster inference, lower cost and higher throughput.
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub gives you one-click deployment and takes you to our model deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

