- December 1, 2025
- 6 min read
FriendliAI Achieves 3× Faster Qwen3 235B Inference Compared to vLLM Infrastructure

Infrastructure Meets Next-Gen Models
As foundation models grow larger, the question is no longer just “how smart is the model?” but “how well can you run it in production?”
When you’re deploying a model like Qwen3 235B, a 235-billion-parameter mixture-of-experts (MoE) model, every GPU cycle, every memory access, every expert-routing decision matters.
That’s why infrastructure is the differentiator. At FriendliAI, our platform is purpose-built to optimize large-scale MoE inference, which is exactly why we decided to benchmark Qwen3 235B and measure how we stack up against a baseline vLLM deployment on standard inference infrastructure.
In this post, you’ll learn:
- What makes Qwen3 235B a breakthrough in model architecture
- How Qwen3 235B can be utilized
- Why large MoE models present unique deployment challenges
- How FriendliAI addresses those challenges and how that shows in benchmark results
- How you can run Qwen3 235B on FriendliAI, quickly and cost-efficiently
Want to skip ahead to the performance benchmark results? Jump to the benchmark section below.
Qwen3 235B: The Next Frontier in MoE Models
Qwen3-235B-A22B-Thinking-2507 is a Mixture-of-Experts (MoE) reasoning model optimized for deep, multi-step problem solving. With 235B total parameters and ~22B active parameters per token, it delivers strong reasoning performance while maintaining efficient inference through sparse expert activation across 128 experts.
The Thinking-2507 variant introduces refined expert routing and enhanced reasoning stability, improving structured deliberation and consistency in long-form analysis.
Benchmark Results
Across GPQA, AIME25, LiveCodeBench v6, HLE, and Arena-Hard v2, Qwen3-235B-A22B-Thinking-2507 demonstrates clear performance gains over its base variant and competitive positioning against Gemini 2.5 Pro, OpenAI o4-mini, and DeepSeek-R1-0528.
Key highlights:
- GPQA: 81.1 – +10 point improvement vs base model
- AIME25: 92.3 – Highest among all compared models
- LiveCodeBench v6: 74.1 – Strong real-world coding reasoning
- Arena-Hard v2: 79.7 – Top win rate in human-aligned evaluations
- HLE: 18.2 – Improved nuanced reasoning over base variant
The most significant gains appear in mathematical and scientific reasoning, reflecting optimized expert routing and internal reasoning calibration.

Figure 1: Qwen3-235B-A22B Benchmark.
Reference: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507.
Why MoE matters
The MoE (Mixture-of-Experts) architecture enables conditional computation: for each input token, only a subset of experts is activated. This means far fewer compute resources are used per request, significantly improving compute and memory efficiency while preserving, and sometimes even enhancing, accuracy.
With Qwen3 235B, we see:
- 128 total experts, with 8 activated per token
- No shared experts (unlike its predecessors), enabling sharper specialization
- Sophisticated routing and GQA (Grouped Query Attention) built into the architecture
In short, you're getting the capacity of a 235B-parameter model, but only the active parts are used for each token. While this is a powerful approach, it still has some challenges for production usage, which we will discuss later in this blog.
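To make the conditional-computation idea concrete, here is a minimal, illustrative top-k routing sketch in PyTorch. It is not Qwen3’s actual implementation (the layer sizes and router are toy placeholders); it only shows how a router selects 8 of 128 experts per token so the remaining parameters stay idle.

```python
# Illustrative top-k MoE routing sketch (toy sizes, not Qwen3's real code).
import torch

NUM_EXPERTS = 128   # total experts in the MoE layer
TOP_K = 8           # experts activated per token
HIDDEN = 64         # toy hidden size for the example

router = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, HIDDEN). Each token is processed by only TOP_K experts."""
    gate = router(x).softmax(dim=-1)                    # (tokens, NUM_EXPERTS)
    weights, chosen = torch.topk(gate, TOP_K, dim=-1)   # pick 8 experts per token
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                         # naive per-token loop for clarity
        for w, e in zip(weights[t], chosen[t]):
            out[t] += w * experts[e](x[t])              # only the chosen experts run
    return out

tokens = torch.randn(4, HIDDEN)
print(moe_forward(tokens).shape)                        # torch.Size([4, 64])
```

In a production system, the per-token loop above is replaced by batched, expert-parallel kernels spread across GPUs, which is exactly where the deployment challenges discussed later in this post come from.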
Use Cases
Coding
Qwen3’s MoE architecture expands model capacity while keeping computation efficient by activating only a small set of experts per token. This additional capacity helps the model handle complex tasks, including coding, more effectively.
This makes it well-suited for complex software engineering environments where consistency, scalability, and speed are critical.
- Large-scale code generation and refactoring
- Multi-file dependency analysis and structural reasoning
- Automated debugging and root-cause identification
- Performance optimization and bottleneck detection
- Incremental codebase maintenance and modernization
By distributing coding responsibilities across domain-specialized experts, Qwen3 maintains high accuracy, stable output quality, and efficient resource usage, even when operating on extensive repositories or multi-language systems.
Multilingual Intelligence
Qwen3-235B’s MoE architecture is well suited for multilingual workloads, increasing capacity while keeping computation efficient by activating only a subset of experts per token. In practice, the router often learns to favor different experts for different linguistic patterns, helping the model capture language-specific syntax and nuance. This dynamic routing improves overall accuracy and efficiency for multilingual tasks.
- High-precision translation with stronger contextual fidelity
- Cross-lingual search and multilingual summarization
- Stable handling of mixed-language (code-switching) inputs
- Culturally aware responses for global customer applications
By activating only the relevant language experts, Qwen3 delivers higher linguistic accuracy with lower computational overhead, making it ideal for scalable global AI systems.
Why Deployment is Difficult
Deploying a Mixture of Experts (MoE) model at scale is inherently complex, due to several key challenges:
- Expert Routing Overhead & Uneven GPU Loads: Efficiently directing inputs to the right expert while maintaining performance can cause significant overhead, leading to uneven GPU resource utilization.
- Latency Spikes: Speed of expert switching heavily depends on inter-GPU communication, often resulting in unpredictable latency spikes that hinder real-time performance.
- Quantization & Mixed-Precision Issues: Scaling large MoE models requires careful handling of quantization and mixed-precision techniques to ensure both accuracy and efficiency, which can be tricky to manage effectively.
While many machine learning frameworks are optimized for dense models, MoE introduces an additional layer of complexity. That's where FriendliAI steps in. By providing tailored infrastructure that addresses these unique challenges, FriendliAI becomes the key enabler for deploying MoE models at scale with efficiency and reliability.
Why FriendliAI: Performance, Reliability, Cost, and Convenience
We benchmarked Qwen3 235B on FriendliAI’s platform and compared it against a platform built on vLLM. The results speak for themselves.

Figure 2: Qwen3-235B Benchmark. Reference: FriendliAI
| Scenario (Input / Output Tokens) | vLLM FP8 | FriendliAI 8-bit | FriendliAI 4-bit |
|---|---|---|---|
| 4000 / 500 | 1.00× | 1.33× | 1.98× |
| 500 / 4000 | 1.00× | 2.11× | 3.26× |
Table 1: Qwen3-235B speedup relative to the vLLM FP8 baseline (1.00×)
These results illustrate that FriendliAI delivers superior throughput and efficiency on large-scale MoE models like Qwen3 235B. In particular, long-output scenarios (such as creative writing and document generation) benefit significantly.
FriendliAI’s performance is driven by:
- MoE-aware routing and execution – Our kernel scheduling optimizes expert activations, reducing per-token overhead.
- Optimized quantization pipeline – Our Online Quantization technique applies 8-bit and 4-bit precision on the fly with minimal impact on quality, enabling higher throughput without compromising results.
- End-to-end infrastructure efficiency – From memory management to GPU utilization, we’ve tuned every layer for large-model performance at scale.
- Cost-efficient token throughput – Higher effective throughput means lower cost per generated token, a tangible ROI for customers.
Furthermore, FriendliAI offers:
- Enterprise-grade reliability – With a 99.99% uptime SLA, we deliver stable production performance, which is critical when running high-value workloads.
- Global, compliance-ready deployment – Geo-distributed infrastructure, data-locality controls, and governance features for regulated industries.
- Intelligent autoscaling – Our autoscaling system automatically adjusts computational resources based on your traffic patterns, helping you optimize both performance and costs.
How to Use Qwen3 235B on FriendliAI
Running Qwen3 235B on FriendliAI is straightforward:
- Log into the FriendliAI Suite.
- Navigate to the model catalog and select Qwen3 235B.
- Choose your desired precision: 8-bit or 4-bit.
- Select your deployment mode: Dedicated Endpoint (high-performance) or Serverless Endpoint (cost-efficient).
- Connect via API and begin your inference. Your routing, quantization, scaling, and GPU scheduling are handled by FriendliAI.
You can send requests using the Friendli Python SDK or any OpenAI-compatible client.
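As a minimal sketch, the request below uses the standard OpenAI Python client pointed at an OpenAI-compatible FriendliAI endpoint. The base URL and model identifier are illustrative assumptions; copy the exact values shown for your endpoint in the FriendliAI console.

```python
# Minimal sketch: OpenAI Python client against an OpenAI-compatible endpoint.
# The base_url and model identifier are illustrative; use the values from
# your FriendliAI console.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],                # your FriendliAI token
    base_url="https://api.friendli.ai/serverless/v1",    # example endpoint URL
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507",          # example model identifier
    messages=[
        {"role": "user", "content": "Explain mixture-of-experts inference in two sentences."},
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```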
Within minutes you’re running at peak performance with minimal operational overhead. FriendliAI handles all backend optimization, scaling, and performance tuning, allowing teams to focus purely on building AI applications.
Get Started Today
Ready to take advantage of Qwen3 235B on FriendliAI?
- Deploy instantly via the FriendliAI console.
- Ramp up enterprise-grade workloads with minimal effort.
- Achieve faster inference, lower cost and higher throughput.
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub gives you one-click deployment and takes you to our model deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

