• August 26, 2025
  • 5 min read

The Rise of MoE: Comparing 2025’s Leading Mixture-of-Experts AI Models


Introduction

As the demand for more efficient, scalable, and intelligent AI continues to surge, Mixture of Experts (MoE) architectures have become the leading strategy for achieving performance at scale. Unlike dense models, where all parameters are active at inference, MoE models route inputs through only a subset of their “experts” (specialized neural pathways), dramatically reducing compute requirements while enabling massive parameter growth.

In this post, we briefly explain what MoE is and compare several state-of-the-art MoE models released in 2025, including GPT-OSS (20B/120B), DeepSeek-R1-0528, LLaMA-4 Maverick, and Qwen3-235B-A22B. These models span different modalities, context lengths, and routing mechanisms, offering a glimpse into the evolving design landscape of sparse expert-based architectures.

What Is a Mixture-of-Experts?

In the context of transformer models, a Mixture-of-Experts (MoE) architecture consists of two main elements:

  • Sparse MoE layers that replace traditional dense feed-forward network (FFN) layers. An MoE layer contains multiple “experts” (e.g., 8), each of which is typically a feed-forward network. While experts are usually standard FFNs, they can also be more complex sub-networks or even MoEs themselves, leading to hierarchical MoEs.
  • A gate network (or router) that determines which experts handle which tokens. For example, in the illustration below, the token “The” is routed to the second expert, while “Dog” is routed to the first. Tokens can be sent to more than one expert, depending on the routing strategy. The router consists of learnable parameters and is trained jointly with the rest of the model.


Figure 1: Architectural Comparison Between Dense Model and Sparse Model (MoE). Reference: A Review of Sparse Expert Models in Deep Learning. [Online] Available: https://arxiv.org/pdf/2209.01667. [Accessed Aug 5, 2025]

The main advantage of this architecture is efficiency. During both pretraining and inference, only a small number of experts are activated per token, significantly reducing the number of parameters involved in each step compared to a dense model.
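To make this concrete, the sketch below shows a minimal sparse MoE layer with top-k gating in PyTorch. It is an illustrative toy, not the implementation used by any of the models discussed here: the expert count, hidden sizes, and top_k value are arbitrary, and production systems add load-balancing losses, capacity limits, and expert parallelism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard feed-forward network, the usual building block for an expert."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SparseMoELayer(nn.Module):
    """Sparse MoE layer: a learnable router picks top_k experts per token and
    combines their outputs with softmax weights over the selected set."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # gate network, trained jointly
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # pick top_k experts per token
        weights = F.softmax(weights, dim=-1)            # softmax over the selected set only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


# Route 4 tokens through 8 experts, 2 active per token.
tokens = torch.randn(4, 512)
layer = SparseMoELayer()
print(layer(tokens).shape)  # torch.Size([4, 512])
```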

With efficiency at the core of their design, modern MoE models vary in how they structure expert layers, routing strategies, and supported contexts. The table below outlines these differences across leading models.

Comparing 2025's Leading MoE Models

The table below summarizes the core architectural specifications of leading Mixture‑of‑Experts (MoE) models released in 2025, including parameter scale, expert configuration, context length and modality.

| Model | Total Params | Activated Params | Expert Pool Size | Active Experts per Token | Context Length | Modality |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-OSS-120B | 117B | 5.1B | 128 | 4 | 128K | Text-to-Text |
| GPT-OSS-20B | 21B | 3.6B | 32 | 4 | 128K | Text-to-Text |
| DeepSeek-R1-0528 | 671B | 37B | 256 | 9 (1 shared) | 128K | Text-to-Text |
| Llama-4 Maverick | 400B | 17B | 128 | 2 (1 shared) | 1M | Image-Text-to-Text |
| Llama-4 Scout | 109B | 17B | 16 | 2 (1 shared) | 10M | Image-Text-to-Text |
| Qwen3-235B-A22B | 235B | 22B | 128 | 8 | 32K (~131K YaRN) | Text-to-Text |
| Qwen3-30B-A3B | 30.5B | 3.3B | 128 | 8 | 32K (~131K YaRN) | Text-to-Text |
Figure 2: Comparative table of leading MoE models.
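One way to read the table is through the ratio of activated to total parameters, which captures how sparse each model's compute is per token. A quick calculation from the (rounded) figures above:

```python
# Sparsity of each model from the table: (total params, activated params), in billions.
models = {
    "GPT-OSS-120B":     (117.0, 5.1),
    "GPT-OSS-20B":      (21.0, 3.6),
    "DeepSeek-R1-0528": (671.0, 37.0),
    "Llama-4 Maverick": (400.0, 17.0),
    "Llama-4 Scout":    (109.0, 17.0),
    "Qwen3-235B-A22B":  (235.0, 22.0),
    "Qwen3-30B-A3B":    (30.5, 3.3),
}

for name, (total_b, active_b) in models.items():
    print(f"{name:<17} {active_b / total_b:6.1%} of parameters active per token")

# DeepSeek-R1-0528, Llama-4 Maverick, and GPT-OSS-120B are the sparsest (~4-6% active);
# GPT-OSS-20B and Llama-4 Scout activate a much larger share (~16-17%).
```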

MoE Design Patterns and Trade-offs

In the earlier comparison, we looked at how these models line up side by side in terms of architecture. Here, we focus on the design patterns behind those specifications: their similarities, differences, and the trade-offs each team made. Understanding these patterns is key to predicting how a model will scale, adapt, and specialize in real-world deployments.

1. Routing Strategy - How tokens meet experts


Figure 3: Comparison of Mixture-of-Experts Routing Strategies. Reference: DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. [Online] Available: https://arxiv.org/pdf/2401.06066. [Accessed Aug 5, 2025]

  • Top-k Routing Without Shared Experts

    • GPT-OSS (120B & 20B) uses top-4 routing from 128 experts (120B) or 32 experts (20B), with outputs weighted via softmax over the selected set.
    • Qwen3 (235B & 30B) uses top-8 routing from 128 experts per layer.

    In both designs, there is no shared expert, maximizing specialization by allowing each expert to develop more distinct capabilities while simplifying scaling.

  • Hybrid Top-k Routing With Shared Experts

    • DeepSeek-R1-0528 activates 1 shared expert for all tokens plus 8 routed experts chosen from a 256-expert pool.
    • LLaMA-4 Maverick and Scout activate 1 shared expert plus 1 routed expert, chosen from a pool of 128 experts in Maverick and 16 in Scout.

    Here, the shared expert stabilizes generalization, while the routed pathway enables token-level specialization; a minimal sketch of this hybrid pattern follows below.
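Reusing the Expert and SparseMoELayer classes from the earlier sketch, the snippet below illustrates the hybrid pattern: a shared expert that every token passes through, plus a routed top-k path. The configuration numbers in the comments are only rough analogies to the models above; real implementations differ in normalization, load balancing, and how shared and routed outputs are combined.

```python
class HybridMoELayer(nn.Module):
    """Hybrid routing sketch: one always-on shared expert plus a routed top_k path."""

    def __init__(self, d_model=512, d_hidden=2048, num_routed=16, top_k=1):
        super().__init__()
        self.shared_expert = Expert(d_model, d_hidden)  # processes every token
        self.routed = SparseMoELayer(d_model, d_hidden, num_experts=num_routed, top_k=top_k)

    def forward(self, x):
        # Shared path stabilizes generalization; routed path adds token-level specialization.
        return self.shared_expert(x) + self.routed(x)


# Rough analogies (not exact configs): a DeepSeek-R1-style layer would use
# num_routed=256, top_k=8 plus 1 shared expert; a Llama-4 Scout-style layer
# would use num_routed=16, top_k=1 plus 1 shared expert.
tokens = torch.randn(4, 512)
print(HybridMoELayer()(tokens).shape)  # torch.Size([4, 512])
```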

2. Expert Pool Size - Specialization vs. Efficiency

  • Large Routed Pools for Fine-Grained Specialization: GPT-OSS-120B (128 experts), Qwen3 (128 experts), DeepSeek-R1-0528 (256 experts), and LLaMA-4 Maverick (128 experts) provide high diversity in expert capabilities.

    A bigger pool of experts allows each expert to specialize in narrower patterns, improving performance across diverse inputs.

  • Compact Routed Pools for Efficiency: GPT-OSS-20B (32 experts) and LLaMA-4 Scout (16 experts) shrink the total parameter footprint while keeping the same activation pattern, reducing storage and training overhead, though per-token inference compute stays roughly the same, as the rough calculation below illustrates.
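A back-of-the-envelope calculation makes the trade-off visible; the per-expert size and layer count below are assumptions for illustration only, not the published configuration of any model in the table.

```python
# How the routed pool size affects total vs. active FFN parameters.
# Assumed values for illustration only (not real model configs):
params_per_expert = 50e6   # FFN parameters per expert
num_layers = 48            # number of MoE layers
top_k = 4                  # experts activated per token

for pool_size in (16, 32, 128, 256):
    total_expert_params = pool_size * params_per_expert * num_layers
    active_expert_params = top_k * params_per_expert * num_layers
    print(f"pool={pool_size:>3}: total expert params ≈ {total_expert_params / 1e9:6.1f}B, "
          f"active per token ≈ {active_expert_params / 1e9:4.1f}B")

# Growing the pool from 16 to 256 experts multiplies stored parameters 16x,
# while per-token compute (top_k active experts) stays constant.
```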

Across these architectures, two core themes emerge:

  1. With Shared Experts vs. Without Shared Experts: balancing maximum specialization with stable generalization.
  2. Expert pool size: trading fine-grained capability for parameter efficiency, especially during training.

The varied approaches seen in GPT-OSS, DeepSeek, LLaMA-4, and Qwen3 show there's no single “best” MoE design, only trade-offs tailored to different deployment goals.

Whether the priority is cost-efficient scaling, multimodal reasoning, ultra-long context, or adaptive compute usage, MoE architectures are proving to be one of the most versatile tools for building next-generation AI systems.

Quantization – Making Massive MoEs Deployable

While the sparse routing architecture of MoE reduces the number of parameters processed per token, the size of each active expert still affects memory and throughput.

This is where quantization comes in: lowering numerical precision cuts memory use and speeds up inference without major accuracy loss.

Here's how the models above leverage quantization:

  • GPT‑OSS uses native MXFP4 for MoE layers, enabling the 120B model to run on a single 80 GB H100, and the 20B version in just 16 GB of memory.
  • DeepSeek-R1-0528 offers FP4 and even ultra-compressed 1.78-bit versions for lightweight deployment.
  • LLaMA‑4 Maverick was released in BF16 weights, with an FP8 quantized version also available on Hugging Face. FP8 makes deployment on modern GPU clusters (e.g., H100 setups) more practical while keeping quality intact.
  • LLaMA‑4 Scout starts with BF16 weights but supports on-the-fly INT4 quantization, drastically reducing memory usage while staying fast enough for H100 inference.
  • Qwen3 also supports FP8 quantization, enabling leaner deployment across hardware platforms like GPUs and inference engines.
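To see why precision matters at this scale, here is a rough weight-memory estimate at different precisions. The bits-per-parameter figures are idealized (MXFP4 and INT4, for instance, carry extra block-scale metadata), and the estimate ignores activations and KV cache, so treat it as an approximation rather than an exact deployment sizing.

```python
# Rough weight-memory estimate at different precisions (weights only; ignores KV cache,
# activations, and the scale/metadata overhead of real MXFP4/INT4 formats).
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "MXFP4": 4, "INT4": 4}

def weight_memory_gb(num_params_billion, precision):
    total_bits = num_params_billion * 1e9 * BITS_PER_PARAM[precision]
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

for model, params_b, low_precision in [
    ("GPT-OSS-120B", 117, "MXFP4"),
    ("GPT-OSS-20B", 21, "MXFP4"),
    ("Llama-4 Scout", 109, "INT4"),
]:
    for precision in ("BF16", low_precision):
        print(f"{model:<14} {precision:<6} ≈ {weight_memory_gb(params_b, precision):6.1f} GB")

# GPT-OSS-120B drops from ~234 GB of weights in BF16 to roughly 60 GB at 4-bit,
# which is why it can fit on a single 80 GB H100.
```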

We'll dive deeper into these techniques in a future post, so consider this a teaser for how quantization makes massive MoE models truly practical.

In the meantime, try out our Online Quantization on Dedicated Endpoints to experience these benefits firsthand: it automatically compresses your model at load time, cuts GPU costs by 2–4×, and speeds up inference, as introduced in our previous blog post.

Ready to explore these cutting‑edge MoE models for yourself?

You can try them directly in Friendli Suite and see how they perform in your own workflows. Click the links below to deploy each model instantly:

Experience the latest MoE innovation firsthand, from ultra-long context handling to advanced multimodal reasoning, and bring state-of-the-art AI into your projects today.


Written by

FriendliAI Tech & Research




General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, email contact@friendli.ai or click Talk to an expert — our experts (not a bot) will reply within one business day.