• May 15, 2026
  • 5 min read

What's So Special About DeepSeek V4? Find Out On FriendliAI

TL;DR
  • Run DeepSeek-V4-Pro (1.6T MoE / 49B active) or DeepSeek-V4-Flash (284B MoE / 13B active) on Friendli Dedicated Endpoints.
  • Get a 1M-token context window by default, with three reasoning effort modes (Non-think, Think High, Think Max) on both models.
  • DeepSeek-V4-Pro with Think Max enabled led the open-weight field on key benchmarks: 93.5 LiveCodeBench, 3206 Codeforces, and 80.6 SWE-bench Verified.
  • DeepSeek-V4-Flash achieved scores comparable to Claude Sonnet 4.6 on the Artificial Analysis Intelligence Index at a fraction of the compute footprint.
  • Choose Flash for high-throughput agents and routine reasoning or Pro when the workload needs frontier-grade depth.

DeepSeek's V4 Preview is one of the most exciting releases of the year for open-weight inference. In the few weeks since launch, DeepSeek-V4-Pro and DeepSeek-V4-Flash have surged to the top of the most-used open-weight models on OpenRouter.

The headline isn't that DeepSeek shipped two new Mixture-of-Experts (MoE) models. It's that the smaller of the two, DeepSeek-V4-Flash, lands near the closed-frontier capability curve at a fraction of the compute footprint, changing the math on what's economically worth running at scale. Meanwhile, DeepSeek-V4-Pro competes with flagship foundation models and leads every open-weight model on agentic coding. Both are live on Friendli Dedicated Endpoints.

Friendli Dedicated Endpoints are purpose-built for these types of workloads, serving high-performance inference for frontier open-weight models with 2–5x faster output speeds and 50–90% lower GPU costs than standard self-hosted alternatives. Follow along to learn what makes V4 special.

Two models, one architecture

DeepSeek-V4-Pro and DeepSeek-V4-Flash are text-in, text-out language models. No vision or audio in this preview — DeepSeek's pitch is depth and context length, not modality breadth. Both expose three reasoning effort modes: Non-think for fast intuitive replies, Think High for deliberate logical analysis, and Think Max for the longest deliberation budget. Both models support a 1-million-token context window out of the box. DeepSeek recommends a context of at least 384K when using Think Max.
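As a sketch of how you might select a reasoning mode per request, here is a minimal payload builder for an OpenAI-compatible chat completions endpoint. Note that the `reasoning_effort` field name and its values are illustrative assumptions, not a confirmed API surface — check the Friendli documentation for the exact parameter your endpoint expects.

```python
# Build an OpenAI-compatible chat payload that selects a reasoning mode.
# NOTE: "reasoning_effort" and its allowed values are illustrative
# assumptions; consult the Friendli docs for the exact field name.

def build_payload(prompt: str, effort: str = "non-think") -> dict:
    """Return a chat completion request body for a DeepSeek-V4 endpoint."""
    assert effort in {"non-think", "think-high", "think-max"}
    return {
        "model": "deepseek-v4-flash",        # or "deepseek-v4-pro"
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,          # hypothetical parameter name
        "max_tokens": 1024,
    }

payload = build_payload("Summarize this repo's build system.", "think-high")
```

Keeping the mode a per-request knob lets one deployed endpoint serve both quick intuitive replies and long Think Max deliberation without redeploying.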

DeepSeek-V4-Flash: near-frontier capability at a fraction of the compute

DeepSeek-V4-Flash is among the few open-weight models to make near-frontier reasoning efficient enough to run by default. With 284B total parameters, 13B active per token, and Think Max enabled, it scores 47 on the Artificial Analysis Intelligence Index — at the time of release, comparable to Claude Sonnet 4.6 at maximum reasoning effort and 17 points above the average model in its class. It posts 91.6 on LiveCodeBench, 88.1 on GPQA Diamond, 86.2 on MMLU-Pro, and 79.0 on SWE-bench Verified, landing within two points of DeepSeek-V4-Pro on most reasoning evals.

Friendli Dedicated Endpoints compound the advantage. Serving DeepSeek-V4-Flash on reserved GPU capacity with FriendliAI's inference stack delivers faster inference and lower compute costs than self-hosted deployments. This ensures that agentic workloads, code-completion pipelines, and long-context retrieval running on DeepSeek-V4-Flash are economically viable. For everyday workloads, DeepSeek-V4-Flash on FriendliAI isn't just the obvious choice; it's what’s needed to scale.

DeepSeek-V4-Pro: the open-weight ceiling for agentic coding and long-context reasoning

On knowledge and reasoning benchmarks, DeepSeek-V4-Pro with Think Max posted the top score among all measured frontier models at the time on LiveCodeBench (93.5), Codeforces (3206), and Apex Shortlist (90.2). It also scored 95.2 on HMMT 2026 Feb, 90.1 on GPQA Diamond, and 89.8 on IMOAnswerBench. On agentic evals, it resolved 80.6% of SWE-bench Verified, 76.2% of SWE Multilingual, and 55.4% of SWE Pro — competitive with Opus-4.6 Max and Gemini-3.1-Pro at the time of its preview release. DeepSeek-V4-Pro pulls away from prior open-weight checkpoints on benchmarks related to world knowledge: 57.9 on SimpleQA-Verified was, at the time of its preview release, the highest open-source score on that benchmark by a wide margin. It ranked higher than Opus-4.6 Max (46.2) and GPT-5.4 xHigh (45.3), trailing only Gemini-3.1-Pro (75.6).

The architecture is what makes DeepSeek-V4-Pro a frontier model. With hybrid attention (CSA + HCA), the model requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. The result is a model whose per-request compute cost sits one to two orders of magnitude below the closed-frontier systems it benchmarks against, before any infrastructure optimization is applied. That's why DeepSeek-V4-Pro is the model you choose when an agent has to reason over a million-token codebase, an entire research corpus, or a 500-page legal stack while keeping its facts straight.

[Figure] Benchmark results for DeepSeek-V4-Flash and DeepSeek-V4-Pro

Under the hood

Both models are sparse Mixture-of-Experts, and they share the three architectural elements that define the V4 generation.

  • Hybrid attention: a combination of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which makes shipping a 1M-token context as the default economically defensible.
  • Manifold-Constrained Hyper-Connections (mHC): a tweak to residual connections that DeepSeek reports improves signal stability across deep layers without trimming model expressivity.
  • Muon optimizer: replaces the usual AdamW pipeline for faster convergence and steadier training.

Both models were pre-trained on more than 32T tokens, then post-trained with a two-stage pipeline. First, domain-specific experts were independently trained with supervised fine-tuning (SFT) and Group Relative Policy Optimization-based reinforcement learning (GRPO-based RL) — a critic-free RL method that scores a group of sampled responses against each other to compute advantages. Second, on-policy distillation consolidated those experts back into a single unified model.
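To illustrate the critic-free idea behind GRPO: each sampled response's advantage is its reward standardized against the group's mean and standard deviation, so no learned value network is needed as a baseline. A minimal sketch (illustrative only, not DeepSeek's training code):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each response in a sampled group against the others.

    The advantage is the reward's z-score within the group, which
    replaces the baseline a learned critic would otherwise provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a reward model:
advs = grpo_advantages([0.9, 0.1, 0.5, 0.5])
# Responses above the group mean receive positive advantages.
```

Because the baseline comes from the group itself, advantages always sum to roughly zero within a group, pushing probability mass from below-average responses toward above-average ones.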

Which model should you pick?

Both models share the same architecture and 1-million-token context window, but they're tuned for different points on the performance-efficiency curve. Use this table as a guide to choose between them based on your workload.

| Workload | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
| --- | --- | --- |
| Agentic coding on large repos | Best fit — top open-weight scores on SWE-bench Verified (80.6%), SWE Pro, and SWE Multilingual | Strong on routine code completion and daily agent tasks; step down when the codebase fits comfortably in context |
| Long-context reasoning (research corpora, legal stacks, full codebases) | Best fit — 83.5 MRCR 1M, 62.0 CorpusQA 1M, and frontier world knowledge keep facts straight at full reach | Capable at long context but trails Pro on knowledge-heavy retrieval (SimpleQA, BrowseComp) |
| Knowledge-intensive Q&A and research agents | Best fit — at launch, 57.9 on SimpleQA-Verified was the highest open-weight score on record | Acceptable for general Q&A; reach for Pro when factual precision matters |
| High-throughput production agents | Use when each request justifies the larger footprint | Best fit — 13B active params deliver most of Pro's reasoning at a fraction of the GPU cost |
| Math, STEM, and competitive reasoning | Best fit — 95.2 HMMT 2026 Feb, 90.1 GPQA Diamond, 3206 Codeforces | Close behind on most evals (88.1 GPQA Diamond); fine for routine reasoning at scale |
| Cost-sensitive or latency-sensitive workloads | Reserve for jobs where the ceiling is worth the spend | Best fit — smaller active footprint means lower TTFT and lower per-request compute cost |

Getting started on Friendli

Here are two quick steps to deploy either model on Dedicated Endpoints for the fastest output speeds on reserved GPUs.

Step 1: Create an account and API key

Sign up at friendli.ai, and generate an API key from the dashboard.

Step 2: Deploy on Friendli Dedicated Endpoints

Deploy DeepSeek V4 on Dedicated Endpoints by following these instructions:

  • From the Friendli Suite, deploy DeepSeek-V4-Pro or DeepSeek-V4-Flash to a new Dedicated Endpoint.
  • Pick a GPU architecture from the available hardware tiers based on your throughput and latency targets.
  • Set autoscaling bounds so the endpoint elastically tracks demand.
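Once the endpoint is live, you can call it over its OpenAI-compatible chat completions interface. The base URL and endpoint ID below are illustrative placeholders — copy the real values from your Friendli Suite dashboard after deployment:

```python
import json
import os
import urllib.request

# Assumed base URL — replace with the URL shown for your deployment.
BASE_URL = "https://api.friendli.ai/dedicated/v1/chat/completions"

def build_request(endpoint_id: str, prompt: str, token: str) -> urllib.request.Request:
    """Assemble an OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": endpoint_id,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        BASE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_request(
    "YOUR_ENDPOINT_ID",  # hypothetical placeholder for your endpoint ID
    "Hello, DeepSeek-V4!",
    os.environ.get("FRIENDLI_TOKEN", ""),
)
# response = urllib.request.urlopen(req)  # sends the request when uncommented
```

Any OpenAI-compatible SDK will also work by pointing its base URL at the endpoint and passing your Friendli API key as the bearer token.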

Deploy DeepSeek-V4 today

DeepSeek-V4-Flash is the model that makes near-frontier reasoning financially feasible to run at scale — closed-frontier-class capability with 1-million-token context as the default. DeepSeek-V4-Pro is the ceiling above it: at the time of release, the strongest open-weight model on agentic coding with frontier-tier knowledge and reasoning to match. Both ship with three reasoning effort modes — pick Flash for throughput and Pro when the workload deserves the spend.

Friendli Dedicated Endpoints give you faster output speeds and guaranteed availability on reserved GPU capacity, built to run billion- and trillion-parameter MoE models in production. Sign up at friendli.ai and run your first DeepSeek-V4 request today.


Written by

FriendliAI Tech & Research




General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 550,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub gives you one-click deploy, taking you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email support@friendli.ai or click Talk to an engineer — our engineers (not a bot) will reply within one business day.


Explore FriendliAI today