- May 15, 2026
- 5 min read
What's So Special About DeepSeek V4? Find Out On FriendliAI
- Run DeepSeek-V4-Pro (1.6T MoE / 49B active) or DeepSeek-V4-Flash (284B MoE / 13B active) on Friendli Dedicated Endpoints.
- Get a 1M-token context window by default, with three reasoning effort modes (Non-think, Think High, Think Max) on both models.
- DeepSeek-V4-Pro with Think Max enabled led the open-weight field on key benchmarks: 93.5 on LiveCodeBench, a 3206 Codeforces rating, and 80.6 on SWE-bench Verified.
- DeepSeek-V4-Flash achieved scores comparable to Claude Sonnet 4.6 on the Artificial Analysis Intelligence Index at a fraction of the compute footprint.
- Choose Flash for high-throughput agents and routine reasoning or Pro when the workload needs frontier-grade depth.

DeepSeek's V4 Preview is one of the most exciting releases of the year for open-weight inference. In the few weeks since launch, DeepSeek-V4-Pro and DeepSeek-V4-Flash have surged to the top of OpenRouter's most-used open-weight models.
The headline isn't that DeepSeek shipped two new Mixture-of-Experts (MoE) models. It's that the smaller of the two, DeepSeek-V4-Flash, lands near the closed-frontier capability curve at a fraction of the compute footprint, changing the math on what's economically worth running at scale. Meanwhile, DeepSeek-V4-Pro competes with flagship foundation models and leads every open-weight model on agentic coding. Both are live on Friendli Dedicated Endpoints.
Friendli Dedicated Endpoints are purpose-built for these types of workloads, serving high-performance inference for frontier open-weight models with 2–5x faster output speeds and 50–90% lower GPU costs than standard self-hosted alternatives. Follow along to learn what makes V4 special.
Two models, one architecture
DeepSeek-V4-Pro and DeepSeek-V4-Flash are text-in, text-out language models. No vision or audio in this preview — DeepSeek's pitch is depth and context length, not modality breadth. Both expose three reasoning effort modes: Non-think for fast intuitive replies, Think High for deliberate logical analysis, and Think Max for the longest deliberation budget. Both models support a 1-million-token context window out of the box. DeepSeek recommends a context of at least 384K when using Think Max.
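To make the mode switch concrete, here is a minimal sketch of what selecting a reasoning effort might look like through an OpenAI-compatible client. The `reasoning_effort` field, its values, and the base URL are illustrative assumptions, not confirmed parameter names for the V4 preview; check the Friendli docs for the exact interface.

```python
# Minimal sketch: selecting a V4 reasoning mode via an OpenAI-compatible client.
# ASSUMPTIONS: the base URL, endpoint ID, and the `reasoning_effort` field with
# values like "non_think" / "think_high" / "think_max" are illustrative only.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",  # placeholder; copy from your dashboard
    api_key="YOUR_FRIENDLI_TOKEN",
)

response = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",  # the Dedicated Endpoint serving DeepSeek-V4-Pro or -Flash
    messages=[{"role": "user", "content": "Plan a refactor of this 400-file repo."}],
    extra_body={"reasoning_effort": "think_max"},  # hypothetical field for the three modes
)
print(response.choices[0].message.content)
```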
DeepSeek-V4-Flash: near-frontier capability at a fraction of the compute
DeepSeek-V4-Flash is among the few open-weight models to make near-frontier reasoning efficient enough to run by default. With 284B total parameters, 13B active per token, and Think Max enabled, it scores 47 on the Artificial Analysis Intelligence Index — at the time of release, comparable to Claude Sonnet 4.6 at maximum reasoning effort and 17 points above the average model in its class. It posts 91.6 on LiveCodeBench, 88.1 on GPQA Diamond, 86.2 on MMLU-Pro, and 79.0 on SWE-bench Verified, landing within two points of DeepSeek-V4-Pro on most reasoning evals.
Friendli Dedicated Endpoints compound the advantage. Serving DeepSeek-V4-Flash on reserved GPU capacity with FriendliAI's inference stack delivers faster inference and lower compute costs than self-hosted deployments, keeping agentic workloads, code-completion pipelines, and long-context retrieval economically viable. For everyday workloads, DeepSeek-V4-Flash on FriendliAI isn't just the obvious choice; it's what's needed to scale.
DeepSeek-V4-Pro: the open-weight ceiling for agentic coding and long-context reasoning
On knowledge and reasoning benchmarks, DeepSeek-V4-Pro with Think Max posted the top scores among all frontier models measured at the time on LiveCodeBench (93.5), Codeforces (3206), and Apex Shortlist (90.2). It also scored 95.2 on HMMT 2026 Feb, 90.1 on GPQA Diamond, and 89.8 on IMOAnswerBench. On agentic evals, it resolved 80.6% of SWE-bench Verified, 76.2% of SWE Multilingual, and 55.4% of SWE Pro, competitive with Opus-4.6 Max and Gemini-3.1-Pro at the time of its preview release. V4-Pro also pulls away from prior open-weight checkpoints on world knowledge: its 57.9 on SimpleQA-Verified was the highest open-weight score on that benchmark by a wide margin, ahead of Opus-4.6 Max (46.2) and GPT-5.4 xHigh (45.3) and trailing only Gemini-3.1-Pro (75.6).
The architecture is what makes DeepSeek-V4-Pro a deployable frontier model. With hybrid attention (CSA + HCA), the model requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. The result is a model whose per-request compute cost sits one to two orders of magnitude below the closed-frontier systems it benchmarks against, before any infrastructure optimization is applied. That's why DeepSeek-V4-Pro is the model to choose when an agent has to reason over a million-token codebase, an entire research corpus, or a 500-page legal stack while keeping its facts straight.
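The 10% figure is easy to ground with arithmetic. Here is a back-of-the-envelope sketch; the baseline bytes-per-token number is an illustrative assumption rather than a published V3.2 spec, and only the 10% ratio comes from the paragraph above.

```python
# Back-of-the-envelope KV-cache sizing at a 1M-token context.
# ASSUMPTION: ~70 KiB/token for the baseline cache is illustrative only;
# the 0.10 ratio is the V4-Pro-vs-V3.2 figure quoted above.
CONTEXT_TOKENS = 1_000_000
BASELINE_BYTES_PER_TOKEN = 70 * 1024  # assumed V3.2-style cache cost per token
V4_RATIO = 0.10                       # V4-Pro keeps ~10% of the baseline KV cache

baseline_gib = CONTEXT_TOKENS * BASELINE_BYTES_PER_TOKEN / 2**30
v4_gib = baseline_gib * V4_RATIO
print(f"baseline KV cache @ 1M tokens: {baseline_gib:.1f} GiB")
print(f"V4-Pro KV cache   @ 1M tokens: {v4_gib:.1f} GiB")
```

At these assumed numbers, a full 1M-token cache drops from tens of GiB to single digits per request, which is the difference between a context window you advertise and one you can actually serve.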

Under the hood
Both models are sparse Mixture-of-Experts and share the three architectural elements that define the V4 generation:
- Hybrid attention: a combination of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which makes shipping a 1M-token context as the default economically defensible.
- Manifold-Constrained Hyper-Connections (mHC): a tweak to residual connections that DeepSeek reports improves signal stability across deep layers without trimming model expressivity.
- Muon optimizer: replaces the usual AdamW pipeline for faster convergence and steadier training; a sketch of its core step follows this list.
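Of the three, Muon has a well-known public reference: it replaces elementwise Adam-style updates for weight matrices with an approximately orthogonalized momentum, computed by a few Newton-Schulz iterations instead of an expensive SVD. A minimal NumPy sketch of that orthogonalization step, with coefficients from the public Muon reference implementation (whether DeepSeek's internal variant matches exactly is not stated):

```python
# Newton-Schulz orthogonalization at the core of the Muon optimizer: push a
# momentum matrix G toward a (nearly) orthogonal matrix without an SVD.
# Coefficients follow the public Muon reference implementation; DeepSeek's
# exact internal variant is not documented.
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    a, b, c = 3.4445, -4.7750, 2.0315    # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    tall = X.shape[0] > X.shape[1]
    if tall:                              # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

# The orthogonalized momentum is what Muon applies as the weight update.
G = np.random.randn(64, 128)
O = newton_schulz_orthogonalize(G)
err = np.abs(O @ O.T - np.eye(64)).max()
print(f"max deviation from orthogonality: {err:.2f}")  # nonzero: the iteration only approximates
```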
Both models were pre-trained on more than 32T tokens, then post-trained with a two-stage pipeline. Stage one applies independent supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning, a critic-free RL method that scores a group of sampled responses against each other to compute advantages, to a set of domain-specific experts. Stage two uses on-policy distillation to consolidate those experts back into a single unified model.
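Because GRPO is critic-free, the scoring step is compact enough to show directly. A minimal sketch of the group-relative advantage computation, with placeholder rewards:

```python
# GRPO's critic-free advantage computation: sample a group of responses to one
# prompt, score each with a reward function, and standardize rewards within the
# group. No value network is needed; the group mean acts as the baseline.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each response = its reward standardized within the group."""
    mu = mean(rewards)
    sigma = stdev(rewards) or 1e-6  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Placeholder rewards for a group of G = 4 sampled responses to one prompt.
rewards = [0.0, 1.0, 1.0, 0.2]
print(group_relative_advantages(rewards))  # above-average samples get positive advantage
```

Each advantage then weights the policy-gradient update for the tokens of its response, so the model is pushed toward samples that beat their own group's average.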
Which model should you pick?
Both models share the same architecture and 1-million-token context window, but they're tuned for different points on the performance-efficiency curve. Use this table as a guide to choose between them based on your workload.
| Workload | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
|---|---|---|
| Agentic coding on large repos | Best fit — top open-weight scores on SWE-bench Verified (80.6%), SWE Pro, and SWE Multilingual | Strong on routine code completion and daily agent tasks; a sensible step down from Pro when the codebase fits comfortably in context |
| Long-context reasoning (research corpora, legal stacks, full codebases) | Best fit — 83.5 MRCR 1M, 62.0 CorpusQA 1M, and frontier world knowledge keep facts straight at full reach | Capable at long context but trails Pro on knowledge-heavy retrieval (SimpleQA, BrowseComp) |
| Knowledge-intensive Q&A and research agents | Best fit — at launch, 57.9 on SimpleQA-Verified was the highest open-weight score on record | Acceptable for general Q&A; reach for Pro when factual precision matters |
| High-throughput production agents | Use when each request justifies the larger footprint | Best fit — 13B active params deliver most of Pro's reasoning at a fraction of the GPU cost |
| Math, STEM, and competitive reasoning | Best fit — 95.2 HMMT 2026 Feb, 90.1 GPQA Diamond, 3206 Codeforces | Close behind on most evals (88.1 GPQA Diamond); fine for routine reasoning at scale |
| Cost-sensitive or latency-sensitive workloads | Reserve for jobs where the ceiling is worth the spend | Best fit — smaller active footprint means lower TTFT and lower per-request compute cost |
Getting started on Friendli
Here are two quick steps to deploy either model on Dedicated Endpoints for the fastest output speeds on reserved GPUs.
Step 1: Create an account and API key
Sign up at friendli.ai, and generate an API key from the dashboard.
Step 2: Deploy on Friendli Dedicated Endpoints
Deploy DeepSeek-V4 on Dedicated Endpoints by following these instructions (a smoke-test snippet follows the steps):
- From the Friendli Suite, deploy DeepSeek-V4-Pro or DeepSeek-V4-Flash to a new Dedicated Endpoint.
- Pick a GPU architecture from the available hardware tiers based on your throughput and latency targets.
- Set autoscaling bounds so the endpoint elastically tracks demand.
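Once the endpoint reports ready, a quick smoke test confirms it is serving. Here is a minimal sketch using the OpenAI-compatible chat completions route; the base URL and endpoint ID are placeholders you copy from the Friendli Suite dashboard.

```python
# Smoke-test a freshly deployed Dedicated Endpoint with a streaming request.
# The base URL and endpoint ID below are placeholders from your dashboard.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",  # placeholder
    api_key="YOUR_FRIENDLI_TOKEN",
)

stream = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```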
Deploy DeepSeek-V4 today
DeepSeek-V4-Flash is the model that makes near-frontier reasoning financially feasible to run at scale — closed-frontier-class capability with 1-million-token context as the default. DeepSeek-V4-Pro is the ceiling above it: at the time of release, the strongest open-weight model on agentic coding with frontier-tier knowledge and reasoning to match. Both ship with three reasoning effort modes — pick Flash for throughput and Pro when the workload deserves the spend.
Friendli Dedicated Endpoints give you faster output speeds and guaranteed availability on reserved GPU capacity, built to run billion- and trillion-parameter MoE models in production. Sign up at friendli.ai and run your first DeepSeek-V4 request today.
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 550,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting "Friendli Endpoints" on a model page in the Hugging Face Hub takes you straight to our deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the issue that's slowing your growth, email support@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

