Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

TL;DR

Recommended hardware2× DGX Spark or 2× RTX PRO 6000, TP=2
QualityGSM8K 95.07–95.45% strict (8-shot); HumanEval pass@1 78.05–80.49% (strict, --confirm_run_unsafe_code)
Throughput47–48 output tok/s @ bs=1 on RTX PRO 6000 TP=2 (TPOT 20.8 ms); 14–17 tok/s on DGX Spark TP=2
DifferentiatorOnly quant of V4-Flash that serves on SM 9.x and SM 12.x; baseline for the W4A16-FP8-MTP successor

Family / related artifacts

RepoRoleRelation to this artifact
canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTPsuccessorSame recipe + BF16 MTP retained for 1.49× spec-decode speedup at bs=1
canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTPsiblingNVFP4 routed experts (Blackwell-native), MTP retained
canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTPlarger siblingV4-Pro at NVFP4 with MTP, B300-only deployment
RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8upstream referenceOriginal mixed-precision topology (NVFP4 experts + FP8 attention) we adapted to W4A16

Why this exists

DeepSeek-V4-Flash launched April 24, 2026 (284 B total / 13 B active, hybrid CSA + HCA attention, hash-routed experts). At release, no merged path through transformers + llm-compressor + vLLM existed for V4 quantization on Hopper or on SM 12.x Blackwell. RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 covered Blackwell datacenter (B100/B200, SM 10.x) via NVFP4 tcgen05 kernels, and Intel/DeepSeek-V4-Flash-W4A16-AutoRound covered W4A16 but explicitly excluded vLLM and SGLang. This artifact fills the gap: W4A16 GPTQ routed experts + FP8 block attention that serves on vLLM at TP=2 on H200 (Hopper SM 9.0a), DGX Spark (Blackwell SM 12.1a), and RTX PRO 6000 (Blackwell SM 12.0) — same weights, three SKUs.

Architecture & precision

Base model

PropertyValue
Total parameters~284 B (~13 B active per token)
Decoder layers43
Routed experts / layer256 (top-K = 6)
Hidden size4096
Base BF16 size~543 GB
Quantized size~143 GB
Compression ratio~3.8×

Component precisions

ComponentFormatMethod
Routed experts (256 × 43 layers)W4A16 INT4, group_size=128, symmetricGPTQ via llm-compressor, dampening_frac=0.1
Attention path (q_a/q_b/kv/o_a/o_b, compressor, indexer)FP8_BLOCK 128×128Dynamic, data-free
Shared expertsBF16Excluded (kylesayrs PR #41276 incompatibility)
Embeddings, lm_head, hc_headBF16Excluded
MTP blockdropped at loadRemoved by transformers _keys_to_ignore_on_load_unexpected — see W4A16-FP8-MTP successor for the retention recipe

Hardware validated

PlatformSMHBM/GPUInterconnectTPRole
8× NVIDIA H200 SXM59.0a141 GB HBM3eNVLink2 (4× replicas)Calibration + harness baseline
2× NVIDIA DGX Spark (GB10)12.1a128 GB unifiedNVLink-C2C2Long-context production (1M-token graphs-ON)
2× NVIDIA RTX PRO 6000 Blackwell Server Edition12.0, sm_12096 GB HBMPCIe2Workstation Blackwell deployment

All three SKUs serve cuda graphs ON (no --enforce-eager). Same artifact, no weight changes between SKUs — only vLLM build flags and a few env vars differ.

Benchmarks

Quality

Sampling: greedy, temperature 0. lm-eval-harness via OpenAI-compatible backend pointing at the local vLLM. Methodology disclosed per row.

BenchmarkSetting8× H200 (older vLLM build)2× DGX Spark TP=22× RTX PRO 6000 TP=2
GSM8K8-shot, flexible-extract92.87% ± 0.7195.37% ± 0.5894.99% ± 0.60
GSM8K8-shot, strict-match~~42.61%~~¹ → see note95.45% ± 0.5795.07% ± 0.60
MMLU5-shot87.27% ± 0.27(in flight)(pending)
HumanEval0-shot pass@1 (instruct, --confirm_run_unsafe_code)54.27% ± 3.9² → 80.49% ± 3.10³80.49% ± 3.1078.05% ± 3.24
chat-smoke (quick / quality / coding)harness4/4 · 4/4 · 2/24/4 · 4/4 · 2/24/4 · 4/4 · 2/2
toolcall151 round, 30 points26/30 (87%)41/45 (92%)⁴27/30 (90%)
NIAH long-context (75K → 500K single)retrieval4/4 retrieval5/5 retrieval
NIAH 256K × 2 concurrentretrievalfix landed in jasl@e734ace54/4 (377 s)

¹ The H200 GSM8K strict-match of 42.61% was a chat-format extraction artifact, not a quality regression. The flexible-extract number (92.87%) is the comparable figure. Cross-checked on DGX Spark / RTX PRO 6000 with corrected extraction (95.07–95.45%).

² ³ HumanEval pass@1 on H200 was initially reported as 54.27% under regex-based extraction. The harness was later corrected to use --confirm_run_unsafe_code (executes generated code), which raised the same-artifact score to 80.49%. The Spark and RTX PRO 6000 runs use the corrected methodology; the H200 number is the same artifact re-scored. See Changes for the dated correction.

⁴ Spark toolcall15 is scored across 3 thinking modes (45 cases); H200 / RTX PRO 6000 are single-round (30 cases). Scores normalized to %.

Comparison caveat: the H200 numbers come from an older vLLM build (harness HEAD 85aca32, jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today's ds4-sm120-experimental tip. The valid same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 ↔ Blackwell deltas are informational.

Throughput

vllm bench serve random 1024-in / 1024-out, cuda graphs ON, MTP-spec n/a (this artifact ships without MTP).

HardwareTPbs=1 output tok/sbs=1 TPOT medianbs=2 output tok/sbs=2 TPOT median
2× DGX Spark214–17
2× DGX Spark2 (eager fallback)3–4
2× RTX PRO 6000247.520.8 ms84.021.7 ms

Per-stream decode rate on RTX PRO 6000 is rock-stable across concurrency (TPOT mean stays at 21 ms, p99 only 23 ms). Aggregate input+output throughput at bs=2 reaches 420 tok/s.

Quick start

bash

vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
--served-model-name DSV4-W4A16-FP8 \
--tensor-parallel-size 2 \
--kv-cache-dtype fp8 \
--block-size 256 \
--max-model-len 16384 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.92 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--trust-remote-code

Required env vars on SM 12.x sparse-MLA path: set VLLM_TRITON_MLA_SPARSE=1 and VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4. Without _HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel crashes during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered in _dequantize_and_gather_k_kernel (kernel falls back to a default block size that doesn't match V4-Flash's head dim). Full env block at findings/QUICKSTART_DUAL_SPARK.md §4.

Long-context (1M tokens, single stream): drop --max-num-seqs to 1, --gpu-memory-utilization to 0.90, set --max-model-len 1048576 --max-num-batched-tokens 8192 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.

Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).

RTX PRO 6000 (SM 12.0) only: set VLLM_USE_FLASHINFER_SAMPLER=0 — vLLM's FlashInfer-based top-p / top-k sampler JIT mis-parses the TORCH_CUDA_ARCH_LIST=12.0a token and incorrectly raises RuntimeError: FlashInfer requires GPUs with sm75 or higher.

Quantization recipe

PropertyValue
DatasetHuggingFaceH4/ultrachat_200k (V4 chat template)
Samples768
Max sequence length512
Per-rank batch size4
Hardware8× NVIDIA H200 (p5en.48xlarge)
Walltime~14 hours

Required calibration environment

bash

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm

expandable_segments is calibration-only — must not be set during vLLM serving.

What didn't work (recorded so others don't waste cycles)

ConfigResult
samples=1024, bs=32, no offload, no expandable_segmentsOOM at Layer 3 (45–67 GiB activation alloc fail)
samples=1024, bs=8, same as aboveOOM at Layer 3 (32 GiB alloc fail)
samples=1024, bs=8, offload_hessians=TrueOOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block)
samples=1024, bs=4, +offload_hessians, +expandable_segmentsNCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift)
samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeoutSucceeded — 14h end-to-end
sequential_targets=["Linear"] (any sample count)torch.fx.proxy.TraceError on DeepseekV4Indexer.wrapped_1's data-dependent control flow — would need is_leaf_module patch to register Indexer as leaf

Recipe

python

from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme
recipe = GPTQModifier(
config_groups={
"attention": QuantizationScheme(
targets=[
r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
],
**FP8_BLOCK,
),
"experts": QuantizationScheme(
targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
**W4A16,
),
},
ignore=["lm_head"],
offload_hessians=True,
dampening_frac=0.1,
)
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=512,
num_calibration_samples=768,
sequential_targets=["DeepseekV4DecoderLayer"],
batch_size=4,
)

vLLM build

This artifact does not load on vanilla vLLM. Stack:

ComponentPinNotes
jasl/vllmds4-sm120-experimental (or ds4-sm120 for conservative)SM12x DSV4 support
kylesayrs deepseek-ct patchcontent-pinned, vendored at scripts/kylesayrs-deepseek-ct.patchRebased successor of f910a73a93 (force-pushed out of upstream history; see issue #1)
packed_modules_mapping patchpatches/packed_modules_mapping.diffRequired as of abad5dc71 (2026-05-05) — kylesayrs patch doesn't add this attribute
Workspace pre-reservation patchlanded upstream as jasl/vllm@1d6f5c4Was vllm-project/vllm#41700 — no longer needs local apply

Single-file bootstrap script for dual DGX Spark: scripts/bootstrap_dsv4_spark.sh — does the whole stack zero-to-serving.

Upstream tracker: original PR #40991 (where Spark validation was posted) closed 2026-05-06; current tracker is PR #41834"[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes", branch codex/ds4-sm120-min-enable.

Honest limitations

  • No MTPtransformers 5.8.1's _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"] silently strips MTP keys during calibration load. Speculative decoding cannot fire with this artifact. The W4A16-FP8-MTP successor retains MTP via a patched calibration path and delivers 1.49× spec-decode speedup at bs=1.
  • TP > 2 blocked by vllm-project/vllm#41511 — W4A16 MoE scale-sharding bug.
  • H200 numbers from older vLLM build — H200 baseline was scored on jasl/vllm@428e08e (harness HEAD 85aca32). Same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 → Blackwell deltas are informational.
  • toolcall15 TC-06 (Multi-Value Extraction) and TC-08 (Conditional Branching) also fail on the native FP4/FP8 baseline — V4-Flash model-architecture limits, not quantization defects.
  • 2026-05-25: artifact has shipping issues on current upstream vLLM. Two problems were surfaced when attempting to load this artifact on jasl/vllm@a02a3778f (the post-PR-#40923 build the sibling W4A16-MTP card now uses): (1) Same FP8_BLOCK compressor/indexer shipping bug as the MTP sibling — current vLLM constructs those modules as plain BF16 (quant_config=None) and the artifact fails with KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'. The MTP sibling fixed this by dequantizing those weights in-artifact to BF16; this artifact has not yet had that fix applied. (2) A separate architecture-drift issue: the artifact lacks the layers.N.ffn.gate.e_score_correction_bias tensor that current upstream vLLM's DSV4 loader requires (KeyError). Either re-calibration that emits this tensor, or a defensive .get() loader patch upstream is needed. The published H200/Spark/RTX PRO 6000 numbers above remain valid for their original jasl/vllm@ds4-sm120-experimental@abad5dc71 build (2026-05-05); they do not currently reproduce on bleeding-edge vLLM. Tracking and re-verification deferred to the next session.

Reproduction

Full toolchain, scripts, patches, mission report: canada-quant/dsv4-flash-w4a16-fp8.

Single-file bootstrap (dual DGX Spark, idempotent, SSH-orchestrated):

bash

curl -fsSLO https://raw.githubusercontent.com/canada-quant/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b

Upstream contributions filed during this work

PR / IssueDescriptionStatus
vllm-project/vllm#41700Workspace pre-reservation patchlanded as jasl/vllm@1d6f5c4
vllm-project/vllm#41511Marlin MoE TP scale-sharding bugopen — blocks TP>2
vllm-project/vllm#40991#41834SM12x DeepSeek V4 base supportopen (jasl)
vllm-project/vllm#41276compressed-tensors V4 attention pathopen (kylesayrs)

Changes

DateChange
2026-05-06DGX Spark TP=2 production canonical at 1M-token context graphs-ON validated on ds4-sm120-experimental
2026-05-08Kylesayrs branch f910a73a93 force-pushed out of upstream history; vendored content-pinned rebased successor d09eeb498 at scripts/kylesayrs-deepseek-ct.patch (issue #1)
2026-05-19HumanEval methodology correction: H200 pass@1 was scored at 54.27% under regex extraction; re-scored at 80.49% with --confirm_run_unsafe_code. Same artifact, methodology change. Earlier 54.27% number is shown struck through in the quality table
2026-05-23Workspace pre-reservation patch landed upstream as jasl/vllm@1d6f5c4; closes our #41700. No local apply needed
2026-05-24RTX PRO 6000 Blackwell (SM 12.0) added to validated hardware — chat-smoke 4/4, toolcall15 27/30 (90%), GSM8K 95.07%, NIAH 256K × 2 concurrent PASS
2026-05-25Two shipping issues surfaced when re-testing on current upstream vLLM (jasl/vllm@a02a3778f). (1) Same FP8 compressor/indexer load-failure as the W4A16-MTP sibling — fixable via the same in-artifact BF16 dequant; not yet applied to this artifact. (2) Architecture-drift KeyError: 'layers.N.ffn.gate.e_score_correction_bias' — Card A's older safetensors (calibrated 2026-05-06) don't contain a tensor that current vLLM's DSV4 loader expects; needs re-calibration or a defensive loader patch. Published RTX PRO 6000 numbers above remain valid for the May-5 jasl build; current-build re-verification deferred. See session_summary_2026_05_24.md.

Files in the artifact

  • ~30 sharded model-*.safetensors files + model.safetensors.index.json (~143 GB total)
  • config.json — vLLM-compatible quantization_config (W4A16 + FP8_BLOCK groups)
  • tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
  • recipe.yaml — the llm-compressor calibration recipe
  • chat_template.jinja — upstream DSV4-Flash (unchanged)
  • README.md — this file

Citation

bibtex

@misc{canada-quant-dsv4-flash-w4a16-fp8-2026,
title = {DeepSeek-V4-Flash W4A16-FP8 for vLLM on Hopper and Blackwell},
author = {Canada Quant},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

Model provider

canada-quant

Model tree

Base

deepseek-ai/DeepSeek-V4-Flash

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today