DeepSeek-V4-Flash-W4A16-FP8 API & Inference Endpoint

TL;DR

Table

Recommended hardware	2× DGX Spark or 2× RTX PRO 6000, TP=2
Quality	GSM8K 95.07–95.45% strict (8-shot); HumanEval pass@1 78.05–80.49% (strict, `--confirm_run_unsafe_code`)
Throughput	47–48 output tok/s @ bs=1 on RTX PRO 6000 TP=2 (TPOT 20.8 ms); 14–17 tok/s on DGX Spark TP=2
Differentiator	Only quant of V4-Flash that serves on SM 9.x and SM 12.x; baseline for the W4A16-FP8-MTP successor

Table with columns: Repo, Role, Relation to this artifact
Repo	Role	Relation to this artifact
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`	successor	Same recipe + BF16 MTP retained for 1.49× spec-decode speedup at bs=1
`canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP`	sibling	NVFP4 routed experts (Blackwell-native), MTP retained
`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`	larger sibling	V4-Pro at NVFP4 with MTP, B300-only deployment

Why this exists

DeepSeek-V4-Flash launched April 24, 2026 (284 B total / 13 B active, hybrid CSA + HCA attention, hash-routed experts). At release, no merged path through transformers + llm-compressor + vLLM existed for V4 quantization on Hopper or on SM 12.x Blackwell. RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 covered Blackwell datacenter (B100/B200, SM 10.x) via NVFP4 tcgen05 kernels, and Intel/DeepSeek-V4-Flash-W4A16-AutoRound covered W4A16 but explicitly excluded vLLM and SGLang. This artifact fills the gap: W4A16 GPTQ routed experts + FP8 block attention that serves on vLLM at TP=2 on H200 (Hopper SM 9.0a), DGX Spark (Blackwell SM 12.1a), and RTX PRO 6000 (Blackwell SM 12.0) — same weights, three SKUs.

Architecture & precision

Base model

Table with columns: Property, Value
Property	Value
Total parameters	~284 B (~13 B active per token)
Decoder layers	43
Routed experts / layer	256 (top-K = 6)
Hidden size	4096
Base BF16 size	~543 GB
Quantized size	~143 GB
Compression ratio	~3.8×

Component precisions

Table with columns: Component, Format, Method
Component	Format	Method
Routed experts (256 × 43 layers)	W4A16 INT4, group_size=128, symmetric	GPTQ via llm-compressor, `dampening_frac=0.1`
Attention path (`q_a/q_b/kv/o_a/o_b`, compressor, indexer)	FP8_BLOCK 128×128	Dynamic, data-free
Shared experts	BF16	Excluded (kylesayrs PR #41276 incompatibility)
Embeddings, `lm_head`, `hc_head`	BF16	Excluded

Hardware validated

Table with columns: Platform, SM, HBM/GPU, Interconnect, TP, Role
Platform	SM	HBM/GPU	Interconnect	TP	Role
8× NVIDIA H200 SXM5	9.0a	141 GB HBM3e	NVLink	2 (4× replicas)	Calibration + harness baseline
2× NVIDIA DGX Spark (GB10)	12.1a	128 GB unified	NVLink-C2C	2	Long-context production (1M-token graphs-ON)
2× NVIDIA RTX PRO 6000 Blackwell Server Edition	12.0, sm_120

All three SKUs serve cuda graphs ON (no --enforce-eager). Same artifact, no weight changes between SKUs — only vLLM build flags and a few env vars differ.

Benchmarks

Quality

Sampling: greedy, temperature 0. lm-eval-harness via OpenAI-compatible backend pointing at the local vLLM. Methodology disclosed per row.

Table with columns: Benchmark, Setting, 8× H200 (older vLLM build), 2× DGX Spark TP=2, 2× RTX PRO 6000 TP=2
Benchmark	Setting	8× H200 (older vLLM build)	2× DGX Spark TP=2	2× RTX PRO 6000 TP=2
GSM8K	8-shot, flexible-extract	92.87% ± 0.71	95.37% ± 0.58	94.99% ± 0.60
GSM8K	8-shot, strict-match	~~42.61%~~¹ → see note	95.45% ± 0.57	95.07% ± 0.60
MMLU	5-shot	87.27% ± 0.27	(in flight)

¹ The H200 GSM8K strict-match of 42.61% was a chat-format extraction artifact, not a quality regression. The flexible-extract number (92.87%) is the comparable figure. Cross-checked on DGX Spark / RTX PRO 6000 with corrected extraction (95.07–95.45%).

² ³ HumanEval pass@1 on H200 was initially reported as 54.27% under regex-based extraction. The harness was later corrected to use --confirm_run_unsafe_code (executes generated code), which raised the same-artifact score to 80.49%. The Spark and RTX PRO 6000 runs use the corrected methodology; the H200 number is the same artifact re-scored. See Changes for the dated correction.

⁴ Spark toolcall15 is scored across 3 thinking modes (45 cases); H200 / RTX PRO 6000 are single-round (30 cases). Scores normalized to %.

Comparison caveat: the H200 numbers come from an older vLLM build (harness HEAD 85aca32, jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today's ds4-sm120-experimental tip. The valid same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 ↔ Blackwell deltas are informational.

Throughput

vllm bench serve random 1024-in / 1024-out, cuda graphs ON, MTP-spec n/a (this artifact ships without MTP).

Table with columns: Hardware, TP, bs=1 output tok/s, bs=1 TPOT median, bs=2 output tok/s, bs=2 TPOT median
Hardware	TP	bs=1 output tok/s	bs=1 TPOT median	bs=2 output tok/s	bs=2 TPOT median
2× DGX Spark	2	14–17	—	—	—
2× DGX Spark	2 (eager fallback)	3–4	—	—	—
2× RTX PRO 6000	2

Per-stream decode rate on RTX PRO 6000 is rock-stable across concurrency (TPOT mean stays at 21 ms, p99 only 23 ms). Aggregate input+output throughput at bs=2 reaches 420 tok/s.

Quick start

bash
vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name DSV4-W4A16-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code

Required env vars on SM 12.x sparse-MLA path: set VLLM_TRITON_MLA_SPARSE=1 and VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4. Without _HEAD_BLOCK_SIZE=4 the sparse-MLA Triton kernel crashes during warmup with RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered in _dequantize_and_gather_k_kernel (kernel falls back to a default block size that doesn't match V4-Flash's head dim). Full env block at findings/QUICKSTART_DUAL_SPARK.md §4.

Long-context (1M tokens, single stream): drop --max-num-seqs to 1, --gpu-memory-utilization to 0.90, set --max-model-len 1048576 --max-num-batched-tokens 8192 --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'.

Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).

RTX PRO 6000 (SM 12.0) only: set VLLM_USE_FLASHINFER_SAMPLER=0 — vLLM's FlashInfer-based top-p / top-k sampler JIT mis-parses the TORCH_CUDA_ARCH_LIST=12.0a token and incorrectly raises RuntimeError: FlashInfer requires GPUs with sm75 or higher.

Quantization recipe

Table with columns: Property, Value
Property	Value
Dataset	`HuggingFaceH4/ultrachat_200k` (V4 chat template)
Samples	768
Max sequence length	512
Per-rank batch size	4
Hardware	8× NVIDIA H200 (`p5en.48xlarge`)
Walltime	~14 hours

Required calibration environment

bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm

expandable_segments is calibration-only — must not be set during vLLM serving.

What didn't work (recorded so others don't waste cycles)

Table with columns: Config, Result
Config	Result
`samples=1024, bs=32, no offload, no expandable_segments`	OOM at Layer 3 (45–67 GiB activation alloc fail)
`samples=1024, bs=8`, same as above	OOM at Layer 3 (32 GiB alloc fail)
`samples=1024, bs=8, offload_hessians=True`	OOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block)
`samples=1024, bs=4, +offload_hessians, +expandable_segments`	NCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift)
`samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout`

Recipe

python
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme

recipe = GPTQModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=[
                r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
                r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
                r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
            ],
            **FP8_BLOCK,
        ),
        "experts": QuantizationScheme(
            targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
            **W4A16,
        ),
    },
    ignore=["lm_head"],
    offload_hessians=True,
    dampening_frac=0.1,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=768,
    sequential_targets=["DeepseekV4DecoderLayer"],
    batch_size=4,
)

vLLM build

This artifact does not load on vanilla vLLM. Stack:

Table with columns: Component, Pin, Notes
Component	Pin	Notes
`jasl/vllm`	`ds4-sm120-experimental` (or `ds4-sm120` for conservative)	SM12x DSV4 support
kylesayrs deepseek-ct patch	content-pinned, vendored at `scripts/kylesayrs-deepseek-ct.patch`	Rebased successor of `f910a73a93` (force-pushed out of upstream history; see issue #1)

Single-file bootstrap script for dual DGX Spark: scripts/bootstrap_dsv4_spark.sh — does the whole stack zero-to-serving.

Upstream tracker: original PR #40991 (where Spark validation was posted) closed 2026-05-06; current tracker is PR #41834 — "[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes", branch codex/ds4-sm120-min-enable.

Honest limitations

No MTP — transformers 5.8.1's _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"] silently strips MTP keys during calibration load. Speculative decoding cannot fire with this artifact. The W4A16-FP8-MTP successor retains MTP via a patched calibration path and delivers 1.49× spec-decode speedup at bs=1.
TP > 2 blocked by vllm-project/vllm#41511 — W4A16 MoE scale-sharding bug.
H200 numbers from older vLLM build — H200 baseline was scored on jasl/vllm@428e08e (harness HEAD 85aca32). Same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 → Blackwell deltas are informational.
toolcall15 TC-06 (Multi-Value Extraction) and TC-08 (Conditional Branching) also fail on the native FP4/FP8 baseline — V4-Flash model-architecture limits, not quantization defects.
2026-05-25: artifact has shipping issues on upstream vLLM. Two problems were surfaced when attempting to load this artifact on (the post-PR-#40923 build the sibling now uses): Same FP8_BLOCK compressor/indexer shipping bug as the MTP sibling — current vLLM constructs those modules as plain BF16 () and the artifact fails with . The MTP sibling fixed this by dequantizing those weights in-artifact to BF16; . A separate architecture-drift issue: the artifact lacks the tensor that current upstream vLLM's DSV4 loader requires (). Either re-calibration that emits this tensor, or a defensive loader patch upstream is needed. (2026-05-05); they do not currently reproduce on bleeding-edge vLLM. Tracking and re-verification deferred to the next session.

Reproduction

Full toolchain, scripts, patches, mission report: canada-quant/dsv4-flash-w4a16-fp8.

Single-file bootstrap (dual DGX Spark, idempotent, SSH-orchestrated):

bash
curl -fsSLO https://raw.githubusercontent.com/canada-quant/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b

Upstream contributions filed during this work

Table with columns: PR / Issue, Description, Status
PR / Issue	Description	Status
`vllm-project/vllm#41700`	Workspace pre-reservation patch	landed as `jasl/vllm@1d6f5c4`
`vllm-project/vllm#41511`	Marlin MoE TP scale-sharding bug	open — blocks TP>2
`vllm-project/vllm#40991` →

Changes

Table with columns: Date, Change
Date	Change
2026-05-06	DGX Spark TP=2 production canonical at 1M-token context graphs-ON validated on `ds4-sm120-experimental`
2026-05-08	Kylesayrs branch `f910a73a93` force-pushed out of upstream history; vendored content-pinned rebased successor `d09eeb498` at `scripts/kylesayrs-deepseek-ct.patch` (issue #1)
2026-05-19	HumanEval methodology correction: H200 pass@1 was scored at 54.27% under regex extraction; re-scored at 80.49% with `--confirm_run_unsafe_code`. Same artifact, methodology change. Earlier 54.27% number is shown struck through in the quality table
2026-05-23	Workspace pre-reservation patch landed upstream as ; closes our . No local apply needed

Files in the artifact

~30 sharded model-*.safetensors files + model.safetensors.index.json (~143 GB total)
config.json — vLLM-compatible quantization_config (W4A16 + FP8_BLOCK groups)
tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
recipe.yaml — the llm-compressor calibration recipe
chat_template.jinja — upstream DSV4-Flash (unchanged)
README.md — this file

Citation

bibtex
@misc{canada-quant-dsv4-flash-w4a16-fp8-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 for vLLM on Hopper and Blackwell},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

@jasl — DeepSeek-V4 vLLM SM12x base support (PR #40991 → #41834); memory-pressure-release fix e734ace5 that resolved the Blackwell 256K×2 stall.
@kylesayrs — compressed-tensors V4 attention path (PR #41276).
@aabbccddwasd — indexer KV cache layout fix.
@bbbearxyz — SM12x Triton fallback kernels.

TL;DR

Table

Recommended hardware	2× DGX Spark or 2× RTX PRO 6000, TP=2
Quality	GSM8K 95.07–95.45% strict (8-shot); HumanEval pass@1 78.05–80.49% (strict, `--confirm_run_unsafe_code`)
Throughput	47–48 output tok/s @ bs=1 on RTX PRO 6000 TP=2 (TPOT 20.8 ms); 14–17 tok/s on DGX Spark TP=2
Differentiator	Only quant of V4-Flash that serves on SM 9.x and SM 12.x; baseline for the W4A16-FP8-MTP successor

Table with columns: Repo, Role, Relation to this artifact
Repo	Role	Relation to this artifact
`canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP`	successor	Same recipe + BF16 MTP retained for 1.49× spec-decode speedup at bs=1
`canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP`	sibling	NVFP4 routed experts (Blackwell-native), MTP retained
`canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP`	larger sibling	V4-Pro at NVFP4 with MTP, B300-only deployment

Why this exists

Architecture & precision

Base model

Table with columns: Property, Value
Property	Value
Total parameters	~284 B (~13 B active per token)
Decoder layers	43
Routed experts / layer	256 (top-K = 6)
Hidden size	4096
Base BF16 size	~543 GB
Quantized size	~143 GB
Compression ratio	~3.8×

Component precisions

Table with columns: Component, Format, Method
Component	Format	Method
Routed experts (256 × 43 layers)	W4A16 INT4, group_size=128, symmetric	GPTQ via llm-compressor, `dampening_frac=0.1`
Attention path (`q_a/q_b/kv/o_a/o_b`, compressor, indexer)	FP8_BLOCK 128×128	Dynamic, data-free
Shared experts	BF16	Excluded (kylesayrs PR #41276 incompatibility)
Embeddings, `lm_head`, `hc_head`	BF16	Excluded

Hardware validated

Table with columns: Platform, SM, HBM/GPU, Interconnect, TP, Role
Platform	SM	HBM/GPU	Interconnect	TP	Role
8× NVIDIA H200 SXM5	9.0a	141 GB HBM3e	NVLink	2 (4× replicas)	Calibration + harness baseline
2× NVIDIA DGX Spark (GB10)	12.1a	128 GB unified	NVLink-C2C	2	Long-context production (1M-token graphs-ON)
2× NVIDIA RTX PRO 6000 Blackwell Server Edition	12.0, sm_120

All three SKUs serve cuda graphs ON (no --enforce-eager). Same artifact, no weight changes between SKUs — only vLLM build flags and a few env vars differ.

Benchmarks

Quality

Sampling: greedy, temperature 0. lm-eval-harness via OpenAI-compatible backend pointing at the local vLLM. Methodology disclosed per row.

Table with columns: Benchmark, Setting, 8× H200 (older vLLM build), 2× DGX Spark TP=2, 2× RTX PRO 6000 TP=2
Benchmark	Setting	8× H200 (older vLLM build)	2× DGX Spark TP=2	2× RTX PRO 6000 TP=2
GSM8K	8-shot, flexible-extract	92.87% ± 0.71	95.37% ± 0.58	94.99% ± 0.60
GSM8K	8-shot, strict-match	~~42.61%~~¹ → see note	95.45% ± 0.57	95.07% ± 0.60
MMLU	5-shot	87.27% ± 0.27	(in flight)

⁴ Spark toolcall15 is scored across 3 thinking modes (45 cases); H200 / RTX PRO 6000 are single-round (30 cases). Scores normalized to %.

Comparison caveat: the H200 numbers come from an older vLLM build (harness HEAD 85aca32, jasl/vllm@428e08e). Spark and RTX PRO 6000 numbers are on today's ds4-sm120-experimental tip. The valid same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 ↔ Blackwell deltas are informational.

Throughput

vllm bench serve random 1024-in / 1024-out, cuda graphs ON, MTP-spec n/a (this artifact ships without MTP).

Table with columns: Hardware, TP, bs=1 output tok/s, bs=1 TPOT median, bs=2 output tok/s, bs=2 TPOT median
Hardware	TP	bs=1 output tok/s	bs=1 TPOT median	bs=2 output tok/s	bs=2 TPOT median
2× DGX Spark	2	14–17	—	—	—
2× DGX Spark	2 (eager fallback)	3–4	—	—	—
2× RTX PRO 6000	2

Per-stream decode rate on RTX PRO 6000 is rock-stable across concurrency (TPOT mean stays at 21 ms, p99 only 23 ms). Aggregate input+output throughput at bs=2 reaches 420 tok/s.

Quick start

bash
vllm serve canada-quant/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name DSV4-W4A16-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code

Tensor parallelism: TP=2 is the only validated configuration. TP=1 OOMs on a single 141 GB H200; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).

Quantization recipe

Table with columns: Property, Value
Property	Value
Dataset	`HuggingFaceH4/ultrachat_200k` (V4 chat template)
Samples	768
Max sequence length	512
Per-rank batch size	4
Hardware	8× NVIDIA H200 (`p5en.48xlarge`)
Walltime	~14 hours

Required calibration environment

bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600
export TORCH_NCCL_BLOCKING_WAIT=0
export NCCL_TIMEOUT=3600
export TORCH_CUDA_ARCH_LIST=9.0a
sudo mount -o remount,size=1800G /dev/shm

expandable_segments is calibration-only — must not be set during vLLM serving.

What didn't work (recorded so others don't waste cycles)

Table with columns: Config, Result
Config	Result
`samples=1024, bs=32, no offload, no expandable_segments`	OOM at Layer 3 (45–67 GiB activation alloc fail)
`samples=1024, bs=8`, same as above	OOM at Layer 3 (32 GiB alloc fail)
`samples=1024, bs=8, offload_hessians=True`	OOM at Layer 3 (30 GiB alloc fail; fragmentation blocks contiguous block)
`samples=1024, bs=4, +offload_hessians, +expandable_segments`	NCCL collective timeout at Layer 22 (10 min default exceeded by per-rank drift)
`samples=768, bs=4, +offload_hessians, +expandable_segments, +60min NCCL timeout`

Recipe

python
from llmcompressor.modifiers.quantization import GPTQModifier
from compressed_tensors.quantization.quant_scheme import FP8_BLOCK, W4A16, QuantizationScheme

recipe = GPTQModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=[
                r"re:.*self_attn\.(q_a_proj|q_b_proj|kv_proj|o_a_proj|o_b_proj)$",
                r"re:.*self_attn\.compressor\.(gate_proj|kv_proj)$",
                r"re:.*self_attn\.compressor\.indexer\.(gate_proj|kv_proj|q_b_proj|weights_proj)$",
            ],
            **FP8_BLOCK,
        ),
        "experts": QuantizationScheme(
            targets=[r"re:.*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)$"],
            **W4A16,
        ),
    },
    ignore=["lm_head"],
    offload_hessians=True,
    dampening_frac=0.1,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=768,
    sequential_targets=["DeepseekV4DecoderLayer"],
    batch_size=4,
)

vLLM build

This artifact does not load on vanilla vLLM. Stack:

Table with columns: Component, Pin, Notes
Component	Pin	Notes
`jasl/vllm`	`ds4-sm120-experimental` (or `ds4-sm120` for conservative)	SM12x DSV4 support
kylesayrs deepseek-ct patch	content-pinned, vendored at `scripts/kylesayrs-deepseek-ct.patch`	Rebased successor of `f910a73a93` (force-pushed out of upstream history; see issue #1)

Single-file bootstrap script for dual DGX Spark: scripts/bootstrap_dsv4_spark.sh — does the whole stack zero-to-serving.

Honest limitations

No MTP — transformers 5.8.1's _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"] silently strips MTP keys during calibration load. Speculative decoding cannot fire with this artifact. The W4A16-FP8-MTP successor retains MTP via a patched calibration path and delivers 1.49× spec-decode speedup at bs=1.
TP > 2 blocked by vllm-project/vllm#41511 — W4A16 MoE scale-sharding bug.
H200 numbers from older vLLM build — H200 baseline was scored on jasl/vllm@428e08e (harness HEAD 85aca32). Same-software comparison is DGX Spark ↔ RTX PRO 6000; H200 → Blackwell deltas are informational.
toolcall15 TC-06 (Multi-Value Extraction) and TC-08 (Conditional Branching) also fail on the native FP4/FP8 baseline — V4-Flash model-architecture limits, not quantization defects.
2026-05-25: artifact has shipping issues on upstream vLLM. Two problems were surfaced when attempting to load this artifact on (the post-PR-#40923 build the sibling now uses): Same FP8_BLOCK compressor/indexer shipping bug as the MTP sibling — current vLLM constructs those modules as plain BF16 () and the artifact fails with . The MTP sibling fixed this by dequantizing those weights in-artifact to BF16; . A separate architecture-drift issue: the artifact lacks the tensor that current upstream vLLM's DSV4 loader requires (). Either re-calibration that emits this tensor, or a defensive loader patch upstream is needed. (2026-05-05); they do not currently reproduce on bleeding-edge vLLM. Tracking and re-verification deferred to the next session.

Reproduction

Full toolchain, scripts, patches, mission report: canada-quant/dsv4-flash-w4a16-fp8.

Single-file bootstrap (dual DGX Spark, idempotent, SSH-orchestrated):

bash
curl -fsSLO https://raw.githubusercontent.com/canada-quant/dsv4-flash-w4a16-fp8/main/scripts/bootstrap_dsv4_spark.sh
chmod +x bootstrap_dsv4_spark.sh
./bootstrap_dsv4_spark.sh --head-host spark-a --worker-host spark-b

Upstream contributions filed during this work

Table with columns: PR / Issue, Description, Status
PR / Issue	Description	Status
`vllm-project/vllm#41700`	Workspace pre-reservation patch	landed as `jasl/vllm@1d6f5c4`
`vllm-project/vllm#41511`	Marlin MoE TP scale-sharding bug	open — blocks TP>2
`vllm-project/vllm#40991` →

Changes

Table with columns: Date, Change
Date	Change
2026-05-06	DGX Spark TP=2 production canonical at 1M-token context graphs-ON validated on `ds4-sm120-experimental`
2026-05-08	Kylesayrs branch `f910a73a93` force-pushed out of upstream history; vendored content-pinned rebased successor `d09eeb498` at `scripts/kylesayrs-deepseek-ct.patch` (issue #1)
2026-05-19	HumanEval methodology correction: H200 pass@1 was scored at 54.27% under regex extraction; re-scored at 80.49% with `--confirm_run_unsafe_code`. Same artifact, methodology change. Earlier 54.27% number is shown struck through in the quality table
2026-05-23	Workspace pre-reservation patch landed upstream as ; closes our . No local apply needed

Files in the artifact

~30 sharded model-*.safetensors files + model.safetensors.index.json (~143 GB total)
config.json — vLLM-compatible quantization_config (W4A16 + FP8_BLOCK groups)
tokenizer.json, tokenizer_config.json, generation_config.json — upstream DSV4-Flash
recipe.yaml — the llm-compressor calibration recipe
chat_template.jinja — upstream DSV4-Flash (unchanged)
README.md — this file

Citation

bibtex
@misc{canada-quant-dsv4-flash-w4a16-fp8-2026,
  title  = {DeepSeek-V4-Flash W4A16-FP8 for vLLM on Hopper and Blackwell},
  author = {Canada Quant},
  year   = {2026},
  publisher = {Hugging Face},
  url    = {https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8}
}

License

MIT, inherited from upstream deepseek-ai/DeepSeek-V4-Flash.

Acknowledgments

@jasl — DeepSeek-V4 vLLM SM12x base support (PR #40991 → #41834); memory-pressure-release fix e734ace5 that resolved the Blackwell 256K×2 stall.
@kylesayrs — compressed-tensors V4 attention path (PR #41276).
@aabbccddwasd — indexer KV cache layout fix.
@bbbearxyz — SM12x Triton fallback kernels.

DeepSeek-V4-Flash-W4A16-FP8

README

TL;DR

Family / related artifacts

Why this exists

Architecture & precision

Base model

Component precisions

Hardware validated

Benchmarks

Quality

Throughput

Quick start

Quantization recipe

Required calibration environment

What didn't work (recorded so others don't waste cycles)

Recipe

vLLM build

Honest limitations

Reproduction

Upstream contributions filed during this work

Changes

Files in the artifact

Citation

License

Acknowledgments

Explore FriendliAI today

README

TL;DR

Family / related artifacts

Why this exists

Architecture & precision

Base model

Component precisions

Hardware validated

Benchmarks

Quality

Throughput

Quick start

Quantization recipe

Required calibration environment

What didn't work (recorded so others don't waste cycles)

Recipe

vLLM build

Honest limitations

Reproduction

Upstream contributions filed during this work

Changes

Files in the artifact

Citation

License

Acknowledgments