LordNeel

DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

README

License: apache-2.0

Results — single-stream decode TPS, greedy, vs no-MTP baseline

Hardware: 2× NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (sm_120, 96 GB ea., no NVLink, NUMA NODE PCIe), driver 580.126.09, NCCL 2.28.9, jasl/vllm b158e5001 + cherry-picks + Acti's MTP-loader patches.

Headline decode TPS — base vs Acti MTP at 524k and 128k single-stream

Table with columns: Profile, Decode TPS, TTFT, Δ vs base
Profile	Decode TPS	TTFT	Δ vs base
Base (pasta-paul, no MTP), 524k	52.85	91 ms	0% (reference)
This model (v2 GPTQ), 524k	85.52	155 ms	+62% (1.62×)
This model (v2 GPTQ), 128k single-stream	~111	~310 ms	+110% (2.10×)

(v1 RTN release was 83.97 tok/s @ 524k; v2 GPTQ adds ~+2% on top of that. Most of the speedup over base comes from MTP self-speculation; the GPTQ vs RTN delta is the MTP draft acceptance-rate improvement from better expert weights — now measured, see below.)

MTP draft acceptance — measured

The decode speedup above is entirely MTP self-speculation: each model forward drafts one extra token that the main model then verifies in the same step. The draft acceptance rate — the fraction of drafted tokens the verifier keeps — determines how much of the theoretical 2× is realized. These numbers were measured post-release from live vLLM speculative-decoding metrics (vllm:spec_decode_num_accepted_tokens / _num_draft_tokens, extracted by scripts/summarize_spec_decode_metrics.py), not estimated.

MTP draft acceptance by workload

Table with columns: Serving profile, Windows, Accepted / drafted, Weighted acceptance, Mean accepted length
Serving profile	Windows	Accepted / drafted	Weighted acceptance	Mean accepted length
General chat, 128k	5	2,354 / 3,030	77.7%	1.78
General chat, 262k (production)	25	15,274 / 17,991	84.9%	1.85
Long-context research variant, 262k¹	3	319 / 357

num_speculative_tokens = 1, so the ceiling for mean accepted length is 2.0; the production profile realizes 1.85 (92.5% of ceiling).
Acceptance tracks next-token predictability: free-form chat sits at ~78–85%, structured/code-heavy generation climbs into the low-to-mid 90s.
Acceptance affects speed only, never output quality. Every accepted draft is verified by the full model, so emitted text is bit-identical to plain greedy decoding without MTP.

¹ A shared-FP8 research variant (different shared-expert quant, same MTP head); small sample, included for range. ² From the extended-coding reliability gate (candidate v5/v8 runs) and adapter-assisted — reflects a code-specialized runtime configuration, not the bare checkpoint. Listed separately for that reason.

Concurrency profiles — real measurements

The single-stream numbers above are one point on a curve. Below are the multi-user concurrency results from a thread-pool sweep on the same hardware. Different --max-model-len profiles trade off context length against how many concurrent users you can serve.

Aggregate decode TPS by profile

Table with columns: Profile, Max concurrent users, Per-stream TPS at max, Aggregate TPS, Recommended for
Profile	Max concurrent users	Per-stream TPS at max	Aggregate TPS	Recommended for
128k (`--max-model-len 131072 --max-num-seqs 8 --gpu-memory-utilization 0.90`)	8	57	467	High-concurrency chat / agent fleet
256k (`--max-model-len 262144 --max-num-seqs 4 --gpu-memory-utilization 0.95`)	4	76	296	Medium-concurrency, longer docs / RAG

Per-stream scaling within each profile:

Per-stream TPS vs concurrent streams, by profile

Table with columns: N concurrent, 128k profile, 256k profile, 524k profile
N concurrent	128k profile	256k profile	524k profile
1	75	73	87
2	80	75	62
3	(anomalous)	(anomalous)	46
4	76	76	—
8

(The N=3 anomaly is reproducible — likely a vLLM MTP-draft batching artifact at odd counts. Even N values scale cleanly.)

Real-world workload sweep — 15-prompt suite

Beyond the synthetic decode bench, a 15-prompt real-world suite (chat, coding, retrieval, writing, planning, with system prompt + tools enabled, no max_tokens cap) averaged 84.48 tok/s at the 524k profile and 79.82 tok/s at the 640k profile — confirming the headline numbers hold under realistic usage:

Real-world workload TPS — 524k vs 640k

Public-benchmark snapshot

Public benchmark accuracy — GSM8K, HumanEval, MMLU

Table with columns: Benchmark, Sample, Score
Benchmark	Sample	Score
GSM8K (T=0, COT-style, exact-match on `#### N`)	100	93.0% (93/100)
HumanEval pass@1 (T=0, greedy, subprocess-exec tests)	164 (full)	96.3% (158/164)
MMLU (T=0, mixed subjects, max_tokens=128 to leave room for reasoning)	100	53.0% (53/100) ¹
Internal capability eval (math/code/reasoning/knowledge/instruction/longform/tools)	14	13/14 (92.9%)

The HumanEval result is the full 164-problem set with real unit-test execution (greedy T=0, content-only extraction, 60 s test timeout, sandboxed subprocess). Of the 6 failures: 3 were NameError (model named the function differently from the prompt's entry_point), 2 were AssertionError (real wrong answers), 1 was AttributeError (used list.add instead of list.append). Total wall time for the full run: 11.3 min at concurrency 2 on this hardware.

¹ The MMLU sample mixed all 57 subjects uniformly (incl. hard categories like formal_logic, college_*, professional_law). The model's main forward path is identical to pastapaul/DeepSeek-V4-Flash-W4A16-FP8, so headline numbers track that base — refer to base-model evals for tight numbers across full leaderboards.

How to run (production canonical, on this hardware family)

vLLM does not load this model on a vanilla install. You need:

The patched vLLM fork: jasl/vllm at b158e5001 (or ds4-sm120-experimental tip), with the Acti MTP patches applied (one-line additions to deepseek_v4_mtp.py — see Patches below).
The DSV4-Flash-FP8 toolchain: kylesayrs/deepseek-ct cherry-pick, the pasta-paul packed_modules_mapping patch, torch 2.11.0+cu128, compressed-tensors >= 0.15.0.1.
On RTX PRO 6000 Max-Q workstation cards (no NVLink): you must pass --disable-custom-all-reduce — vLLM's CustomAllreduce uses CUDA P2P that deadlocks on this PCIe-only topology, independent of NCCL's NCCL_P2P_DISABLE.

A complete reproducible workspace, including all the patches and a one-shot install script, is at:

https://github.com/pasta-paul/dsv4-flash-w4a16-fp8 (base model + serving stack)

Plus the additional Acti changes documented in the "Acti additions" section below.

Docker deployment (recommended for portability)

A reproducible Docker image build is provided in docker/. The image bakes in the patched vLLM fork, applies all three Acti MTP-loader patches inline via a deterministic patcher script, and exposes a vLLM OpenAI-compatible API on port 8000. Model weights are mounted as a volume so the image stays ~12 GB (instead of ~160 GB).

bash
# 1. Clone the build assets
git clone https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 dsv4
cd dsv4/docker

# 2. Build the image (~25-45 min for CUDA kernel compile)
docker build -t dsv4-flash-acti-mtp:0.1.0 .

# 3. Download the model weights
hf download LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --local-dir $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

# 4. Run (the production 524k config is the default)
docker run --rm --gpus all \
  --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:/models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:ro \
  dsv4-flash-acti-mtp:0.1.0

Switch profile via env vars at run time. Two examples:

bash
# High-concurrency 128k (8 concurrent users, 467 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=131072 -e MAX_NUM_SEQS=8 -e GPU_MEMORY_UTILIZATION=0.90 \
  dsv4-flash-acti-mtp:0.1.0

# Medium 256k (4 concurrent users, 296 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=262144 -e MAX_NUM_SEQS=4 -e GPU_MEMORY_UTILIZATION=0.95 \
  dsv4-flash-acti-mtp:0.1.0

To publish your build to a registry (Docker Hub / ghcr.io / private):

bash
docker tag dsv4-flash-acti-mtp:0.1.0 <your-namespace>/dsv4-flash-acti-mtp:0.1.0
docker push <your-namespace>/dsv4-flash-acti-mtp:0.1.0

The HF repo holds the Dockerfile, entrypoint, patcher script, and a docker-compose example — not the compiled image binary itself (which exceeds practical HF LFS limits). Build it on the host that will serve. See docker/README.md for full details.

Validated 524k serve command (bare-metal vLLM)

bash
vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --served-model-name deepseek-v4-flash deepseek-v4-flash-mtp \
                       DSV4-W4A16-FP8 deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 --block-size 256 \
  --max-model-len 524288 \
  --max-num-seqs 2 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.93 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --host 0.0.0.0 --port 8000

Required env (Acti's working set, includes the small-msg AR latency tuning that drops TTFT from 154 ms to 91 ms on Max-Q):

bash
export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
export NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=0
export NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512

For shorter context with two streams concurrently, drop --max-model-len to 262144 and --max-num-seqs to 2 (still 2× concurrency at 1.7×).

Architecture (unchanged from the base model + MTP head added)

Table with columns: Property, Value
Property	Value
Total parameters	284 B (13 B activated)
Decoder layers	43 + 1 MTP layer
Routed experts / layer	256 (top-K = 6 with `noaux_tc` routing) + 1 shared expert
Hidden size	4096
Routed expert intermediate	2048
Vocab size	129 280
Max position embeddings	1 048 576
`num_nextn_predict_layers`

Quantization scheme (per-tensor)

Table with columns: Component, Format, Method
Component	Format	Method
Routed experts (256 × 43 main layers)	W4A16 INT4 group=128 sym	GPTQ (pasta-paul, `dampening_frac=0.1`)
Routed experts (256 × MTP layer)	W4A16 INT4 group=128 sym	GPTQ (Acti, Frantar-style with Cholesky H⁻¹, damp=0.01)
Attention projections (q/kv/o, compressor, indexer)	FP8_BLOCK 128×128	data-free (carries through from upstream)
MTP attention	FP8_BLOCK 128×128	upstream MX-FP8 → vLLM-format FP8_BLOCK (Acti's renamed scale field)
Shared experts

How the MTP routed-expert GPTQ was done

Boot pasta-paul base model in vLLM with the v1 RTN MTP block + an instrumentation patch that dumps the MTP layer's input arrays (previous_hidden_states, inputs_embeds) to disk per forward call. Run in --enforce-eager (compiled-mode + dynamo doesn't permit the file I/O hook in-graph).
Send 256 ultrachat_200k prompts × max_tokens=256 through the server. ~17.7k MTP forward calls captured, 473k tokens total.
Stop the server. Build a BF16 MTP block in pure PyTorch from the upstream FP8 master (deepseek-ai/DeepSeek-V4-Flash shard 46), dequantizing NVFP4 routed-expert weights via E2M1 lookup × MX-FP8 scale.
Replay each captured (previous_hidden_states, inputs_embeds) through a simplified MTP forward (RMSNorms + h_proj/e_proj real; attention skipped, treated as identity-residual; FFN gate routes tokens to top-K experts; per-expert input activations gathered).
For each of 256 experts × 3 projections (w1/w3/w2):

Per-expert calibration tokens: min=430, max=175,375, mean=11,094. The min is borderline for a 4096-dim Hessian; those experts will be slightly less optimal but the damping factor compensates. The simplification of skipping attention during replay is a pragmatic compromise vs. re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch (it gives a less-perfect input distribution but still beats RTN by a measurable margin).

Acti additions (vs `pastapaul/DeepSeek-V4-Flash-W4A16-FP8`)

MTP weights present. 768 routed-expert tensors (mtp.0.ffn.experts.{0..255}.{w1,w2,w3}) packed in compressed-tensors W4A16 INT4 (group=128 sym), 5 attention projections at FP8_BLOCK with the field rename, 3 BF16 shared experts, and the BF16/FP32 norms / e_proj / h_proj / enorm / hnorm / attn_norm / attn_sink / gate / hc_* tensors. New shard: model-mtp-w4a16.safetensors (3.55 GB).
Updated quantization_config.ignore that excludes the MTP non-quantized layer prefixes (layers.43.{e_proj, h_proj, shared_head.head, shared_experts.{w1,w2,w3}} and mtp_block.* aliases) from the W4A16/FP8 groups while keeping the routed experts in the W4A16 group_1 regex.
vLLM patches (mirror these into your fork; total ~30 lines):

Provenance

This checkpoint was constructed by:

Starting from pastapaul/DeepSeek-V4-Flash-W4A16-FP8 (4 safetensors shards, ~143 GB).
Pulling shard 46 of deepseek-ai/DeepSeek-V4-Flash (the upstream FP8 master) — that shard contains all 1575 mtp.0.* tensors.
For each of the 768 MTP routed-expert tensors: dequantize NVFP4 → BF16, then run Frantar-style GPTQ using calibration activations captured from a live serving of pasta-paul + v1 RTN MTP. Pack via compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32.
For each of the 5 MTP attention projections: keep upstream's FP8 weight as-is, decode the upstream MX-FP8 scale to BF16 and rename .scale → .weight_scale (pasta-paul convention).
For BF16 / FP32 / shared-expert tensors: dequantize FP8 to BF16 where applicable, otherwise pass through.
Write a single new shard model-mtp-w4a16.safetensors, hardlink the 4 base shards into the new dir, write a union safetensors index and an updated config with the rebuilt ignore list.

The full pipeline is reproducible in ~30 minutes total: ~25 min calibration capture (eager-mode serve + 256 prompts), ~3 min Hessian accumulation + GPTQ on a single GPU, ~2 min splice + manifest writeout.

Files in this repo

model-{00001..00004}-of-00004.safetensors — pasta-paul's base shards (W4A16 routed experts, FP8_BLOCK attention, BF16 shared experts) — redistributed under Apache 2.0 with attribution.
model-mtp-w4a16.safetensors — Acti's MTP layer (3.55 GB; GPTQ-quantized in v2).
model.safetensors.index.json — union index.
config.json — base config + rebuilt quantization_config.ignore for MTP submodules.
MTP_QUANT_MANIFEST.{json,md} — declares is_final_quality_preserving: true (real GPTQ).
recipe.yaml — pasta-paul's GPTQ recipe (for the base layers; included for transparency).
tokenizer.json, , — unchanged from base.

Limitations

Hardware: validated only on 2× RTX PRO 6000 Blackwell Max-Q (sm_120). Should also work on RTX PRO 6000 Server, DGX Spark / GB10, and 8× H200 — same code path as base. Without --disable-custom-all-reduce, Max-Q deadlocks at post-graph eager warmup.
TP: TP=2 only. TP=1 OOMs on a single 96 GB GPU; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
MTP quality: GPTQ pipeline used a simplified MTP forward that skipped the attention layer in calibration (used identity-residual). This is a known approximation chosen to avoid re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch — it captures the post-FFN-norm input distribution to each expert, but the attention's contribution to that distribution is approximated. The output produced by the served model is unaffected (the main model verifies every accepted draft); only the MTP draft accept rate is impacted — now measured at 77.7% @128k / 84.9% @262k production (see MTP draft acceptance above). Lift on this hardware vs. v1 RTN: ~+2% decode TPS on long-context profile.
num_speculative_tokens: capped at 1 because DSV4-Flash ships exactly one MTP head (num_nextn_predict_layers=1). Higher values would not produce more draft tokens.
Reasoning parser: with --reasoning-parser deepseek_v4, model output is split into and — applications that read only will see empty strings for "thinking" responses. Adjust accordingly.

Future work

A higher-fidelity GPTQ pass on the MTP block would re-implement DSV4 hybrid attention (CSA+HCA) in pure PyTorch (or hook it directly out of vLLM's compiled forward) so calibration sees the real post-attention residual distribution rather than the identity-residual approximation we use today. Estimated additional lift: ~+5-10% decode TPS, requiring a ~1-day code change. Documented in the project's MTP_LOCAL_REQUANT_STATUS.md follow-ups.

Citation / attribution

If this model helps you, please cite both:

DeepSeek-AI for deepseek-ai/DeepSeek-V4-Flash — the source model.
pasta-paul for pastapaul/DeepSeek-V4-Flash-W4A16-FP8 — the W4A16 GPTQ quantization and the validated jasl/vllm serving stack (repo).
Acti for the MTP-head requantization (RTN v1, GPTQ v2) and the vLLM-MTP loader patches in this repo.

License: Apache 2.0 (matching upstream).

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

LordNeel

Model Tree

Base

canada-quant/DeepSeek-V4-Flash-W4A16-FP8

Quantized

this model

Input Modalities

Text

Output Modalities

Text

Supported Functionality

Dedicated Endpoints

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

Results — single-stream decode TPS, greedy, vs no-MTP baseline

Headline decode TPS — base vs Acti MTP at 524k and 128k single-stream

Table with columns: Profile, Decode TPS, TTFT, Δ vs base
Profile	Decode TPS	TTFT	Δ vs base
Base (pasta-paul, no MTP), 524k	52.85	91 ms	0% (reference)
This model (v2 GPTQ), 524k	85.52	155 ms	+62% (1.62×)
This model (v2 GPTQ), 128k single-stream	~111	~310 ms	+110% (2.10×)

MTP draft acceptance — measured

MTP draft acceptance by workload

Table with columns: Serving profile, Windows, Accepted / drafted, Weighted acceptance, Mean accepted length
Serving profile	Windows	Accepted / drafted	Weighted acceptance	Mean accepted length
General chat, 128k	5	2,354 / 3,030	77.7%	1.78
General chat, 262k (production)	25	15,274 / 17,991	84.9%	1.85
Long-context research variant, 262k¹	3	319 / 357

num_speculative_tokens = 1, so the ceiling for mean accepted length is 2.0; the production profile realizes 1.85 (92.5% of ceiling).
Acceptance tracks next-token predictability: free-form chat sits at ~78–85%, structured/code-heavy generation climbs into the low-to-mid 90s.
Acceptance affects speed only, never output quality. Every accepted draft is verified by the full model, so emitted text is bit-identical to plain greedy decoding without MTP.

Concurrency profiles — real measurements

Aggregate decode TPS by profile

Table with columns: Profile, Max concurrent users, Per-stream TPS at max, Aggregate TPS, Recommended for
Profile	Max concurrent users	Per-stream TPS at max	Aggregate TPS	Recommended for
128k (`--max-model-len 131072 --max-num-seqs 8 --gpu-memory-utilization 0.90`)	8	57	467	High-concurrency chat / agent fleet
256k (`--max-model-len 262144 --max-num-seqs 4 --gpu-memory-utilization 0.95`)	4	76	296	Medium-concurrency, longer docs / RAG

Per-stream scaling within each profile:

Per-stream TPS vs concurrent streams, by profile

Table with columns: N concurrent, 128k profile, 256k profile, 524k profile
N concurrent	128k profile	256k profile	524k profile
1	75	73	87
2	80	75	62
3	(anomalous)	(anomalous)	46
4	76	76	—
8

(The N=3 anomaly is reproducible — likely a vLLM MTP-draft batching artifact at odd counts. Even N values scale cleanly.)

Real-world workload sweep — 15-prompt suite

Real-world workload TPS — 524k vs 640k

Public-benchmark snapshot

Public benchmark accuracy — GSM8K, HumanEval, MMLU

Table with columns: Benchmark, Sample, Score
Benchmark	Sample	Score
GSM8K (T=0, COT-style, exact-match on `#### N`)	100	93.0% (93/100)
HumanEval pass@1 (T=0, greedy, subprocess-exec tests)	164 (full)	96.3% (158/164)
MMLU (T=0, mixed subjects, max_tokens=128 to leave room for reasoning)	100	53.0% (53/100) ¹
Internal capability eval (math/code/reasoning/knowledge/instruction/longform/tools)	14	13/14 (92.9%)

How to run (production canonical, on this hardware family)

vLLM does not load this model on a vanilla install. You need:

The patched vLLM fork: jasl/vllm at b158e5001 (or ds4-sm120-experimental tip), with the Acti MTP patches applied (one-line additions to deepseek_v4_mtp.py — see Patches below).
The DSV4-Flash-FP8 toolchain: kylesayrs/deepseek-ct cherry-pick, the pasta-paul packed_modules_mapping patch, torch 2.11.0+cu128, compressed-tensors >= 0.15.0.1.
On RTX PRO 6000 Max-Q workstation cards (no NVLink): you must pass --disable-custom-all-reduce — vLLM's CustomAllreduce uses CUDA P2P that deadlocks on this PCIe-only topology, independent of NCCL's NCCL_P2P_DISABLE.

A complete reproducible workspace, including all the patches and a one-shot install script, is at:

https://github.com/pasta-paul/dsv4-flash-w4a16-fp8 (base model + serving stack)

Plus the additional Acti changes documented in the "Acti additions" section below.

Docker deployment (recommended for portability)

bash
# 1. Clone the build assets
git clone https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 dsv4
cd dsv4/docker

# 2. Build the image (~25-45 min for CUDA kernel compile)
docker build -t dsv4-flash-acti-mtp:0.1.0 .

# 3. Download the model weights
hf download LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --local-dir $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

# 4. Run (the production 524k config is the default)
docker run --rm --gpus all \
  --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v $HOME/dsv4-models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:/models/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8:ro \
  dsv4-flash-acti-mtp:0.1.0

Switch profile via env vars at run time. Two examples:

bash
# High-concurrency 128k (8 concurrent users, 467 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=131072 -e MAX_NUM_SEQS=8 -e GPU_MEMORY_UTILIZATION=0.90 \
  dsv4-flash-acti-mtp:0.1.0

# Medium 256k (4 concurrent users, 296 tok/s aggregate)
docker run ... \
  -e MAX_MODEL_LEN=262144 -e MAX_NUM_SEQS=4 -e GPU_MEMORY_UTILIZATION=0.95 \
  dsv4-flash-acti-mtp:0.1.0

To publish your build to a registry (Docker Hub / ghcr.io / private):

bash
docker tag dsv4-flash-acti-mtp:0.1.0 <your-namespace>/dsv4-flash-acti-mtp:0.1.0
docker push <your-namespace>/dsv4-flash-acti-mtp:0.1.0

Validated 524k serve command (bare-metal vLLM)

bash
vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
  --served-model-name deepseek-v4-flash deepseek-v4-flash-mtp \
                       DSV4-W4A16-FP8 deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 --block-size 256 \
  --max-model-len 524288 \
  --max-num-seqs 2 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.93 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --host 0.0.0.0 --port 8000

Required env (Acti's working set, includes the small-msg AR latency tuning that drops TTFT from 154 ms to 91 ms on Max-Q):

bash
export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
export NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 NCCL_SHM_DISABLE=0
export NCCL_PROTO=LL NCCL_ALGO=Ring NCCL_MIN_NCHANNELS=8 NCCL_NTHREADS=512

For shorter context with two streams concurrently, drop --max-model-len to 262144 and --max-num-seqs to 2 (still 2× concurrency at 1.7×).

Architecture (unchanged from the base model + MTP head added)

Table with columns: Property, Value
Property	Value
Total parameters	284 B (13 B activated)
Decoder layers	43 + 1 MTP layer
Routed experts / layer	256 (top-K = 6 with `noaux_tc` routing) + 1 shared expert
Hidden size	4096
Routed expert intermediate	2048
Vocab size	129 280
Max position embeddings	1 048 576
`num_nextn_predict_layers`

Quantization scheme (per-tensor)

Table with columns: Component, Format, Method
Component	Format	Method
Routed experts (256 × 43 main layers)	W4A16 INT4 group=128 sym	GPTQ (pasta-paul, `dampening_frac=0.1`)
Routed experts (256 × MTP layer)	W4A16 INT4 group=128 sym	GPTQ (Acti, Frantar-style with Cholesky H⁻¹, damp=0.01)
Attention projections (q/kv/o, compressor, indexer)	FP8_BLOCK 128×128	data-free (carries through from upstream)
MTP attention	FP8_BLOCK 128×128	upstream MX-FP8 → vLLM-format FP8_BLOCK (Acti's renamed scale field)
Shared experts

How the MTP routed-expert GPTQ was done

Boot pasta-paul base model in vLLM with the v1 RTN MTP block + an instrumentation patch that dumps the MTP layer's input arrays (previous_hidden_states, inputs_embeds) to disk per forward call. Run in --enforce-eager (compiled-mode + dynamo doesn't permit the file I/O hook in-graph).
Send 256 ultrachat_200k prompts × max_tokens=256 through the server. ~17.7k MTP forward calls captured, 473k tokens total.
Stop the server. Build a BF16 MTP block in pure PyTorch from the upstream FP8 master (deepseek-ai/DeepSeek-V4-Flash shard 46), dequantizing NVFP4 routed-expert weights via E2M1 lookup × MX-FP8 scale.
Replay each captured (previous_hidden_states, inputs_embeds) through a simplified MTP forward (RMSNorms + h_proj/e_proj real; attention skipped, treated as identity-residual; FFN gate routes tokens to top-K experts; per-expert input activations gathered).
For each of 256 experts × 3 projections (w1/w3/w2):

Acti additions (vs `pastapaul/DeepSeek-V4-Flash-W4A16-FP8`)

MTP weights present. 768 routed-expert tensors (mtp.0.ffn.experts.{0..255}.{w1,w2,w3}) packed in compressed-tensors W4A16 INT4 (group=128 sym), 5 attention projections at FP8_BLOCK with the field rename, 3 BF16 shared experts, and the BF16/FP32 norms / e_proj / h_proj / enorm / hnorm / attn_norm / attn_sink / gate / hc_* tensors. New shard: model-mtp-w4a16.safetensors (3.55 GB).
Updated quantization_config.ignore that excludes the MTP non-quantized layer prefixes (layers.43.{e_proj, h_proj, shared_head.head, shared_experts.{w1,w2,w3}} and mtp_block.* aliases) from the W4A16/FP8 groups while keeping the routed experts in the W4A16 group_1 regex.
vLLM patches (mirror these into your fork; total ~30 lines):

Provenance

This checkpoint was constructed by:

Starting from pastapaul/DeepSeek-V4-Flash-W4A16-FP8 (4 safetensors shards, ~143 GB).
Pulling shard 46 of deepseek-ai/DeepSeek-V4-Flash (the upstream FP8 master) — that shard contains all 1575 mtp.0.* tensors.
For each of the 768 MTP routed-expert tensors: dequantize NVFP4 → BF16, then run Frantar-style GPTQ using calibration activations captured from a live serving of pasta-paul + v1 RTN MTP. Pack via compressed_tensors.compressors.pack_quantized.helpers.pack_to_int32.
For each of the 5 MTP attention projections: keep upstream's FP8 weight as-is, decode the upstream MX-FP8 scale to BF16 and rename .scale → .weight_scale (pasta-paul convention).
For BF16 / FP32 / shared-expert tensors: dequantize FP8 to BF16 where applicable, otherwise pass through.
Write a single new shard model-mtp-w4a16.safetensors, hardlink the 4 base shards into the new dir, write a union safetensors index and an updated config with the rebuilt ignore list.

Files in this repo

model-{00001..00004}-of-00004.safetensors — pasta-paul's base shards (W4A16 routed experts, FP8_BLOCK attention, BF16 shared experts) — redistributed under Apache 2.0 with attribution.
model-mtp-w4a16.safetensors — Acti's MTP layer (3.55 GB; GPTQ-quantized in v2).
model.safetensors.index.json — union index.
config.json — base config + rebuilt quantization_config.ignore for MTP submodules.
MTP_QUANT_MANIFEST.{json,md} — declares is_final_quality_preserving: true (real GPTQ).
recipe.yaml — pasta-paul's GPTQ recipe (for the base layers; included for transparency).
tokenizer.json, , — unchanged from base.

Limitations

Hardware: validated only on 2× RTX PRO 6000 Blackwell Max-Q (sm_120). Should also work on RTX PRO 6000 Server, DGX Spark / GB10, and 8× H200 — same code path as base. Without --disable-custom-all-reduce, Max-Q deadlocks at post-graph eager warmup.
TP: TP=2 only. TP=1 OOMs on a single 96 GB GPU; TP≥4 hits an upstream W4A16 MoE scale-sharding bug (vllm-project/vllm#41511).
MTP quality: GPTQ pipeline used a simplified MTP forward that skipped the attention layer in calibration (used identity-residual). This is a known approximation chosen to avoid re-implementing DSV4 hybrid attention (CSA+HCA) in pure PyTorch — it captures the post-FFN-norm input distribution to each expert, but the attention's contribution to that distribution is approximated. The output produced by the served model is unaffected (the main model verifies every accepted draft); only the MTP draft accept rate is impacted — now measured at 77.7% @128k / 84.9% @262k production (see MTP draft acceptance above). Lift on this hardware vs. v1 RTN: ~+2% decode TPS on long-context profile.
num_speculative_tokens: capped at 1 because DSV4-Flash ships exactly one MTP head (num_nextn_predict_layers=1). Higher values would not produce more draft tokens.
Reasoning parser: with --reasoning-parser deepseek_v4, model output is split into and — applications that read only will see empty strings for "thinking" responses. Adjust accordingly.

Future work

Citation / attribution

If this model helps you, please cite both:

DeepSeek-AI for deepseek-ai/DeepSeek-V4-Flash — the source model.
pasta-paul for pastapaul/DeepSeek-V4-Flash-W4A16-FP8 — the W4A16 GPTQ quantization and the validated jasl/vllm serving stack (repo).
Acti for the MTP-head requantization (RTN v1, GPTQ v2) and the vLLM-MTP loader patches in this repo.

License: Apache 2.0 (matching upstream).

DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8

README

Results — single-stream decode TPS, greedy, vs no-MTP baseline

MTP draft acceptance — measured

Concurrency profiles — real measurements

Real-world workload sweep — 15-prompt suite

Public-benchmark snapshot

How to run (production canonical, on this hardware family)

Docker deployment (recommended for portability)

Validated 524k serve command (bare-metal vLLM)

Architecture (unchanged from the base model + MTP head added)

Quantization scheme (per-tensor)

How the MTP routed-expert GPTQ was done

Acti additions (vs pastapaul/DeepSeek-V4-Flash-W4A16-FP8)

Provenance

Files in this repo

Limitations

Future work

Citation / attribution

Explore FriendliAI today

README

Results — single-stream decode TPS, greedy, vs no-MTP baseline

MTP draft acceptance — measured

Concurrency profiles — real measurements

Real-world workload sweep — 15-prompt suite

Public-benchmark snapshot

How to run (production canonical, on this hardware family)

Docker deployment (recommended for portability)

Validated 524k serve command (bare-metal vLLM)

Architecture (unchanged from the base model + MTP head added)

Quantization scheme (per-tensor)

How the MTP routed-expert GPTQ was done

Acti additions (vs pastapaul/DeepSeek-V4-Flash-W4A16-FP8)

Provenance

Files in this repo

Limitations

Future work

Citation / attribution

Acti additions (vs `pastapaul/DeepSeek-V4-Flash-W4A16-FP8`)

Acti additions (vs `pastapaul/DeepSeek-V4-Flash-W4A16-FP8`)