License: MIT
One-command Spark install
Run this on the DGX Spark. HF_TOKEN is only required if the model repo is private or not already cached on the machine.
bash
HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-v4-flash-spark-200k; git clone https://github.com/0xSero/deepseek-v4-flash-spark-200k.git; cd deepseek-v4-flash-spark-200k; ./install.sh --profile k160-mtp2-200k --launch'
Do not commit tokens into the repo or a model card. Pass them only through the environment for the one command above.
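As a guard against accidentally committed tokens, a pre-commit style check along these lines may help. This is a minimal sketch, not part of the repo: the function name `scan_for_hf_tokens` is hypothetical, and the pattern assumes Hugging Face tokens start with `hf_` followed by a long alphanumeric run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repo): list files that appear to
# contain a Hugging Face access token.
# Assumption: tokens look like "hf_" followed by 30+ alphanumeric characters.
scan_for_hf_tokens() {
  local dir="${1:-.}"
  # -r recurse, -I skip binary files, -l print matching file names only,
  # -E extended regex; the .git directory is skipped.
  if grep -rIlE --exclude-dir=.git 'hf_[A-Za-z0-9]{30,}' "$dir"; then
    echo "possible token found; remove it before committing" >&2
    return 1
  fi
  return 0
}
```

Calling `scan_for_hf_tokens .` from the repo root returns non-zero and prints the offending file names if anything matching the pattern is found.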
Exact working profile
The profile lives at configs/k160-mtp2-200k.env in the GitHub repo.
bash
MODEL_REPO=0xSero/DeepSeek-V4-Flash-180B-codex-K160-REAP
MODEL_REVISION=7c360e1cd4a5168099dbc54d16d929bf6df04990
SERVED_MODEL_NAME=deepseek-v4-flash-k160-g27-cutlass451-mtp2
CONTEXT_LENGTH=200000
KV_CACHE_MEMORY_BYTES=6G
MAX_NUM_BATCHED_TOKENS=4096
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.88
WATCHDOG_MIN_AVAILABLE_KB=6291456
KV_CACHE_DTYPE=fp8
ENFORCE_EAGER=0
THINKING=false
SPECULATIVE_CONFIG='{"method":"deepseek_mtp","num_speculative_tokens":2}'
VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, MTP speculative decoding, and CUDA graphs. Do not add --enforce-eager; this profile was validated with CUDA graph capture enabled.
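To make the relationship between the profile and the server flags concrete, the sketch below shows one way the variables could map onto standard `vllm serve` options. It is illustrative only: the real mapping lives in the repo's launcher and may differ, the function name `build_vllm_args` is hypothetical, and variables whose handling is not described here (`KV_CACHE_MEMORY_BYTES`, `WATCHDOG_MIN_AVAILABLE_KB`, `THINKING`, and the `VLLM_*` environment switches) are deliberately left out. The function only prints arguments and does not start a server.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print vllm arguments derived from a profile file,
# one argument per line. Pass a path containing a slash, for example
# ./configs/k160-mtp2-200k.env, so that "." does not search PATH.
build_vllm_args() {
  local env_file="$1"
  (
    # Source inside a subshell so profile variables do not leak to the caller.
    . "$env_file"
    args=(
      serve "$MODEL_REPO"
      --revision "$MODEL_REVISION"
      --served-model-name "$SERVED_MODEL_NAME"
      --max-model-len "$CONTEXT_LENGTH"
      --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
      --max-num-seqs "$MAX_NUM_SEQS"
      --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
      --kv-cache-dtype "$KV_CACHE_DTYPE"
      --speculative-config "$SPECULATIVE_CONFIG"
      --enable-prefix-caching
    )
    # ENFORCE_EAGER=0 keeps CUDA graph capture enabled, as validated.
    if [ "${ENFORCE_EAGER:-0}" = "1" ]; then
      args+=(--enforce-eager)
    fi
    printf '%s\n' "${args[@]}"
  )
}
```

With the profile above, the output includes `--max-model-len` followed by `200000` and no `--enforce-eager` line.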
Docker runtime
The registry target is:
text
ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27
The image lineage is the DGX Spark DeepSeek V4 vLLM build vllm-node-dsv4:latest with vLLM 0.1.dev17016+g27fd665bd.d20260526 and nvidia-cutlass-dsl[cu13]==4.5.1. The installer tags the pulled image as vllm-node-dsv4-cutlass451:latest.
The exact image validated on spark-2822 is:
text
vllm-node-dsv4-cutlass451:latest
sha256:5df60ebb9c10dfb86d5946cae8244adfe65a7fd405401bd542ecf22d5c497a4a
If GHCR anonymous manifest access returns denied, the image has not been package-published yet. In that case the installer uses the already-cached local image or builds vllm-node-dsv4-cutlass451:latest from a local vllm-node-dsv4:latest base.
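The fallback order described above can be sketched as a small decision function. This is an assumption about the logic, not a copy of `install.sh`: the function name `choose_image_source` and its output labels are hypothetical, and it only reports which path would be taken without pulling or building anything.

```shell
#!/usr/bin/env bash
REMOTE_IMAGE="ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27"
LOCAL_IMAGE="vllm-node-dsv4-cutlass451:latest"
BASE_IMAGE="vllm-node-dsv4:latest"

# Hypothetical helper: prints one of pull, cached, build, unavailable.
choose_image_source() {
  if docker manifest inspect "$REMOTE_IMAGE" >/dev/null 2>&1; then
    # The registry manifest is readable, so the image can be pulled.
    echo pull
  elif docker image inspect "$LOCAL_IMAGE" >/dev/null 2>&1; then
    # GHCR access was denied or failed, but a tagged local image exists.
    echo cached
  elif docker image inspect "$BASE_IMAGE" >/dev/null 2>&1; then
    # Only the base image is present; the runtime image must be built from it.
    echo build
  else
    echo unavailable
  fi
}
```

Running `choose_image_source` before the installer gives a quick indication of whether a denied GHCR manifest will block the install on a given machine.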
The repo also carries the runtime patcher used during validation. It applies the nonstandard REAP expert-count router fallback, MXFP4 memory hygiene, optional cute-dsl override hook, and a FlashInfer CUDA IPC libcudart fix. It does not modify model weights.
Validation
Validation was run on spark-2822, a single DGX Spark / GB10 / SM121 machine, on May 27, 2026.
Startup evidence:
text
MTP draft model loaded: 39 params
Model loading took 96.66 GiB memory
GPU KV cache size: 537,516 tokens
Maximum concurrency for 200,000 tokens per request: 2.69x
Graph capturing finished in about 20 seconds and used about 1.66 GiB
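The reported maximum concurrency is the KV cache capacity divided by the per-request context length. A short check of that arithmetic, using the two numbers from the startup log (the helper name `kv_concurrency` is hypothetical):

```shell
#!/usr/bin/env bash
# Ratio of KV cache capacity (tokens) to context length (tokens per request).
kv_concurrency() {
  awk -v kv="$1" -v ctx="$2" 'BEGIN { printf "%.2f\n", kv / ctx }'
}

kv_concurrency 537516 200000   # prints 2.69
```

With `MAX_NUM_SEQS=1` in the profile, only one of those slots is used at a time; the remainder is headroom in the KV pool.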
Full 200K long-needle benchmark:
text
run_dir: /home/sero/spark/benchmarks/deepseek-reap/single-server-sweep/k160-mtp2-200k-mnbt4096-kv6g-20260527T192208Z
prompt_tokens: 186,390
TTFT: 362.573 s
prefill: 514.075 tok/s
decode: 24.378 tok/s
needle_retained: true
watchdog_kill: false
Fixed 200K long-coding benchmark:
text
run_dir: /home/sero/spark/benchmarks/deepseek-reap/single-server-sweep/k160-mtp2-200k-longcoding-fixed-20260527T194241Z
prompt_tokens: 182,112
TTFT: 353.799 s
prefill: 514.733 tok/s
decode: 18.946 tok/s
mentions_off_by_one: true
watchdog_kill: false
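In both runs the reported prefill rate is close to the prompt token count divided by TTFT. This is an observation from the published numbers, not a statement of how the benchmark harness computes the figure; the helper name `prefill_rate` is hypothetical.

```shell
#!/usr/bin/env bash
# Approximate prefill throughput as prompt tokens divided by time to first token.
prefill_rate() {
  awk -v tokens="$1" -v ttft="$2" 'BEGIN { printf "%.1f\n", tokens / ttft }'
}

prefill_rate 186390 362.573   # long-needle run, prints 514.1
prefill_rate 182112 353.799   # long-coding run, prints 514.7
```

Both values agree with the reported 514.075 and 514.733 tok/s to one decimal place, which suggests prefill dominates TTFT at this context length.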
Task coverage at 200K included smoke, ASCII, Unicode, Mermaid, code explanation, religion/philosophy prompts, tool-call fidelity, long-needle retrieval, and long-code review. Smoke, ASCII, Unicode, Mermaid, code, religion, tool-call fidelity, and long-needle passed. A few qualitative rubrics missed narrow fields at 128 output tokens, so benchmark prompts should reserve more completion tokens when judging broad reasoning quality.
Why this is the default profile
K160 with MTP2 was the best single-Spark balance found so far: it kept the 200K path alive without a watchdog kill and roughly doubled decode speed versus no-spec in comparable long-context tests. The 6G KV pool and 4096-token prefill chunks leave enough room for the weights, DeepGEMM/CUDA graph workspaces, and activations on a 121 GiB usable-memory GB10 system.
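A rough memory budget can be assembled from the figures already given. The sketch below is back-of-envelope arithmetic under stated assumptions: `6G` is read as 6 GiB, `GPU_MEMORY_UTILIZATION` is applied to the 121 GiB usable figure, and activations and DeepGEMM workspaces are not itemized because the source gives no numbers for them. The function name `memory_budget` is hypothetical.

```shell
#!/usr/bin/env bash
# Back-of-envelope budget from the startup log and profile values.
memory_budget() {
  awk 'BEGIN {
    weights = 96.66        # GiB, "Model loading took 96.66 GiB memory"
    kv = 6                 # GiB, assuming KV_CACHE_MEMORY_BYTES=6G means 6 GiB
    graphs = 1.66          # GiB, CUDA graph capture
    usable = 121           # GiB, usable memory on the GB10 system
    util = 0.88            # GPU_MEMORY_UTILIZATION
    watchdog_kb = 6291456  # WATCHDOG_MIN_AVAILABLE_KB
    printf "accounted_gib=%.2f\n", weights + kv + graphs
    printf "utilization_cap_gib=%.2f\n", usable * util
    printf "watchdog_floor_gib=%.2f\n", watchdog_kb / 1024 / 1024
  }'
}

memory_budget
```

Under these assumptions the itemized components total 104.32 GiB against a 106.48 GiB utilization cap, and the watchdog threshold corresponds to 6 GiB of available memory.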
Intended use
This model card is for experimental local inference and reproducibility of the DGX Spark REAP serving recipe. The model is a pruned/quantized DeepSeek V4 Flash derivative; evaluate behavior and license obligations before production use.
Model provider: 0xSero
Model tree: base deepseek-ai/DeepSeek-V4-Flash; quantized: this model
Modalities: text input, text output