Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

One-command Spark install

Run this on the DGX Spark. HF_TOKEN is only required if the model repo is private or not already cached on the machine.

bash

HF_TOKEN=... bash -lc 'set -euo pipefail; cd /home/sero/spark; rm -rf deepseek-v4-flash-spark-200k; git clone https://github.com/0xSero/deepseek-v4-flash-spark-200k.git; cd deepseek-v4-flash-spark-200k; ./install.sh --profile k160-mtp2-200k --launch'

Do not commit tokens into the repo or a model card. Pass them only through the environment for the one command above.

Exact working profile

The profile lives at configs/k160-mtp2-200k.env in the GitHub repo.

bash

MODEL_REPO=0xSero/DeepSeek-V4-Flash-180B-codex-K160-REAP
MODEL_REVISION=7c360e1cd4a5168099dbc54d16d929bf6df04990
SERVED_MODEL_NAME=deepseek-v4-flash-k160-g27-cutlass451-mtp2
CONTEXT_LENGTH=200000
KV_CACHE_MEMORY_BYTES=6G
MAX_NUM_BATCHED_TOKENS=4096
MAX_NUM_SEQS=1
GPU_MEMORY_UTILIZATION=0.88
WATCHDOG_MIN_AVAILABLE_KB=6291456
KV_CACHE_DTYPE=fp8
ENFORCE_EAGER=0
THINKING=false
SPECULATIVE_CONFIG='{"method":"deepseek_mtp","num_speculative_tokens":2}'
VLLM_ENABLE_DEEPSEEK_V4_SPARSE_MLA_WARMUP=0
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1

The launcher enables DeepSeek V4 tokenizer, reasoning parser, tool-call parser, prefix caching, FP8 KV, MTP speculative decoding, and CUDA graphs. Do not add --enforce-eager; this profile was validated with CUDA graph capture enabled.

Docker runtime

The registry target is:

text

ghcr.io/0xsero/deepseek-v4-flash-spark-vllm:cutlass451-g27

The image lineage is the DGX Spark DeepSeek V4 vLLM build vllm-node-dsv4:latest with vLLM 0.1.dev17016+g27fd665bd.d20260526 and nvidia-cutlass-dsl[cu13]==4.5.1. The installer tags the pulled image as vllm-node-dsv4-cutlass451:latest.

The exact image validated on spark-2822 is:

text

vllm-node-dsv4-cutlass451:latest
sha256:5df60ebb9c10dfb86d5946cae8244adfe65a7fd405401bd542ecf22d5c497a4a

If GHCR anonymous manifest access returns denied, the image has not been package-published yet. In that case the installer uses the already-cached local image or builds vllm-node-dsv4-cutlass451:latest from a local vllm-node-dsv4:latest base.

The repo also carries the runtime patcher used during validation. It applies the nonstandard REAP expert-count router fallback, MXFP4 memory hygiene, optional cute-dsl override hook, and a FlashInfer CUDA IPC libcudart fix. It does not modify model weights.

Validation

Validation was run on spark-2822, a single DGX Spark / GB10 / SM121 machine, on May 27 2026.

Startup evidence:

text

MTP draft model loaded: 39 params
Model loading took 96.66 GiB memory
GPU KV cache size: 537,516 tokens
Maximum concurrency for 200,000 tokens per request: 2.69x
Graph capturing finished in about 20 seconds and used about 1.66 GiB

Full 200K long-needle benchmark:

text

run_dir: /home/sero/spark/benchmarks/deepseek-reap/single-server-sweep/k160-mtp2-200k-mnbt4096-kv6g-20260527T192208Z
prompt_tokens: 186,390
TTFT: 362.573 s
prefill: 514.075 tok/s
decode: 24.378 tok/s
needle_retained: true
watchdog_kill: false

Fixed 200K long-coding benchmark:

text

run_dir: /home/sero/spark/benchmarks/deepseek-reap/single-server-sweep/k160-mtp2-200k-longcoding-fixed-20260527T194241Z
prompt_tokens: 182,112
TTFT: 353.799 s
prefill: 514.733 tok/s
decode: 18.946 tok/s
mentions_off_by_one: true
watchdog_kill: false

Task coverage at 200K included smoke, ASCII, Unicode, Mermaid, code explanation, religion/philosophy prompts, tool-call fidelity, long-needle retrieval, and long-code review. Smoke, ASCII, Unicode, Mermaid, code, religion, tool-call fidelity, and long-needle passed. A few qualitative rubrics missed narrow fields at 128 output tokens, so benchmark prompts should reserve more completion tokens when judging broad reasoning quality.

Why this is the default profile

K160 with MTP2 was the best single-Spark balance found so far: it kept the 200K path alive without a watchdog kill and roughly doubled decode speed versus no-spec in comparable long-context tests. The 6G KV pool and 4096-token prefill chunks leave enough room for the weights, DeepGEMM/CUDA graph workspaces, and activations on a 121 GiB usable-memory GB10 system.

Intended use

This model card is for experimental local inference and reproducibility of the DGX Spark REAP serving recipe. The model is a pruned/quantized DeepSeek V4 Flash derivative; evaluate behavior and license obligations before production use.

Model provider

0xSero

Model tree

Base

deepseek-ai/DeepSeek-V4-Flash

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today