Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: otherAt a glance
| Base model | nvidia/Kimi-K2.6-NVFP4 |
| Format | NVFP4 |
| Logical params | 519.5B |
| Active / token | ~31B text-path params |
| Experts / MoE layer | 192 routed + 1 shared |
| Active experts / token | 8 routed + 1 shared |
| Layers | 61 total; layer 0 dense + 60 MoE |
| Hidden size | 7168 |
| Context | 262,144 |
| On-disk size | 310 GB |
Repetition-loop attractors
[!WARNING] This is a REAP keep192 checkpoint — 192 of 384 routed experts per layer were removed (50%, uniform across all 60 MoE layers). It recovers most short-form quality but has a repetition-loop pathology on open-ended / long-form generation: the model starts coherent, then collapses into an endless single-token or short-phrase loop and never terminates.
Observed: on a simple open prompt ("tell me about cats and cat allergens") generations degenerated into loops such as saliva saliva saliva…, allergen allergen…, hairs hairs…, and felis-felis-felis…. The attractor token is prompt-dependent — any unbounded prose can trigger one.
- Thinking off → the loop appears directly in the answer.
- Thinking on → the model never closes the reasoning block; it fills the entire context window with un-terminated reasoning and returns empty
content.
Sampling does not reliably fix it. Still looping on long-form with temperature=0 + repetition_penalty=1.12 (clean on short structured tasks only), temperature=0.7 + repetition_penalty=1.05, temperature=0.6 with presence/frequency penalties, and repetition_penalty up to 1.15.
Reliable for (validated): structured JSON, tool/function calling, code, math, short Q&A, and agentic/terminal tasks — i.e. bounded outputs. Vision, thinking, tool-calling, and speculative decoding all work in these modes.
Mitigations: keep outputs bounded (sane max_tokens), prefer structured/agentic use, default repetition_penalty≈1.12, temperature=0, and stop generation if a loop starts. The root fix is restoring some pruned experts from the full nvidia/Kimi-K2.6-NVFP4 — a memory/context tradeoff not applied here.
Pruning
The prune plan was generated with the official REAP survivor semantics:
python
experts_to_prune = torch.topk(saliency, n_experts_to_prune, largest=False)retained_expert_indices = [i for i in range(num_experts) if i not in experts_to_prune]
Source REAP repo: cerebrasresearch/reap at commit 1970473c51ca3caeb98c10392f15b3a08a672974.
Kimi-K2.6's NVFP4 multimodal wrapper is not directly supported by the upstream REAP model registry, so the final safetensor rewrite preserves the official selection/order semantics while using a Kimi-specific NVFP4 shard rewriter. Vision tensors, multimodal projector tensors, tokenizer files, processor files, and chat template files were preserved.
Serving settings
Recommended: vLLM with Eagle3 speculative decoding and the flashinfer_cutlass NVFP4 MoE kernel — the fastest path validated on 4× RTX PRO 6000 Blackwell. Ready-to-run: scripts/serve-vllm.sh.
| Setting | Value |
|---|---|
| Engine | vLLM 0.19.2rc1 (image voipmonitor/vllm:cu130-mtp-tuned-v3-20260423) |
| Tensor parallel / decode-context parallel | 4 / 4 |
| Max model length | 262,144 (GPU KV ~811k tokens, ~3.09× concurrency) |
| Max batched tokens | 8192 |
| Max sequences | 1 |
| Attention backend | TRITON_MLA (FlashInfer-MLA fp8+DCP is unsupported in this build) |
| KV cache dtype | fp8_e4m3 |
| MoE backend | flashinfer_cutlass — fastest NVFP4 MoE on SM120, ~+3.6% decode vs cutlass (flashinfer_trtllm is unsupported there) |
| Speculative decoding | Eagle3 draft lightseekorg/kimi-k2.6-eagle3-mla, 3 tokens, probabilistic, draft KV fp8 |
| Prefix caching | on (--enable-prefix-caching; ~74% hit on repeated prefixes) |
| Custom all-reduce | disabled (PCIe P2P custom all-reduce hangs on this no-NVLink topology) |
| Default sampling | repetition_penalty=1.12, temperature=0 via --override-generation-config |
Measured single-stream (275W power cap, unchanged): decode ~71 tok/s aggregate (up to ~80 on code/structured), prefill ~2070 tok/s, warm TTFT ~130 ms. Decode is power-bound at 275W (clocks throttle 3090 → ~2840 MHz under load), so the speed comes from a cheaper MoE kernel + speculation, not more power. First boot runs flashinfer autotune (~9 min); a persistent JIT-cache volume makes warm boots ~2.5 min.
Sampling: use repetition_penalty=1.12 — not 1.05, which reproduces deterministic loops on short structured outputs — and temperature=0 for structured/agentic work. This does not prevent loops on open-ended generation; see Repetition-loop attractors.
Do not use MAX_NUM_BATCHED_TOKENS=4096 for the 256k server. That setting produced deterministic exclamation-mark loops in the 96k-160k context band. Raising it to 8192 fixed the band without changing the checkpoint.
Validation
Endpoint probes passed with the settings above:
| Probe | Result |
|---|---|
| ASCII JSON | pass |
| Math | pass |
| Unicode echo | pass |
| Python code | pass |
| Structured JSON | pass |
| ASCII-only code | pass |
| Tool call | pass |
| Runtime vision smoke | pass |
| Longer code / JSON generation | pass |
| Decode degeneracy sweep | pass |
| Executable Python coding canary | pass |
| Unicode math table | pass |
| Unicode chart code, subprocess-tested | pass |
| Neutral philosophy explanation with Greek/CJK terms | pass |
| 128k mixed-Unicode JSON/chart/reasoning | pass |
| 32k context | pass |
| 64k context | pass |
| 96k, 112k, 120k, 128k, 136k, 160k | pass |
| 180k, 200k, 225k, 250k | pass |
| 260k near-limit context | pass |
| 128k structured JSON stability | pass |
| 200k Python code stability | pass |
| 250k Unicode/math stability | pass |
The runtime vision smoke used a generated image containing K2-192,
SUM=166, a red square, and a blue square. The model recovered the text,
number, and colored shapes correctly.
The decode degeneracy sweep generated medium-length outputs at short context, 128k, 200k, and 250k. The passing rerun had no repeated-character loop, repeated n-gram loop, duplicate-line loop, Unicode replacement character, or length finish. The 200k code case stopped naturally after 1038 completion tokens. Note: this sweep covers bounded / structured outputs; open-ended free-form generation still degenerates into repetition loops — see Repetition-loop attractors.
The executable coding canary asks for Python solutions and runs them in a subprocess. The passing rerun solved six tasks: two-sum indices, interval merge, record parsing, topological sort with cycle detection, Unicode slugify with Polish character mappings, and an LRU cache.
The Unicode/reasoning canary generated terminal-style Unicode charts, chart
rendering code, comparative religion text, comparative philosophy text, and a
128k-token mixed-Unicode JSON response. It found no Unicode replacement
characters, mojibake markers, illegal control characters, or true repetition
loops. It did surface three copy-fidelity issues that are retained in the trace
dataset: Warsaw became Warszaw/Warszawa, Check_Spark_192 became
Check_SSpark_192, and Arabic توحيد was emitted as توحید with a
Persian/Urdu yeh codepoint.
The near-limit long-context probe passed at 259,943 prompt tokens against the served 262,144-token context limit.
Gate repair was audited by comparing every pruned gate weight and
e_score_correction_bias row against the original source checkpoint row chosen
by the keep192 REAP plan. All 60 layers matched exactly with max absolute
difference 0.0, and the config is repaired to n_routed_experts=192,
num_experts_per_tok=8, n_group=1, and topk_group=1.
Trace rows and pruning artifacts are stored in:
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},year = {2025},eprint = {2510.13999},archivePrefix = {arXiv}}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider
0xSero
Model tree
Base
nvidia/Kimi-K2.6-NVFP4
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information