At a glance
Table | |
|---|
| Base model | nvidia/Kimi-K2.6-NVFP4 |
| Format | NVFP4 |
| Logical params | 519.5B |
| Active / token | ~31B text-path params |
| Experts / MoE layer | 192 routed + 1 shared |
| Active experts / token | 8 routed + 1 shared |
| Layers | 61 total; layer 0 dense + 60 MoE |
| Hidden size | 7168 |
| Context | 262,144 |
| On-disk size | 310 GB |
Repetition-loop attractors
[!WARNING]
This is a REAP keep192 checkpoint — 192 of 384 routed experts per layer were removed (50%, uniform across all 60 MoE layers). It recovers most short-form quality but has a repetition-loop pathology on open-ended / long-form generation: the model starts coherent, then collapses into an endless single-token or short-phrase loop and never terminates.
Observed: on a simple open prompt ("tell me about cats and cat allergens") generations degenerated into loops such as saliva saliva saliva…, allergen allergen…, hairs hairs…, and felis-felis-felis…. The attractor token is prompt-dependent — any unbounded prose can trigger one.
- Thinking off → the loop appears directly in the answer.
- Thinking on → the model never closes the reasoning block; it fills the entire context window with un-terminated reasoning and returns empty
content.
Sampling does not reliably fix it. Still looping on long-form with temperature=0 + repetition_penalty=1.12 (clean on short structured tasks only), temperature=0.7 + repetition_penalty=1.05, temperature=0.6 with presence/frequency penalties, and repetition_penalty up to 1.15.
Reliable for (validated): structured JSON, tool/function calling, code, math, short Q&A, and agentic/terminal tasks — i.e. bounded outputs. Vision, thinking, tool-calling, and speculative decoding all work in these modes.
Mitigations: keep outputs bounded (sane max_tokens), prefer structured/agentic use, default repetition_penalty≈1.12, temperature=0, and stop generation if a loop starts. The root fix is restoring some pruned experts from the full nvidia/Kimi-K2.6-NVFP4 — a memory/context tradeoff not applied here.
Pruning
The prune plan was generated with the official REAP survivor semantics:
experts_to_prune = torch.topk(saliency, n_experts_to_prune, largest=False)
retained_expert_indices = [i for i in range(num_experts) if i not in experts_to_prune]
Source REAP repo: cerebrasresearch/reap at commit 1970473c51ca3caeb98c10392f15b3a08a672974.
Kimi-K2.6's NVFP4 multimodal wrapper is not directly supported by the upstream REAP model registry, so the final safetensor rewrite preserves the official selection/order semantics while using a Kimi-specific NVFP4 shard rewriter. Vision tensors, multimodal projector tensors, tokenizer files, processor files, and chat template files were preserved.
Serving settings
Recommended: vLLM with Eagle3 speculative decoding and the flashinfer_cutlass NVFP4 MoE kernel — the fastest path validated on 4× RTX PRO 6000 Blackwell. Ready-to-run: scripts/serve-vllm.sh.
Table with columns: Setting, Value| Setting | Value |
|---|
| Engine | vLLM 0.19.2rc1 (image voipmonitor/vllm:cu130-mtp-tuned-v3-20260423) |
| Tensor parallel / decode-context parallel | 4 / 4 |
| Max model length | 262,144 (GPU KV ~811k tokens, ~3.09× concurrency) |
| Max batched tokens | 8192 |
| Max sequences | 1 |
| Attention backend | TRITON_MLA (FlashInfer-MLA fp8+DCP is unsupported in this build) |
| KV cache dtype |
Measured single-stream (275W power cap, unchanged): decode ~71 tok/s aggregate (up to ~80 on code/structured), prefill ~2070 tok/s, warm TTFT ~130 ms. Decode is power-bound at 275W (clocks throttle 3090 → ~2840 MHz under load), so the speed comes from a cheaper MoE kernel + speculation, not more power. First boot runs flashinfer autotune (~9 min); a persistent JIT-cache volume makes warm boots ~2.5 min.
Sampling: use repetition_penalty=1.12 — not 1.05, which reproduces deterministic loops on short structured outputs — and temperature=0 for structured/agentic work. This does not prevent loops on open-ended generation; see Repetition-loop attractors.
Do not use MAX_NUM_BATCHED_TOKENS=4096 for the 256k server. That setting produced deterministic exclamation-mark loops in the 96k-160k context band. Raising it to 8192 fixed the band without changing the checkpoint.
Validation
Endpoint probes passed with the settings above:
Table with columns: Probe, Result| Probe | Result |
|---|
| ASCII JSON | pass |
| Math | pass |
| Unicode echo | pass |
| Python code | pass |
| Structured JSON | pass |
| ASCII-only code | pass |
| Tool call | pass |
| Runtime vision smoke | pass |
| Longer code / JSON generation | pass |
The runtime vision smoke used a generated image containing K2-192,
SUM=166, a red square, and a blue square. The model recovered the text,
number, and colored shapes correctly.
The decode degeneracy sweep generated medium-length outputs at short context,
128k, 200k, and 250k. The passing rerun had no repeated-character loop,
repeated n-gram loop, duplicate-line loop, Unicode replacement character, or
length finish. The 200k code case stopped naturally after 1038 completion
tokens. Note: this sweep covers bounded / structured outputs; open-ended
free-form generation still degenerates into repetition loops — see
Repetition-loop attractors.
The executable coding canary asks for Python solutions and runs them in a
subprocess. The passing rerun solved six tasks: two-sum indices, interval
merge, record parsing, topological sort with cycle detection, Unicode slugify
with Polish character mappings, and an LRU cache.
The Unicode/reasoning canary generated terminal-style Unicode charts, chart
rendering code, comparative religion text, comparative philosophy text, and a
128k-token mixed-Unicode JSON response. It found no Unicode replacement
characters, mojibake markers, illegal control characters, or true repetition
loops. It did surface three copy-fidelity issues that are retained in the trace
dataset: Warsaw became Warszaw/Warszawa, Check_Spark_192 became
Check_SSpark_192, and Arabic توحيد was emitted as توحید with a
Persian/Urdu yeh codepoint.
The near-limit long-context probe passed at 259,943 prompt tokens against the
served 262,144-token context limit.
Gate repair was audited by comparing every pruned gate weight and
e_score_correction_bias row against the original source checkpoint row chosen
by the keep192 REAP plan. All 60 layers matched exactly with max absolute
difference 0.0, and the config is repaired to n_routed_experts=192,
num_experts_per_tok=8, n_group=1, and topk_group=1.
Trace rows and pruning artifacts are stored in:
License & citation
License inherited from the base model.
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025},
eprint = {2510.13999},
archivePrefix = {arXiv}
}
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.