0xSero

Kimi-K2.6-519B-NVFP4

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

At a glance

Table

Base model	nvidia/Kimi-K2.6-NVFP4
Format	NVFP4
Logical params	519.5B
Active / token	~31B text-path params
Experts / MoE layer	192 routed + 1 shared
Active experts / token	8 routed + 1 shared
Layers	61 total; layer 0 dense + 60 MoE
Hidden size	7168
Context	262,144
On-disk size	310 GB

Repetition-loop attractors

[!WARNING] This is a REAP keep192 checkpoint — 192 of 384 routed experts per layer were removed (50%, uniform across all 60 MoE layers). It recovers most short-form quality but has a repetition-loop pathology on open-ended / long-form generation: the model starts coherent, then collapses into an endless single-token or short-phrase loop and never terminates.

Observed: on a simple open prompt ("tell me about cats and cat allergens") generations degenerated into loops such as saliva saliva saliva…, allergen allergen…, hairs hairs…, and felis-felis-felis…. The attractor token is prompt-dependent — any unbounded prose can trigger one.

Thinking off → the loop appears directly in the answer.
Thinking on → the model never closes the reasoning block; it fills the entire context window with un-terminated reasoning and returns empty content.

Sampling does not reliably fix it. Still looping on long-form with temperature=0 + repetition_penalty=1.12 (clean on short structured tasks only), temperature=0.7 + repetition_penalty=1.05, temperature=0.6 with presence/frequency penalties, and repetition_penalty up to 1.15.

Reliable for (validated): structured JSON, tool/function calling, code, math, short Q&A, and agentic/terminal tasks — i.e. bounded outputs. Vision, thinking, tool-calling, and speculative decoding all work in these modes.

Mitigations: keep outputs bounded (sane max_tokens), prefer structured/agentic use, default repetition_penalty≈1.12, temperature=0, and stop generation if a loop starts. The root fix is restoring some pruned experts from the full nvidia/Kimi-K2.6-NVFP4 — a memory/context tradeoff not applied here.

Pruning

The prune plan was generated with the official REAP survivor semantics:

python
experts_to_prune = torch.topk(saliency, n_experts_to_prune, largest=False)
retained_expert_indices = [i for i in range(num_experts) if i not in experts_to_prune]

Source REAP repo: cerebrasresearch/reap at commit 1970473c51ca3caeb98c10392f15b3a08a672974.

Kimi-K2.6's NVFP4 multimodal wrapper is not directly supported by the upstream REAP model registry, so the final safetensor rewrite preserves the official selection/order semantics while using a Kimi-specific NVFP4 shard rewriter. Vision tensors, multimodal projector tensors, tokenizer files, processor files, and chat template files were preserved.

Serving settings

Recommended: vLLM with Eagle3 speculative decoding and the flashinfer_cutlass NVFP4 MoE kernel — the fastest path validated on 4× RTX PRO 6000 Blackwell. Ready-to-run: scripts/serve-vllm.sh.

Table with columns: Setting, Value
Setting	Value
Engine	vLLM `0.19.2rc1` (image `voipmonitor/vllm:cu130-mtp-tuned-v3-20260423`)
Tensor parallel / decode-context parallel	4 / 4
Max model length	262,144 (GPU KV ~811k tokens, ~3.09× concurrency)
Max batched tokens	8192
Max sequences	1
Attention backend	`TRITON_MLA` (FlashInfer-MLA fp8+DCP is unsupported in this build)
KV cache dtype

Measured single-stream (275W power cap, unchanged): decode ~71 tok/s aggregate (up to ~80 on code/structured), prefill ~2070 tok/s, warm TTFT ~130 ms. Decode is power-bound at 275W (clocks throttle 3090 → ~2840 MHz under load), so the speed comes from a cheaper MoE kernel + speculation, not more power. First boot runs flashinfer autotune (~9 min); a persistent JIT-cache volume makes warm boots ~2.5 min.

Sampling: use repetition_penalty=1.12 — not 1.05, which reproduces deterministic loops on short structured outputs — and temperature=0 for structured/agentic work. This does not prevent loops on open-ended generation; see Repetition-loop attractors.

Do not use MAX_NUM_BATCHED_TOKENS=4096 for the 256k server. That setting produced deterministic exclamation-mark loops in the 96k-160k context band. Raising it to 8192 fixed the band without changing the checkpoint.

Validation

Endpoint probes passed with the settings above:

Table with columns: Probe, Result
Probe	Result
ASCII JSON	pass
Math	pass
Unicode echo	pass
Python code	pass
Structured JSON	pass
ASCII-only code	pass
Tool call	pass
Runtime vision smoke	pass
Longer code / JSON generation	pass

The runtime vision smoke used a generated image containing K2-192, SUM=166, a red square, and a blue square. The model recovered the text, number, and colored shapes correctly.

The decode degeneracy sweep generated medium-length outputs at short context, 128k, 200k, and 250k. The passing rerun had no repeated-character loop, repeated n-gram loop, duplicate-line loop, Unicode replacement character, or length finish. The 200k code case stopped naturally after 1038 completion tokens. Note: this sweep covers bounded / structured outputs; open-ended free-form generation still degenerates into repetition loops — see Repetition-loop attractors.

The executable coding canary asks for Python solutions and runs them in a subprocess. The passing rerun solved six tasks: two-sum indices, interval merge, record parsing, topological sort with cycle detection, Unicode slugify with Polish character mappings, and an LRU cache.

The Unicode/reasoning canary generated terminal-style Unicode charts, chart rendering code, comparative religion text, comparative philosophy text, and a 128k-token mixed-Unicode JSON response. It found no Unicode replacement characters, mojibake markers, illegal control characters, or true repetition loops. It did surface three copy-fidelity issues that are retained in the trace dataset: Warsaw became Warszaw/Warszawa, Check_Spark_192 became Check_SSpark_192, and Arabic توحيد was emitted as توحید with a Persian/Urdu yeh codepoint.

The near-limit long-context probe passed at 259,943 prompt tokens against the served 262,144-token context limit.

Gate repair was audited by comparing every pruned gate weight and e_score_correction_bias row against the original source checkpoint row chosen by the keep192 REAP plan. All 60 layers matched exactly with max absolute difference 0.0, and the config is repaired to n_routed_experts=192, num_experts_per_tok=8, n_group=1, and topk_group=1.

Trace rows and pruning artifacts are stored in:

0xSero/kimi-k2-6-nvfp4-reap-keep192-endpoint-benchmark-traces-v1

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}

Explore FriendliAI today

Get started Talk to an engineer

At a glance

Table

Base model	nvidia/Kimi-K2.6-NVFP4
Format	NVFP4
Logical params	519.5B
Active / token	~31B text-path params
Experts / MoE layer	192 routed + 1 shared
Active experts / token	8 routed + 1 shared
Layers	61 total; layer 0 dense + 60 MoE
Hidden size	7168
Context	262,144
On-disk size	310 GB

Repetition-loop attractors

[!WARNING] This is a REAP keep192 checkpoint — 192 of 384 routed experts per layer were removed (50%, uniform across all 60 MoE layers). It recovers most short-form quality but has a repetition-loop pathology on open-ended / long-form generation: the model starts coherent, then collapses into an endless single-token or short-phrase loop and never terminates.

Thinking off → the loop appears directly in the answer.
Thinking on → the model never closes the reasoning block; it fills the entire context window with un-terminated reasoning and returns empty content.

Pruning

The prune plan was generated with the official REAP survivor semantics:

python
experts_to_prune = torch.topk(saliency, n_experts_to_prune, largest=False)
retained_expert_indices = [i for i in range(num_experts) if i not in experts_to_prune]

Source REAP repo: cerebrasresearch/reap at commit 1970473c51ca3caeb98c10392f15b3a08a672974.

Serving settings

Table with columns: Setting, Value
Setting	Value
Engine	vLLM `0.19.2rc1` (image `voipmonitor/vllm:cu130-mtp-tuned-v3-20260423`)
Tensor parallel / decode-context parallel	4 / 4
Max model length	262,144 (GPU KV ~811k tokens, ~3.09× concurrency)
Max batched tokens	8192
Max sequences	1
Attention backend	`TRITON_MLA` (FlashInfer-MLA fp8+DCP is unsupported in this build)
KV cache dtype

Validation

Endpoint probes passed with the settings above:

Table with columns: Probe, Result
Probe	Result
ASCII JSON	pass
Math	pass
Unicode echo	pass
Python code	pass
Structured JSON	pass
ASCII-only code	pass
Tool call	pass
Runtime vision smoke	pass
Longer code / JSON generation	pass

The runtime vision smoke used a generated image containing K2-192, SUM=166, a red square, and a blue square. The model recovered the text, number, and colored shapes correctly.

The near-limit long-context probe passed at 259,943 prompt tokens against the served 262,144-token context limit.

Trace rows and pruning artifacts are stored in:

0xSero/kimi-k2-6-nvfp4-reap-keep192-endpoint-benchmark-traces-v1

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}

Kimi-K2.6-519B-NVFP4

Get help setting up a custom Dedicated Endpoints.

README

At a glance

Repetition-loop attractors

Pruning

Serving settings

Validation

License & citation

Sponsors

Explore FriendliAI today

README

At a glance

Repetition-loop attractors

Pruning

Serving settings

Validation

License & citation

Sponsors