Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: other

At a glance

Base modelnvidia/Kimi-K2.6-NVFP4
FormatNVFP4
Logical params519.5B
Active / token~31B text-path params
Experts / MoE layer192 routed + 1 shared
Active experts / token8 routed + 1 shared
Layers61 total; layer 0 dense + 60 MoE
Hidden size7168
Context262,144
On-disk size310 GB

Repetition-loop attractors

[!WARNING] This is a REAP keep192 checkpoint — 192 of 384 routed experts per layer were removed (50%, uniform across all 60 MoE layers). It recovers most short-form quality but has a repetition-loop pathology on open-ended / long-form generation: the model starts coherent, then collapses into an endless single-token or short-phrase loop and never terminates.

Observed: on a simple open prompt ("tell me about cats and cat allergens") generations degenerated into loops such as saliva saliva saliva…, allergen allergen…, hairs hairs…, and felis-felis-felis…. The attractor token is prompt-dependent — any unbounded prose can trigger one.

  • Thinking off → the loop appears directly in the answer.
  • Thinking on → the model never closes the reasoning block; it fills the entire context window with un-terminated reasoning and returns empty content.

Sampling does not reliably fix it. Still looping on long-form with temperature=0 + repetition_penalty=1.12 (clean on short structured tasks only), temperature=0.7 + repetition_penalty=1.05, temperature=0.6 with presence/frequency penalties, and repetition_penalty up to 1.15.

Reliable for (validated): structured JSON, tool/function calling, code, math, short Q&A, and agentic/terminal tasks — i.e. bounded outputs. Vision, thinking, tool-calling, and speculative decoding all work in these modes.

Mitigations: keep outputs bounded (sane max_tokens), prefer structured/agentic use, default repetition_penalty≈1.12, temperature=0, and stop generation if a loop starts. The root fix is restoring some pruned experts from the full nvidia/Kimi-K2.6-NVFP4 — a memory/context tradeoff not applied here.

Pruning

The prune plan was generated with the official REAP survivor semantics:

python

experts_to_prune = torch.topk(saliency, n_experts_to_prune, largest=False)
retained_expert_indices = [i for i in range(num_experts) if i not in experts_to_prune]

Source REAP repo: cerebrasresearch/reap at commit 1970473c51ca3caeb98c10392f15b3a08a672974.

Kimi-K2.6's NVFP4 multimodal wrapper is not directly supported by the upstream REAP model registry, so the final safetensor rewrite preserves the official selection/order semantics while using a Kimi-specific NVFP4 shard rewriter. Vision tensors, multimodal projector tensors, tokenizer files, processor files, and chat template files were preserved.

Serving settings

Recommended: vLLM with Eagle3 speculative decoding and the flashinfer_cutlass NVFP4 MoE kernel — the fastest path validated on 4× RTX PRO 6000 Blackwell. Ready-to-run: scripts/serve-vllm.sh.

SettingValue
EnginevLLM 0.19.2rc1 (image voipmonitor/vllm:cu130-mtp-tuned-v3-20260423)
Tensor parallel / decode-context parallel4 / 4
Max model length262,144 (GPU KV ~811k tokens, ~3.09× concurrency)
Max batched tokens8192
Max sequences1
Attention backendTRITON_MLA (FlashInfer-MLA fp8+DCP is unsupported in this build)
KV cache dtypefp8_e4m3
MoE backendflashinfer_cutlass — fastest NVFP4 MoE on SM120, ~+3.6% decode vs cutlass (flashinfer_trtllm is unsupported there)
Speculative decodingEagle3 draft lightseekorg/kimi-k2.6-eagle3-mla, 3 tokens, probabilistic, draft KV fp8
Prefix cachingon (--enable-prefix-caching; ~74% hit on repeated prefixes)
Custom all-reducedisabled (PCIe P2P custom all-reduce hangs on this no-NVLink topology)
Default samplingrepetition_penalty=1.12, temperature=0 via --override-generation-config

Measured single-stream (275W power cap, unchanged): decode ~71 tok/s aggregate (up to ~80 on code/structured), prefill ~2070 tok/s, warm TTFT ~130 ms. Decode is power-bound at 275W (clocks throttle 3090 → ~2840 MHz under load), so the speed comes from a cheaper MoE kernel + speculation, not more power. First boot runs flashinfer autotune (~9 min); a persistent JIT-cache volume makes warm boots ~2.5 min.

Sampling: use repetition_penalty=1.12not 1.05, which reproduces deterministic loops on short structured outputs — and temperature=0 for structured/agentic work. This does not prevent loops on open-ended generation; see Repetition-loop attractors.

Do not use MAX_NUM_BATCHED_TOKENS=4096 for the 256k server. That setting produced deterministic exclamation-mark loops in the 96k-160k context band. Raising it to 8192 fixed the band without changing the checkpoint.

Validation

Endpoint probes passed with the settings above:

ProbeResult
ASCII JSONpass
Mathpass
Unicode echopass
Python codepass
Structured JSONpass
ASCII-only codepass
Tool callpass
Runtime vision smokepass
Longer code / JSON generationpass
Decode degeneracy sweeppass
Executable Python coding canarypass
Unicode math tablepass
Unicode chart code, subprocess-testedpass
Neutral philosophy explanation with Greek/CJK termspass
128k mixed-Unicode JSON/chart/reasoningpass
32k contextpass
64k contextpass
96k, 112k, 120k, 128k, 136k, 160kpass
180k, 200k, 225k, 250kpass
260k near-limit contextpass
128k structured JSON stabilitypass
200k Python code stabilitypass
250k Unicode/math stabilitypass

The runtime vision smoke used a generated image containing K2-192, SUM=166, a red square, and a blue square. The model recovered the text, number, and colored shapes correctly.

The decode degeneracy sweep generated medium-length outputs at short context, 128k, 200k, and 250k. The passing rerun had no repeated-character loop, repeated n-gram loop, duplicate-line loop, Unicode replacement character, or length finish. The 200k code case stopped naturally after 1038 completion tokens. Note: this sweep covers bounded / structured outputs; open-ended free-form generation still degenerates into repetition loops — see Repetition-loop attractors.

The executable coding canary asks for Python solutions and runs them in a subprocess. The passing rerun solved six tasks: two-sum indices, interval merge, record parsing, topological sort with cycle detection, Unicode slugify with Polish character mappings, and an LRU cache.

The Unicode/reasoning canary generated terminal-style Unicode charts, chart rendering code, comparative religion text, comparative philosophy text, and a 128k-token mixed-Unicode JSON response. It found no Unicode replacement characters, mojibake markers, illegal control characters, or true repetition loops. It did surface three copy-fidelity issues that are retained in the trace dataset: Warsaw became Warszaw/Warszawa, Check_Spark_192 became Check_SSpark_192, and Arabic توحيد was emitted as توحید with a Persian/Urdu yeh codepoint.

The near-limit long-context probe passed at 259,943 prompt tokens against the served 262,144-token context limit.

Gate repair was audited by comparing every pruned gate weight and e_score_correction_bias row against the original source checkpoint row chosen by the keep192 REAP plan. All 60 layers matched exactly with max absolute difference 0.0, and the config is repaired to n_routed_experts=192, num_experts_per_tok=8, n_group=1, and topk_group=1.

Trace rows and pruning artifacts are stored in:

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025},
eprint = {2510.13999},
archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

Base

nvidia/Kimi-K2.6-NVFP4

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today