Compression Details
Table with columns: Field, Value| Field | Value |
|---|
| Weight dtype | NVFP4 (E2M1) |
| Weight group size | 16 |
| Scale dtype | FP8 E4M3, per-group |
| Global scale | FP32, per-tensor |
| Sparsity | Paired 4:8 (NVIDIA Blackwell) |
| Quantized + sparsified layers | Non-shared MoE experts (gate_proj, up_proj, down_proj) |
| Uncompressed | lm_head, self_attn.*, shared_experts.*, router, embeddings |
| Format | compressed-tensors (NVFP4PackedCompressor) |
Paired-4:8 sparsity. Every 8 contiguous elements form 4 pairs of 2; exactly 2 of the 4 pairs are nonzero. The zeroed positions are stored as FP4 zero codes inside weight_packed, so the sparsity structure is implicit — there is no separate bitmask tensor in the file.
Per-linear keys:
weight_packed — FP4 values, full K dimension
weight_scale — FP8 E4M3 per-16 group scales
weight_global_scale — FP32 per-tensor scale
How to Use
vLLM
The recipe below follows the upstream vLLM guide for Kimi-K2.5: https://recipes.vllm.ai/moonshotai/Kimi-K2.5. Refer to that page for advanced options (long context, prefix caching, structured output) and version-specific notes.
uv pip install -U vllm --torch-backend=auto
vllm serve ISTA-DASLab/Kimi-K2.5-P48NVFP4-Preview \
--tensor-parallel-size 4 \
--mm-encoder-tp-mode data \
--trust-remote-code \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2
Flag notes:
--tensor-parallel-size 4 — fits on 4× B200 (192 GB HBM3e each). Bump on smaller GPUs.
--trust-remote-code — required: Kimi-K2.5 ships custom modeling code.
--tool-call-parser kimi_k2, --reasoning-parser kimi_k2 — enable K2.5's tool-use and reasoning chat templates.
--mm-encoder-tp-mode data — multimodal encoder TP mode; harmless for text-only inference.
Then query the OpenAI-compatible endpoint at http://localhost:8000/v1.
Hardware
Evaluation — OpenLLM Leaderboard v1
All evaluations run with lm-evaluation-harness v0.4.11 against a vLLM 0.21.0 server on 4× B200.
Table with columns: Benchmark, Setup, Base (BF16), SparseGPT + GPTQ one-shot, This release, Δ (vs base)| Benchmark | Setup | Base (BF16) | SparseGPT + GPTQ one-shot | This release | Δ (vs base) |
|---|
| ARC-Challenge | acc_norm, 25-shot | 74.23 | 62.54 | 69.20 | −5.03 |
| HellaSwag | acc_norm, 10-shot | 91.86 | 84.90 | 88.71 | −3.15 |
| MMLU | acc, 5-shot |
Recovery: 79.31 / 82.51 = 96.12% of base-model average accuracy.
SparseGPT + GPTQ one-shot baseline. Reference point at the same compression target: SparseGPT picks the paired-4:8 mask, GPTQ quantizes the masked weights to NVFP4 (89.89% recovery).
For questions or open a discussion on this preview, please fill free to reach out to kwanhee.lee@postech.ac.kr.