ISTA-DASLab

Kimi-K2.5-P48NVFP4-Preview

README

License: modified-mit

Compression Details

Table with columns: Field, Value
Field	Value
Weight dtype	NVFP4 (E2M1)
Weight group size	16
Scale dtype	FP8 E4M3, per-group
Global scale	FP32, per-tensor
Sparsity	Paired 4:8 (NVIDIA Blackwell)
Quantized + sparsified layers	Non-shared MoE experts (`gate_proj`, `up_proj`, `down_proj`)
Uncompressed	`lm_head`, `self_attn.`, `shared_experts.`, router, embeddings
Format	`compressed-tensors` (`NVFP4PackedCompressor`)

Paired-4:8 sparsity. Every 8 contiguous elements form 4 pairs of 2; exactly 2 of the 4 pairs are nonzero. The zeroed positions are stored as FP4 zero codes inside weight_packed, so the sparsity structure is implicit — there is no separate bitmask tensor in the file.

Per-linear keys:

weight_packed — FP4 values, full K dimension
weight_scale — FP8 E4M3 per-16 group scales
weight_global_scale — FP32 per-tensor scale

How to Use

vLLM

The recipe below follows the upstream vLLM guide for Kimi-K2.5: https://recipes.vllm.ai/moonshotai/Kimi-K2.5. Refer to that page for advanced options (long context, prefix caching, structured output) and version-specific notes.

bash
uv pip install -U vllm --torch-backend=auto

vllm serve ISTA-DASLab/Kimi-K2.5-P48NVFP4-Preview \
    --tensor-parallel-size 4 \
    --mm-encoder-tp-mode data \
    --trust-remote-code \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2

Flag notes:

--tensor-parallel-size 4 — fits on 4× B200 (192 GB HBM3e each). Bump on smaller GPUs.
--trust-remote-code — required: Kimi-K2.5 ships custom modeling code.
--tool-call-parser kimi_k2, --reasoning-parser kimi_k2 — enable K2.5's tool-use and reasoning chat templates.
--mm-encoder-tp-mode data — multimodal encoder TP mode; harmless for text-only inference.

Then query the OpenAI-compatible endpoint at http://localhost:8000/v1.

Hardware

Tested on: 4xB200.

Evaluation — OpenLLM Leaderboard v1

All evaluations run with lm-evaluation-harness v0.4.11 against a vLLM 0.21.0 server on 4× B200.

Table with columns: Benchmark, Setup, Base (BF16), SparseGPT + GPTQ one-shot, This release, Δ (vs base)
Benchmark	Setup	Base (BF16)	SparseGPT + GPTQ one-shot	This release	Δ (vs base)
ARC-Challenge	acc_norm, 25-shot	74.23	62.54	69.20	−5.03
HellaSwag	acc_norm, 10-shot	91.86	84.90	88.71	−3.15
MMLU	acc, 5-shot

Recovery: 79.31 / 82.51 = 96.12% of base-model average accuracy.

SparseGPT + GPTQ one-shot baseline. Reference point at the same compression target: SparseGPT picks the paired-4:8 mask, GPTQ quantizes the masked weights to NVFP4 (89.89% recovery).

Contact

For questions or open a discussion on this preview, please fill free to reach out to kwanhee.lee@postech.ac.kr.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

ISTA-DASLab

Model Tree

Base

moonshotai/Kimi-K2.5

Quantized

this model

Input Modalities

Text

Output Modalities