coolthor/gemma-4-12B-it-FP8-dynamic API & Inference Endpoint

Benchmark (GB10 / DGX Spark, vLLM 0.22.1 native, single-stream decode, warm)

Format	Disk	tok/s (EN/ZH)	Omni
BF16	23 GB	7.7	yes
FP8 dynamic (this)	13 GB	15.9	yes
NVFP4 W4A16	7.7 GB	24.9	yes

If you want the smallest + fastest build, see the sibling NVFP4 weight-only repo. FP8 is the conservative choice (dynamic activations, no calibration, widest kernel support).

Accuracy (MMLU + TMMLU+) — near-lossless on both languages

I scored all three formats on MMLU (English, 57 subjects) and TMMLU+ (Traditional Chinese, 66 subjects) with lm-evaluation-harness, 5-shot, chat template applied, limit=30 (N ≈ 1,710 EN / 1,980 TC, ±~1.0 pt), through transformers:

Format	MMLU (EN)	TMMLU+ (TC)	EN drop	TC drop
BF16	78.30%	47.21%	—	—
FP8 dynamic (this)	77.95%	46.97%	−0.35	−0.24
NVFP4 W4A16	75.56%	41.24%	−2.74	−5.97

FP8 is the accuracy-preserving choice. Near-lossless on both languages (within ~0.4 pt) and symmetric — while weight-only NVFP4 drops Traditional Chinese by ~6 points. If you can spare the extra disk/bandwidth over NVFP4 and care about non-English quality, FP8 is the safer pick. limit=30, single model — indicative. Full writeup.

Quantization recipe

llmcompressor, scheme FP8_DYNAMIC, data-free (no calibration data). Ignore list keeps the head and the multimodal projectors in BF16:

python
QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*embed_vision.*", "re:.*embed_audio.*"])

Serving (vLLM)

Needs vLLM with native Gemma4UnifiedForConditionalGeneration (~0.22.x / main) and the TRITON_ATTN backend (Gemma 4 has heterogeneous head dims: head_dim 256 x 16 = 4096 != hidden 3840):

bash
VLLM_ATTENTION_BACKEND=TRITON_ATTN \
vllm serve coolthor/gemma-4-12B-it-FP8-dynamic --max-model-len 4096

Environment (exact versions — this model is version-sensitive)

This is a brand-new arch on a brand-new GPU, so the toolchain matters more than usual. The versions I actually ran:

Component	Version	Why it matters
vLLM	`0.22.1rc1.dev124` (main, post-PR)	Needs the native `Gemma4UnifiedForConditionalGeneration` class, which only landed around `0.22.x`/main. On an older vLLM it falls back to the generic transformers backend, which mishandles Gemma 4's non-square attention and crashes on `o_proj`.
transformers	`5.10.1`	First release that knows `model_type: gemma4_unified`. Older transformers can't even load the config.
torch	`2.11.0+cu130`	The one that bit me. vLLM main pins `torch==2.10`, but its `_C.abi3.so` was compiled against 2.11+cu130 — installing the pinned 2.10 (and pip silently pulling the CPU wheel on arm64) gives an `undefined symbol` import error and a CPU-only build. Force-align: `pip install --force-reinstall --no-deps torch==2.11.0 --index-url https://download.pytorch.org/whl/cu130`.
compressed-tensors	bundled with llmcompressor	Reads the FP8 weight format.
GPU / arch	DGX Spark GB10, `sm_121a`, CUDA 13.x	The torch-ABI dance above is specific to building vLLM from source for sm_121.
Attention backend	`VLLM_ATTENTION_BACKEND=TRITON_ATTN`	Required, not optional — see the head-dim note above.

On a normal CUDA GPU (Hopper/Ada/Blackwell desktop) you don't need the torch-ABI overlay — that pain is specific to building vLLM from source for sm_121. A recent pip install vllm (with the native Gemma4Unified class) plus VLLM_ATTENTION_BACKEND=TRITON_ATTN is enough. FP8 is data-free, so the quantization recipe above is the whole reproduce step — no calibration set needed.

Validation (GB10, transformers)

Text: coherent EN + ZH.
Image: accurately described a studio-podcast photo (cat + Shiba Inu, headphones, studio mics, latte-art mug).
Audio: understood a LibriSpeech clip (Mr. Quilter / apostle / middle classes).
Video: correctly described a night-street clip.

Credits

Base model: google/gemma-4-12B-it (Apache 2.0, Google DeepMind)
Quantization: llmcompressor + compressed-tensors
Quantized & benchmarked by coolthor on a DGX Spark (GB10)

Support

One-person effort on a single DGX Spark, no sponsor. If it saved you time, a coffee ☕ is appreciated.

gemma-4-12B-it-FP8-dynamic

Get help setting up a custom Dedicated Endpoints.

README