License: apache-2.0
Benchmark (GB10 / DGX Spark, vLLM 0.22.1 native, single-stream decode, warm)
| Format | Disk | tok/s (EN/ZH) | Omni |
|---|---|---|---|
| BF16 | 23 GB | 7.7 | yes |
| FP8 dynamic (this) | 13 GB | 15.9 | yes |
| NVFP4 W4A16 | 7.7 GB | 24.9 | yes |
If you want the smallest + fastest build, see the sibling NVFP4 weight-only repo. FP8 is the conservative choice (dynamic activations, no calibration, widest kernel support).
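To put the table in relative terms, the short sketch below recomputes decode speedup and on-disk shrink factor against BF16. The figures are copied from the table; the dictionary layout and function name are only illustrative.

```python
# Figures copied from the benchmark table above (GB10, single-stream decode).
formats = {
    "BF16": {"disk_gb": 23.0, "tok_s": 7.7},
    "FP8 dynamic": {"disk_gb": 13.0, "tok_s": 15.9},
    "NVFP4 W4A16": {"disk_gb": 7.7, "tok_s": 24.9},
}

def relative_to_baseline(formats, baseline="BF16"):
    """Return decode speedup and disk shrink factor of each format vs. the baseline."""
    base = formats[baseline]
    return {
        name: {
            "speedup": round(f["tok_s"] / base["tok_s"], 2),
            "disk_ratio": round(base["disk_gb"] / f["disk_gb"], 2),
        }
        for name, f in formats.items()
    }

ratios = relative_to_baseline(formats)
for name, r in ratios.items():
    print(f"{name}: {r['speedup']}x decode speed, {r['disk_ratio']}x smaller on disk")
```

On these numbers FP8 roughly doubles decode throughput at a bit over half the BF16 disk size, and NVFP4 roughly triples both.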
Accuracy (MMLU + TMMLU+) — near-lossless on both languages
I scored all three formats on MMLU (English, 57 subjects) and TMMLU+ (Traditional Chinese, 66 subjects) with lm-evaluation-harness, 5-shot, chat template applied, limit=30 (N ≈ 1,710 EN / 1,980 TC, ±~1.0 pt), through transformers:
| Format | MMLU (EN) | TMMLU+ (TC) | EN drop | TC drop |
|---|---|---|---|---|
| BF16 | 78.30% | 47.21% | — | — |
| FP8 dynamic (this) | 77.95% | 46.97% | −0.35 | −0.24 |
| NVFP4 W4A16 | 75.56% | 41.24% | −2.74 | −5.97 |
FP8 is the accuracy-preserving choice. It is near-lossless on both languages (within ~0.4 pt, inside the ±~1.0 pt noise) and roughly symmetric across them, while weight-only NVFP4 drops Traditional Chinese by ~6 points. If you can spare the extra disk/bandwidth over NVFP4 and care about non-English quality, FP8 is the safer pick. Caveat: limit=30 on a single model, so treat these numbers as indicative. Full writeup.
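As a sanity check on the ±~1.0 pt figure, here is a minimal sketch that assumes it is a plain binomial standard error over independent questions (an assumption, since the source does not say how it was derived). The sample sizes follow from limit=30 times the subject counts quoted above.

```python
import math

def binomial_stderr_pts(accuracy_pct, n):
    """Standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

LIMIT = 30          # questions per subject, as in the run above
n_en = 57 * LIMIT   # MMLU: 57 subjects
n_tc = 66 * LIMIT   # TMMLU+: 66 subjects

# BF16 accuracies from the table are used as the reference point.
se_en = binomial_stderr_pts(78.30, n_en)
se_tc = binomial_stderr_pts(47.21, n_tc)
print(f"MMLU   N={n_en}: +/- {se_en:.2f} pt")
print(f"TMMLU+ N={n_tc}: +/- {se_tc:.2f} pt")
```

Under that assumption both errors come out at roughly one point, so the FP8 drops (0.35 and 0.24) sit well inside one standard error, while the NVFP4 drops (2.74 and 5.97) are several standard errors wide.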
Quantization recipe
llmcompressor, scheme FP8_DYNAMIC, data-free (no calibration data). Ignore list keeps the head and the multimodal projectors in BF16:
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head", "re:.*embed_vision.*", "re:.*embed_audio.*"])
```
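To see what the ignore list does, the sketch below approximates its convention in plain Python: entries prefixed with `re:` are treated as regular expressions, other entries as exact module names. This is a simplified stand-in, not the library's actual matcher, and the module names in `examples` are hypothetical; the real checkpoint's names may differ.

```python
import re

def is_ignored(module_name, ignore):
    """Approximate the ignore-list convention: plain entries match the module
    name exactly, entries prefixed with 're:' are regular expressions."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False

ignore = ["lm_head", "re:.*embed_vision.*", "re:.*embed_audio.*"]

# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.embed_vision.projection",
    "model.embed_audio.projection",
    "model.language_model.layers.0.self_attn.q_proj",
]
for name in examples:
    kept = "BF16 (ignored)" if is_ignored(name, ignore) else "FP8_DYNAMIC"
    print(f"{name}: {kept}")
```

Running the same check over the names from `model.named_modules()` before quantizing is a cheap way to confirm the head and projectors really are excluded.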
Serving (vLLM)
Needs vLLM with native Gemma4UnifiedForConditionalGeneration (~0.22.x / main) and the TRITON_ATTN backend (Gemma 4 has heterogeneous head dims: head_dim 256 x 16 = 4096 != hidden 3840):
```bash
VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  vllm serve coolthor/gemma-4-12B-it-FP8-dynamic --max-model-len 4096
```
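The head-dim note can be spelled out as arithmetic. The sketch below reads "256 x 16" as 16 attention heads of width 256 (my interpretation of the note) and uses the usual (out_features, in_features) layout of a linear layer's weight.

```python
# Values from the note above: 16 heads of head_dim 256, hidden size 3840.
num_heads = 16
head_dim = 256
hidden_size = 3840

attn_width = num_heads * head_dim          # width of the concatenated head outputs
o_proj_shape = (hidden_size, attn_width)   # (out_features, in_features) of o_proj

print(f"attention width = {attn_width}, hidden size = {hidden_size}")
print(f"o_proj weight shape = {o_proj_shape}, square = {attn_width == hidden_size}")
```

Because the attention width (4096) differs from the hidden size (3840), o_proj is a non-square projection, which is the case a backend that assumes attention width equals hidden size would get wrong.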
Environment (exact versions — this model is version-sensitive)
This is a brand-new arch on a brand-new GPU, so the toolchain matters more than usual. The versions I actually ran:
| Component | Version | Why it matters |
|---|---|---|
| vLLM | 0.22.1rc1.dev124 (main, post-PR) | Needs the native Gemma4UnifiedForConditionalGeneration class, which only landed around 0.22.x/main. On an older vLLM it falls back to the generic transformers backend, which mishandles Gemma 4's non-square attention and crashes on o_proj. |
| transformers | 5.10.1 | First release that knows model_type: gemma4_unified. Older transformers can't even load the config. |
| torch | 2.11.0+cu130 | The one that bit me. vLLM main pins torch==2.10, but its _C.abi3.so was compiled against 2.11+cu130 — installing the pinned 2.10 (and pip silently pulling the CPU wheel on arm64) gives an undefined symbol import error and a CPU-only build. Force-align: pip install --force-reinstall --no-deps torch==2.11.0 --index-url https://download.pytorch.org/whl/cu130. |
| compressed-tensors | bundled with llmcompressor | Reads the FP8 weight format. |
| GPU / arch | DGX Spark GB10, sm_121a, CUDA 13.x | The torch-ABI dance above is specific to building vLLM from source for sm_121. |
| Attention backend | VLLM_ATTENTION_BACKEND=TRITON_ATTN | Required, not optional — see the head-dim note above. |
On a normal CUDA GPU (Hopper/Ada/Blackwell desktop) you don't need the torch-ABI overlay — that pain is specific to building vLLM from source for sm_121. A recent pip install vllm (with the native Gemma4Unified class) plus VLLM_ATTENTION_BACKEND=TRITON_ATTN is enough. FP8 is data-free, so the quantization recipe above is the whole reproduce step — no calibration set needed.
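Since the stack is this version-sensitive, a quick pre-flight check can save a failed launch. The sketch below compares installed packages against the versions in the table using only the standard library; the prefix-matching rule and the `KNOWN_GOOD` name are illustrative choices, not anything vLLM provides.

```python
from importlib import metadata

# Versions from the environment table above; treat them as the known-good set,
# not as hard requirements.
KNOWN_GOOD = {
    "vllm": "0.22.1",
    "transformers": "5.10.1",
    "torch": "2.11.0",
}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_environment(known_good, get_version=installed_version):
    """Map each package to (installed, expected_prefix, matches)."""
    report = {}
    for package, prefix in known_good.items():
        found = get_version(package)
        matches = found is not None and found.startswith(prefix)
        report[package] = (found, prefix, matches)
    return report

if __name__ == "__main__":
    for package, (found, prefix, ok) in check_environment(KNOWN_GOOD).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{package}: installed={found} expected~={prefix} [{status}]")
```

Prefix matching lets local suffixes such as `+cu130` or `rc1.dev124` through. On the Spark it is also worth confirming `torch.cuda.is_available()` returns True, since a version string alone will not reveal a silently installed CPU wheel.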
Validation (GB10, transformers)
- Text: coherent EN + ZH.
- Image: accurately described a studio-podcast photo (cat + Shiba Inu, headphones, studio mics, latte-art mug).
- Audio: understood a LibriSpeech clip (Mr. Quilter / apostle / middle classes).
- Video: correctly described a night-street clip.
Credits
- Base model: google/gemma-4-12B-it (Apache 2.0, Google DeepMind)
- Quantization: llmcompressor + compressed-tensors
- Quantized & benchmarked by coolthor on a DGX Spark (GB10)
Support
One-person effort on a single DGX Spark, no sponsor. If it saved you time, a coffee ☕ is appreciated.
Model provider
- coolthor
Model tree
- Base: google/gemma-4-12B-it
- Quantized: this model
Modalities
- Input: Video, Audio, Text, Image
- Output: Text