Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Why weight-only (W4A16), not W4A4

The "obvious" full NVFP4 (W4A4 = weight + activation both 4-bit) is worse on every axis for this model:

  • It breaks the multimodal capabilities — image/audio/video collapse to empty or garbled output. W4A4 quantizes activations using text-only calibration, so the image/audio embeddings are out-of-distribution and get clipped by the 4-bit activation range.
  • It is also slightly slower (23.9 vs 24.9 tok/s) — on a bandwidth-bound dense model, the weight-only dequant-to-BF16 path beats the W4A4 path.

So this build is weight-only NVFP4 (NVFP4A16): 4-bit weights, BF16 activations. Omni intact, faster, same file size.

Benchmark (GB10 / DGX Spark, vLLM 0.22.1 native, single-stream decode, warm)

FormatDisktok/s (EN/ZH)Omni
BF1623 GB7.7yes
FP8 dynamic13 GB15.9yes
NVFP4 W4A47.7 GB23.9broken
NVFP4 W4A16 (this)7.7 GB24.9yes

Note: in plain transformers (HF eager) all quantized formats run slower than BF16 because there is no native FP4/FP8 kernel — the speedups above are real only under vLLM.

Accuracy (MMLU + TMMLU+) — the honest tradeoff

Speed and size are not free. I scored all three formats on MMLU (English, 57 subjects) and TMMLU+ (Traditional Chinese, 66 subjects) with lm-evaluation-harness, 5-shot, chat template applied, limit=30 (N ≈ 1,710 EN / 1,980 TC, ±~1.0 pt), through transformers:

FormatMMLU (EN)TMMLU+ (TC)EN dropTC drop
BF1678.30%47.21%
FP8 dynamic77.95%46.97%−0.35−0.24
NVFP4 W4A16 (this)75.56%41.24%−2.74−5.97

The tax is real and uneven. This weight-only NVFP4 build costs ~2.7 points on English but ~6.0 on Traditional Chinese (−3.5% vs −12.6% relative) — the lower-resource language pays roughly twice as much. If your workload is English-heavy and you want the smallest/fastest build, that's a fair trade. If it's Traditional-Chinese-heavy, the sibling FP8 build is near-lossless on TC (−0.24). Single model, single recipe, limit=30 — indicative, not a universal law. Full writeup.

Quantization recipe

llmcompressor, scheme NVFP4A16, basic pipeline (the sequential pipeline hits a UserDict tracing error on this brand-new arch). The ignore list must match vLLM's native module quantization or the model will not load:

python

QuantizationModifier(targets="Linear", scheme="NVFP4A16",
ignore=["lm_head", "re:.*embedding_projection.*"])

i.e. quantize the text tower + the vision patch_dense; keep lm_head and both embedding_projections in BF16.

Serving (vLLM)

Needs vLLM with native Gemma4UnifiedForConditionalGeneration (~0.22.x / main) and the TRITON_ATTN backend — Gemma 4 has heterogeneous head dims (head_dim 256 x 16 heads = 4096 != hidden 3840) that other attention backends mishandle:

bash

VLLM_ATTENTION_BACKEND=TRITON_ATTN \
vllm serve coolthor/gemma-4-12B-it-NVFP4A16 --max-model-len 4096

Text and image serve through vLLM today; vLLM's generic multimodal wrapper is image-only for now, so full audio/video serving is pending upstream. All four modalities work through transformers.

Environment (exact versions — this model is version-sensitive)

This is a brand-new arch on a brand-new GPU, so the toolchain matters more than usual. The versions I actually ran:

ComponentVersionWhy it matters
vLLM0.22.1rc1.dev124 (main, post-PR)Needs the native Gemma4UnifiedForConditionalGeneration class, which only landed around 0.22.x/main. On an older vLLM it falls back to the generic transformers backend, which mishandles Gemma 4's non-square attention and crashes on o_proj.
transformers5.10.1First release that knows model_type: gemma4_unified. Older transformers can't even load the config.
torch2.11.0+cu130The one that bit me. vLLM main pins torch==2.10, but its _C.abi3.so was compiled against 2.11+cu130 — installing the pinned 2.10 (and pip silently pulling the CPU wheel on arm64) gives an undefined symbol import error and a CPU-only build. Force-align: pip install --force-reinstall --no-deps torch==2.11.0 --index-url https://download.pytorch.org/whl/cu130.
compressed-tensorsbundled with above llmcompressorReads the NVFP4 weight format.
GPU / archDGX Spark GB10, sm_121a, CUDA 13.xThe torch-ABI dance above is specific to building vLLM from source for sm_121.
Attention backendVLLM_ATTENTION_BACKEND=TRITON_ATTNRequired, not optional — see the head-dim note above.

On a normal CUDA GPU (Hopper/Ada/Blackwell desktop) you don't need the torch-ABI overlay — that pain is specific to building vLLM from source for sm_121. A recent pip install vllm (with the native Gemma4Unified class) plus VLLM_ATTENTION_BACKEND=TRITON_ATTN is enough.

The exact quantization config is in recipe.yaml in this repo (scheme + ignore list), so you can reproduce the build.

Validation (GB10, transformers)

  • Text: coherent EN + ZH.
  • Image: accurately described a studio-podcast photo (animals, headphones, studio mics, "ON AIR" sign, laptops).
  • Audio: transcribed a LibriSpeech clip — "Mr. Quilter is the apostle of the middle classes...".
  • Video: correctly described a night-street clip.

Credits

  • Base model: google/gemma-4-12B-it (Apache 2.0, Google DeepMind)
  • Quantization: llmcompressor + compressed-tensors
  • Quantized & benchmarked by coolthor on a DGX Spark (GB10)

Support

One-person effort on a single DGX Spark, no sponsor. If it saved you time, a coffee ☕ is appreciated.

Model provider

coolthor

Model tree

Base

google/gemma-4-12B-it

Quantized

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today