g4e4-it-v0 API & Inference Endpoint

What this is

google/gemma-4-E4B-it with weights stored in bf16 and quantized to 4-bit NF4 at load time by vLLM's BitsAndBytes integration (load-format: bitsandbytes, quantization: bitsandbytes). No weights were modified; compression happens entirely at runtime.

Official Round 1 results (organizer-measured, NVIDIA L4)

Table with columns: Model, Energy (J), Doc analysis, Image understanding, Mean recovery
Model	Energy (J)	Doc analysis	Image understanding	Mean recovery
BF16 base	99.71	0.7608	0.68	100%
This artifact	113.41 (+13.7%)	0.7576	0.56 (−17.6%)	91.45%

The compressed model used more energy than the uncompressed base.

Why — the lesson this repo exists to teach

Runtime dequantization is an energy trap. BnB dequantizes 4-bit tiles to higher precision on every attention and MLP forward. The compute spent unpacking exceeds the bandwidth saved by smaller weights. Stored-weight formats with fused int4 kernels (GPTQ-Marlin / AWQ-Marlin) do the matmul directly on packed weights and actually save energy (−52% in our Round 2 artifacts on identical hardware).
NF4 hurts multimodal composition. The vision tower stays bf16, but the LM layers that compose vision-token embeddings are NF4-quantized; document OCR (mostly text decoding) survived, visual reasoning dropped 17.6%.

Our Round 2 artifacts fix both: g4e4-it-r2-awq-smoke-v0 (primary — AWQ-Marlin full decoder + response-economy chat template, ~4–5× less energy than this repo at higher recovery) and g4e4-it-r2-w4a16-mlpo-v0 (GPTQ-Marlin over MLP and attention-output projections — the conservative alternative).

Usage (reproduction only)

bash
vllm serve Shankara-A-S/g4e4-it-v0 --config vllm_config.yaml

Tested on vLLM 0.20.2. Sampling: temperature=1.0, top_p=0.95, top_k=64 (also in generation_config.json).

License

Apache 2.0, inherited from google/gemma-4-E4B-it.

What this is

Official Round 1 results (organizer-measured, NVIDIA L4)

Table with columns: Model, Energy (J), Doc analysis, Image understanding, Mean recovery
Model	Energy (J)	Doc analysis	Image understanding	Mean recovery
BF16 base	99.71	0.7608	0.68	100%
This artifact	113.41 (+13.7%)	0.7576	0.56 (−17.6%)	91.45%

The compressed model used more energy than the uncompressed base.

Why — the lesson this repo exists to teach

Runtime dequantization is an energy trap. BnB dequantizes 4-bit tiles to higher precision on every attention and MLP forward. The compute spent unpacking exceeds the bandwidth saved by smaller weights. Stored-weight formats with fused int4 kernels (GPTQ-Marlin / AWQ-Marlin) do the matmul directly on packed weights and actually save energy (−52% in our Round 2 artifacts on identical hardware).
NF4 hurts multimodal composition. The vision tower stays bf16, but the LM layers that compose vision-token embeddings are NF4-quantized; document OCR (mostly text decoding) survived, visual reasoning dropped 17.6%.

Usage (reproduction only)

bash
vllm serve Shankara-A-S/g4e4-it-v0 --config vllm_config.yaml

Tested on vLLM 0.20.2. Sampling: temperature=1.0, top_p=0.95, top_k=64 (also in generation_config.json).

License

Apache 2.0, inherited from google/gemma-4-E4B-it.

g4e4-it-v0

README

What this is

Official Round 1 results (organizer-measured, NVIDIA L4)

Why — the lesson this repo exists to teach

Usage (reproduction only)

License

Explore FriendliAI today

README

What this is

Official Round 1 results (organizer-measured, NVIDIA L4)

Why — the lesson this repo exists to teach

Usage (reproduction only)

License