Shankara-A-S
g4e4-it-v0
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0What this is
google/gemma-4-E4B-it with weights stored in bf16 and quantized to 4-bit
NF4 at load time by vLLM's BitsAndBytes integration
(load-format: bitsandbytes, quantization: bitsandbytes). No weights were
modified; compression happens entirely at runtime.
Official Round 1 results (organizer-measured, NVIDIA L4)
| Model | Energy (J) | Doc analysis | Image understanding | Mean recovery |
|---|---|---|---|---|
| BF16 base | 99.71 | 0.7608 | 0.68 | 100% |
| This artifact | 113.41 (+13.7%) | 0.7576 | 0.56 (−17.6%) | 91.45% |
The compressed model used more energy than the uncompressed base.
Why — the lesson this repo exists to teach
- Runtime dequantization is an energy trap. BnB dequantizes 4-bit tiles to higher precision on every attention and MLP forward. The compute spent unpacking exceeds the bandwidth saved by smaller weights. Stored-weight formats with fused int4 kernels (GPTQ-Marlin / AWQ-Marlin) do the matmul directly on packed weights and actually save energy (−52% in our Round 2 artifacts on identical hardware).
- NF4 hurts multimodal composition. The vision tower stays bf16, but the LM layers that compose vision-token embeddings are NF4-quantized; document OCR (mostly text decoding) survived, visual reasoning dropped 17.6%.
Our Round 2 artifacts fix both:
g4e4-it-r2-awq-smoke-v0
(primary — AWQ-Marlin full decoder + response-economy chat template, ~4–5×
less energy than this repo at higher recovery) and
g4e4-it-r2-w4a16-mlpo-v0
(GPTQ-Marlin over MLP and attention-output projections — the conservative
alternative).
Usage (reproduction only)
bash
vllm serve Shankara-A-S/g4e4-it-v0 --config vllm_config.yaml
Tested on vLLM 0.20.2. Sampling: temperature=1.0, top_p=0.95, top_k=64
(also in generation_config.json).
License
Apache 2.0, inherited from google/gemma-4-E4B-it.
Model provider
Shankara-A-S
Model tree
Base
google/gemma-4-E4B-it
Fine-tuned
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information