Shankara-A-S

g4e4-it-v0

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What this is

google/gemma-4-E4B-it with weights stored in bf16 and quantized to 4-bit NF4 at load time by vLLM's BitsAndBytes integration (load-format: bitsandbytes, quantization: bitsandbytes). No weights were modified; compression happens entirely at runtime.

Official Round 1 results (organizer-measured, NVIDIA L4)

Table
ModelEnergy (J)Doc analysisImage understandingMean recovery
BF16 base99.710.76080.68100%
This artifact113.41 (+13.7%)0.75760.56 (−17.6%)91.45%

The compressed model used more energy than the uncompressed base.

Why — the lesson this repo exists to teach

  1. Runtime dequantization is an energy trap. BnB dequantizes 4-bit tiles to higher precision on every attention and MLP forward. The compute spent unpacking exceeds the bandwidth saved by smaller weights. Stored-weight formats with fused int4 kernels (GPTQ-Marlin / AWQ-Marlin) do the matmul directly on packed weights and actually save energy (−52% in our Round 2 artifacts on identical hardware).
  2. NF4 hurts multimodal composition. The vision tower stays bf16, but the LM layers that compose vision-token embeddings are NF4-quantized; document OCR (mostly text decoding) survived, visual reasoning dropped 17.6%.

Our Round 2 artifacts fix both: g4e4-it-r2-awq-smoke-v0 (primary — AWQ-Marlin full decoder + response-economy chat template, ~4–5× less energy than this repo at higher recovery) and g4e4-it-r2-w4a16-mlpo-v0 (GPTQ-Marlin over MLP and attention-output projections — the conservative alternative).

Usage (reproduction only)

bash

vllm serve Shankara-A-S/g4e4-it-v0 --config vllm_config.yaml

Tested on vLLM 0.20.2. Sampling: temperature=1.0, top_p=0.95, top_k=64 (also in generation_config.json).

License

Apache 2.0, inherited from google/gemma-4-E4B-it.

Model provider

Shankara-A-S

Model tree

Base

google/gemma-4-E4B-it

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today