License: apache-2.0

Model Details
| Field | Value |
|---|---|
| Base Model | llmfan46/gemma-4-31B-it-uncensored-heretic |
| Architecture | Gemma4ForConditionalGeneration (multimodal: text + vision) |
| Parameters | 31 B text decoder (quantized) + vision tower & embeddings kept in BF16 |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark 0.11.2 (ptpc_int8 scheme, pack_method='order') |
| Model Size | ~33.3 GB (2 safetensors shards) |
| Original Size | ~62.5 GB (BF16 source checkpoint) |
| Compression | ~1.9x size reduction |
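
As a quick check of the compression figure, the ratio follows directly from the two approximate sizes in the table. The snippet is plain arithmetic on those values; the variable names are illustrative only.

```python
# Approximate checkpoint sizes taken from the table above (GB).
original_gb = 62.5   # BF16 source checkpoint
quantized_gb = 33.3  # W8A8 INT8 checkpoint, including the parts kept in BF16

ratio = original_gb / quantized_gb
saved_fraction = 1 - quantized_gb / original_gb

print(f"compression: {ratio:.2f}x, saved: {saved_fraction:.1%}")
# prints: compression: 1.88x, saved: 46.7%
```

The ratio stays below 2x because lm_head, the embeddings, and the vision components remain in BF16 and the quantization scales are stored alongside the INT8 weights.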
Quantization Scheme
| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Weight | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head | BF16 | — | unquantized |
| embed_tokens | BF16 | — | unquantized |
| vision_tower / embed_vision | BF16 | — | unquantized (multimodal preserved) |
Accuracy
GSM8K 8-shot evaluation on the full 1319-question test split using vLLM's
OpenAI-compatible chat API (temperature=0, concurrency=16, max_tokens=512,
standard chat template, and final-answer format #### <answer>). Values are
only populated when this card is regenerated with EVAL_BASELINE_JSON and/or
EVAL_QUANT_JSON from eval_gsm8k_chat_vllm.py.
| Model | Scheme | Accuracy | Correct |
|---|---|---|---|
| llmfan46/gemma-4-31B-it-uncensored-heretic (BF16 baseline) | — | not measured by this script | — |
| This model (Quark W8A8 INT8) | per-channel weight + per-token act. | not measured by this script | — |
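
To illustrate how the final-answer format #### <answer> can be scored, here is a minimal sketch of answer extraction and accuracy computation. It is not the logic of eval_gsm8k_chat_vllm.py, which this card does not show; the function names and the regular expression are assumptions made for illustration.

```python
import re

# Matches the final-answer marker "#### <answer>"; thousands separators and a
# leading minus sign are allowed in the number.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text):
    """Return the last '#### <answer>' value with commas removed, or None."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(completion, reference):
    """Compare the numeric final answers of a completion and a reference."""
    predicted = extract_answer(completion)
    expected = extract_answer(reference)
    if predicted is None or expected is None:
        return False
    return float(predicted) == float(expected)


def accuracy(pairs):
    """Fraction of (completion, reference) pairs whose final answers match."""
    pairs = list(pairs)
    correct = sum(is_correct(c, r) for c, r in pairs)
    return correct / len(pairs) if pairs else 0.0
```

A completion that never emits the marker counts as incorrect in this sketch, which is one reasonable convention when max_tokens truncates a response.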
How to Use
With vLLM (Recommended)
```bash
# Generic vLLM serve command. Use TP=4 for 4x24GB RDNA3 cards.
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --trust-remote-code

# On the local RDNA3/gfx1100 vLLM fork, use the optimized INT8 Triton path:
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --trust-remote-code \
  --linear-backend triton
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
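
The same request can be sent from Python. The sketch below mirrors the curl call above using only the standard library; the helper names build_chat_request and chat are hypothetical, and the response parsing assumes the usual OpenAI-compatible chat completion shape (choices[0].message.content).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same local server as the curl example
MODEL = "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8"


def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello! What is the capital of France?") should return the reply text as a string.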
Hardware Requirements
- Minimum for one GPU: roughly 48 GB of VRAM or more for typical short-context serving.
- 4x24 GB consumer RDNA3: use tensor parallelism, for example --tensor-parallel-size 4, and keep multimodal inputs disabled (all --limit-mm-per-prompt values set to 0) for text-only serving, as in the serve command above.
- For longer context or larger batches, increase tensor parallelism or reduce --max-model-len and batch settings.
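
A rough way to reason about these requirements is to split the checkpoint across the tensor-parallel ranks and compare it with the memory vLLM is allowed to use. The sketch below assumes the weights shard evenly, which is a simplification since some tensors may be replicated per rank; per_gpu_budget is a hypothetical helper, not a vLLM API.

```python
def per_gpu_budget(weights_gb, tp_size, vram_gb, gpu_memory_utilization):
    """Rough per-GPU memory split, assuming weights shard evenly across ranks."""
    weights_per_gpu = weights_gb / tp_size
    usable = vram_gb * gpu_memory_utilization
    return {
        "weights_per_gpu_gb": round(weights_per_gpu, 1),
        "usable_gb": round(usable, 1),
        # What remains for KV cache, activations, and runtime overhead.
        "headroom_gb": round(usable - weights_per_gpu, 1),
    }


# 4x24 GB cards with --tensor-parallel-size 4 and --gpu-memory-utilization 0.9
print(per_gpu_budget(33.3, 4, 24, 0.9))
# prints: {'weights_per_gpu_gb': 8.3, 'usable_gb': 21.6, 'headroom_gb': 13.3}
```

The headroom figure is what bounds context length and batch size, which is why longer contexts call for more tensor parallelism or a smaller --max-model-len.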
Quantization Details
This model was quantized using AMD Quark's per-token per-channel INT8 scheme:
- Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static.
- Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time).
- Excluded layers: lm_head, token embeddings, the full vision_tower, and embed_vision remain BF16.
- Export: pack_method='order', weight_format='real_quantized', Quark metadata (quant_method='quark') → real INT8 weights with BF16/FP32 scales depending on loader/export path; no fake-quant weights are stored.
- Quantization path: CPU file-to-file quantization with no calibration data and no full model graph load.
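
The scheme described above can be sketched numerically in a few lines. This is a pure-Python illustration of symmetric per-channel weight quantization and per-token dynamic activation quantization, not Quark's exporter or vLLM's kernel; the function names and the toy tensors are made up for the example.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    For a weight of shape [out_features, in_features] a row is an output
    channel; for an activation of shape [tokens, in_features] a row is a token.
    """
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def w8a8_linear(x, w_q, w_scales):
    """Compute x @ W^T with INT8 operands, then rescale per token and channel."""
    x_q, x_scales = quantize_rows(x)  # dynamic: scales are computed per call
    out = []
    for x_row, x_scale in zip(x_q, x_scales):
        out_row = []
        for w_row, w_scale in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulate
            out_row.append(acc * x_scale * w_scale)
        out.append(out_row)
    return out


# Static weight quantization happens once, ahead of time.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]  # [out_features=2, in_features=3]
w_q, w_scales = quantize_rows(weight)

tokens = [[1.0, 2.0, 3.0], [-0.5, 0.0, 0.5]]    # [tokens=2, in_features=3]
approx = w8a8_linear(tokens, w_q, w_scales)
exact = [[sum(a * b for a, b in zip(t, w)) for w in weight] for t in tokens]
```

Because each output element is rescaled by one token scale times one channel scale, the integer matmul can run entirely in INT8 with a single multiply per output to return to floating point.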
Reproduce Quantization
```bash
pip install amd-quark==0.11.1 huggingface_hub safetensors

HIP_VISIBLE_DEVICES="" ROCR_VISIBLE_DEVICES="" CUDA_VISIBLE_DEVICES="" \
MODEL_IN=llmfan46/gemma-4-31B-it-uncensored-heretic \
MODEL_OUT=./Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
HF_REPO_ID=valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
python quantize_gemma4_heretic_quark_w8a8_cpu.py
```
The script builds the Quark config explicitly:
```python
weight: INT8, per_channel, ch_axis=0, symmetric=True, is_dynamic=False
input:  INT8, per_channel, ch_axis=1, symmetric=True, is_dynamic=True
```
and then runs:
```python
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=MODEL_IN,
    save_path=MODEL_OUT,
    device="cpu",
)
```
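
After the script finishes, one quick sanity check is to confirm that the exported config.json carries the Quark metadata. The helper below is a hypothetical sketch and not part of the quantization script; it assumes the quant_method key sits inside the quantization_config block, which is the usual Hugging Face layout but is not confirmed by this card.

```python
import json
from pathlib import Path


def read_quant_config(model_dir):
    """Return the quantization_config block from <model_dir>/config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant_config = config.get("quantization_config")
    if quant_config is None:
        raise ValueError("config.json has no quantization_config block")
    method = quant_config.get("quant_method")
    if method != "quark":
        raise ValueError(f"unexpected quant_method: {method!r}")
    return quant_config
```

Calling read_quant_config on the MODEL_OUT directory should return the block rather than raise if the export completed.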
Citation
If you use this model, please cite the original Gemma 4 release and the upstream source model:
```bibtex
@misc{google2026gemma4,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://huggingface.co/google/gemma-4-31B-it}
}
```
Upstream model: llmfan46/gemma-4-31B-it-uncensored-heretic.
License
This quantized derivative follows the license metadata of the upstream model,
which is listed as Apache 2.0 / Gemma 4 license metadata on Hugging Face. Check
the upstream model card and any included LICENSE / NOTICE files for complete
terms.
Modified files include the INT8-quantized safetensors checkpoint and the appended
Quark quantization_config block in config.json. No warranty of any kind is
provided.
Model provider
valoomba
Model tree
Base
llmfan46/gemma-4-31B-it-uncensored-heretic
Quantized
this model
Modalities
Input
Text, Image
Output
Text