Model Details
| Field | Value |
|---|---|
| Base Model | llmfan46/gemma-4-31B-it-uncensored-heretic |
| Architecture | Gemma4ForConditionalGeneration (multimodal: text + vision) |
| Parameters | 31 B text decoder (quantized) + vision tower & embeddings kept in BF16 |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark 0.11.2 (ptpc_int8 scheme, pack_method='order') |
| Model Size | ~33.3 GB (2 safetensors shards) |
| Original Size | ~62.5 GB (BF16 source checkpoint) |
| Compression | ~1.9x size reduction |
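The compression figure can be re-derived from the two sizes in the table. The sketch below is only a sanity check: it assumes both sizes are decimal gigabytes and divides by the 31 B text-decoder count alone, so the bytes-per-parameter values are rough (the BF16 vision tower and embeddings are folded into the totals).

```python
# Sizes and parameter count copied from the table above.
original_gb = 62.5    # BF16 source checkpoint
quantized_gb = 33.3   # W8A8 INT8 export
text_params_b = 31.0  # text-decoder parameters, in billions

ratio = original_gb / quantized_gb
bytes_per_param_bf16 = original_gb / text_params_b
bytes_per_param_int8 = quantized_gb / text_params_b

print(f"compression: {ratio:.2f}x")
print(f"approx. bytes/param: {bytes_per_param_bf16:.2f} -> {bytes_per_param_int8:.2f}")
```

The quantized checkpoint lands slightly above one byte per parameter, which is consistent with INT8 weights plus scales and the layers that remain in BF16.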
Quantization Scheme
| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Weight | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head | BF16 | — | unquantized |
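To make the scheme in the table concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per weight output channel and one scale per activation token. It is an illustration only, not Quark's or vLLM's implementation: the function names are hypothetical, the clamp range of [-127, 127] is an assumption, and real kernels operate on tensors rather than nested lists.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    For a weight matrix [out_features, in_features] a row is an output
    channel; for an activation matrix [tokens, in_features] a row is a token.
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w):
    """Approximate x @ w.T using INT8 operands and integer accumulation."""
    qx, sx = quantize_rows(x)  # per-token scales, computed per call (dynamic)
    qw, sw = quantize_rows(w)  # per-channel scales, precomputed in practice (static)
    out = []
    for t, x_row in enumerate(qx):
        out_row = []
        for c, w_row in enumerate(qw):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulate
            out_row.append(acc * sx[t] * sw[c])             # dequantize
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]   # 2 tokens, 3 input features
w = [[0.2, 0.4, -0.6], [1.5, -0.5, 0.05]]   # 2 output channels
approx = w8a8_linear(x, w)
exact = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
```

Because each output element is rescaled by the product of one token scale and one channel scale, the dequantization step stays outside the inner integer dot product.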
Accuracy
GSM8K 8-shot evaluation on the full 1319-question test split using vLLM's
OpenAI-compatible chat API (temperature=0, concurrency=16, max_tokens=512,
standard chat template, and final-answer format #### <answer>). Values are
only populated when this card is regenerated with EVAL_BASELINE_JSON and/or
EVAL_QUANT_JSON from eval_gsm8k_chat_vllm.py.
| Model | Scheme | Accuracy | Correct |
|---|---|---|---|
| llmfan46/gemma-4-31B-it-uncensored-heretic (BF16 baseline) | — | not measured by this script | — |
| This model (Quark W8A8 INT8) | per-channel weight + per-token act. | not measured by this script | — |
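For readers who want to score their own runs, the following is a minimal sketch of how answers in the `#### <answer>` format could be extracted and compared. It is not the `eval_gsm8k_chat_vllm.py` script; the helper names and the exact regular expression are assumptions made for illustration.

```python
import re

# Matches the number that follows a '####' marker, e.g. '#### 1,234' or '#### -7.5'.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text):
    """Return the number after the last '####' marker as a string, or None."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(prediction, reference):
    """Compare the final answers numerically; a missing answer counts as wrong."""
    pred = extract_final_answer(prediction)
    ref = extract_final_answer(reference)
    if pred is None or ref is None:
        return False
    return float(pred) == float(ref)


def accuracy(predictions, references):
    """Return (number correct, fraction correct) over paired outputs."""
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct, correct / len(references)
```

On the full test split, `accuracy` would be called with 1319 model outputs and the matching reference solutions.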
How to Use
With vLLM (Recommended)
# Generic vLLM serve command. Use TP=4 for 4x24GB RDNA3 cards.
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
--trust-remote-code
# On the local RDNA3/gfx1100 vLLM fork, use the optimized INT8 Triton path:
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
--trust-remote-code \
--linear-backend triton
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8",
"messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
"max_tokens": 256,
"temperature": 0.7
}'
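The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above; the helper names are hypothetical, and the response parsing assumes the usual OpenAI-style chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

MODEL = "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8"


def build_chat_request(prompt, base_url="http://localhost:8000",
                       max_tokens=256, temperature=0.7):
    """Build the POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Hello! What is the capital of France?")` returns the assistant's reply as a string.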
Hardware Requirements
- Single GPU: roughly 48 GB of VRAM or more for typical short-context
serving.
- 4x24 GB consumer RDNA3: use tensor parallelism, for example
--tensor-parallel-size 4, and keep multimodal limits disabled for text-only
serving as shown above.
- For longer context or larger batches, increase tensor parallelism or reduce
--max-model-len / batch settings.
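A rough per-GPU budget helps explain these recommendations. The sketch below assumes the ~33.3 GB checkpoint is split evenly across tensor-parallel ranks and that `--gpu-memory-utilization` caps what vLLM may use on each card; in practice some tensors may be replicated and runtime overhead varies, so treat the numbers as estimates. The function name is hypothetical.

```python
def per_gpu_budget_gb(model_gb, tp_size, gpu_vram_gb, gpu_memory_utilization):
    """Return (weights per GPU, memory left for KV cache and activations)."""
    weights_per_gpu = model_gb / tp_size
    usable = gpu_vram_gb * gpu_memory_utilization
    return weights_per_gpu, usable - weights_per_gpu


# 4x24 GB cards with the serve settings shown above.
weights, headroom = per_gpu_budget_gb(33.3, 4, 24.0, 0.9)
print(f"weights/GPU ~{weights:.1f} GB, remaining ~{headroom:.1f} GB")

# Single 48 GB card for comparison.
weights_1, headroom_1 = per_gpu_budget_gb(33.3, 1, 48.0, 0.9)
print(f"weights/GPU ~{weights_1:.1f} GB, remaining ~{headroom_1:.1f} GB")
```

Under these assumptions the single 48 GB card has less memory left over than each card in the 4x24 GB setup, which is why longer contexts or larger batches push toward more tensor parallelism.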
Quantization Details
This model was quantized using AMD Quark's per-token per-channel INT8 scheme:
- Weight quantization: INT8 per-channel (one scale per output channel),
symmetric, static.
- Activation quantization: INT8 per-token (one scale per token), symmetric,
dynamic (computed at inference time).
- Excluded layers: lm_head, token embeddings, the full vision_tower, and
embed_vision remain BF16.
- Export: pack_method='order', weight_format='real_quantized', with Quark
metadata (quant_method='quark'). The checkpoint stores real INT8 weights with
BF16 or FP32 scales, depending on the loader/export path; no fake-quant
weights are stored.
- Quantization path: CPU file-to-file quantization with no calibration data and
no full model graph load.
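One way to confirm that a shard really stores INT8 weights alongside BF16 or FP32 tensors is to read the safetensors header, which is an 8-byte little-endian length followed by a JSON document listing each tensor's dtype. The sketch below uses only the standard library; the helper names are hypothetical and the shard path is whatever file you downloaded.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse the JSON header: 8-byte little-endian length, then JSON."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def dtype_counts(header):
    """Count tensors per dtype string, skipping the optional metadata entry."""
    return Counter(
        info["dtype"] for name, info in header.items() if name != "__metadata__"
    )
```

Calling `dtype_counts(read_safetensors_header(shard_path))` on each shard should show `I8` entries for the quantized linear weights, with `BF16` or `F32` entries for scales and for the excluded layers.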
Reproduce Quantization
pip install amd-quark==0.11.1 huggingface_hub safetensors
HIP_VISIBLE_DEVICES="" ROCR_VISIBLE_DEVICES="" CUDA_VISIBLE_DEVICES="" \
MODEL_IN=llmfan46/gemma-4-31B-it-uncensored-heretic \
MODEL_OUT=./Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
HF_REPO_ID=valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
python quantize_gemma4_heretic_quark_w8a8_cpu.py
The script builds the Quark config explicitly:
weight: INT8, per_channel, ch_axis=0, symmetric=True, is_dynamic=False
input: INT8, per_channel, ch_axis=1, symmetric=True, is_dynamic=True
and then runs:
quantizer.direct_quantize_checkpoint(
pretrained_model_path=MODEL_IN,
save_path=MODEL_OUT,
device="cpu",
)
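After the script finishes, the output directory can be sanity-checked by looking for the Quark block in config.json. This is a small sketch with hypothetical helper names; it assumes the `quant_method` field sits inside the `quantization_config` block, and it makes no other assumption about that block's layout.

```python
import json
from pathlib import Path


def load_quant_config(model_dir):
    """Return the quantization_config block from config.json, or None."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization_config")


def is_quark_export(quant_config):
    """True when the block exists and declares quant_method == 'quark'."""
    return bool(quant_config) and quant_config.get("quant_method") == "quark"
```

For example, `is_quark_export(load_quant_config(MODEL_OUT))` should be true for the directory written by the quantization script.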
Citation
If you use this model, please cite the original Gemma 4 release and the upstream
source model:
@misc{google2026gemma4,
title = {Gemma 4},
author = {Google DeepMind},
year = {2026},
url = {https://huggingface.co/google/gemma-4-31B-it}
}
Upstream model: llmfan46/gemma-4-31B-it-uncensored-heretic.
License
This quantized derivative follows the license of the upstream model, whose
Hugging Face metadata lists Apache 2.0 / Gemma 4 license terms. Check the
upstream model card and any included LICENSE / NOTICE files for complete
terms.
Modified files include the INT8-quantized safetensors checkpoint and the appended
Quark quantization_config block in config.json. No warranty of any kind is
provided.