README

License: apache-2.0

Model Details

Field | Value
Base Model | llmfan46/gemma-4-31B-it-uncensored-heretic
Architecture | Gemma4ForConditionalGeneration (multimodal: text + vision)
Parameters | 31B text decoder (quantized) + vision tower & embeddings kept in BF16
Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation)
Quantizer | AMD Quark 0.11.2 (ptpc_int8 scheme, pack_method='order')
Model Size | ~33.3 GB (2 safetensors shards)
Original Size | ~62.5 GB (BF16 source checkpoint)
Compression | ~1.9x size reduction
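The compression figure follows directly from the two checkpoint sizes listed above. A quick check, using only those approximate numbers:

```python
# Sizes as listed in the Model Details table (approximate, in GB).
original_gb = 62.5   # BF16 source checkpoint
quantized_gb = 33.3  # INT8 checkpoint, with some components kept in BF16

ratio = original_gb / quantized_gb
saved_fraction = 1 - quantized_gb / original_gb

print(f"compression: {ratio:.2f}x")
print(f"size saved:  {saved_fraction:.0%}")
```

The ratio stays a little under 2x, which is consistent with lm_head, the embeddings, and the vision tower remaining in BF16.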

Quantization Scheme

Component | dtype | Granularity | Mode
Weight | INT8 | per-channel (ch_axis=0) | symmetric, static
Activation | INT8 | per-token (ch_axis=1) | symmetric, dynamic
lm_head | BF16 | - | unquantized
embed_tokens | BF16 | - | unquantized
vision_tower / embed_vision | BF16 | - | unquantized (multimodal preserved)

Accuracy

Accuracy is evaluated on GSM8K (8-shot) over the full 1319-question test split through vLLM's OpenAI-compatible chat API (temperature=0, concurrency=16, max_tokens=512, standard chat template, and final-answer format #### <answer>). Values are populated only when this card is regenerated with EVAL_BASELINE_JSON and/or EVAL_QUANT_JSON from eval_gsm8k_chat_vllm.py.
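To illustrate the final-answer format, here is a minimal scoring sketch. It is not the eval_gsm8k_chat_vllm.py script; the function names and the regular expression are assumptions chosen for illustration.

```python
import re

# Matches the final-answer marker "#### <answer>"; thousands separators
# and a decimal part are tolerated. (Illustrative pattern, an assumption.)
ANSWER_RE = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")

def extract_answer(text):
    """Return the last '#### <answer>' value in text, commas removed, or None."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def is_correct(completion, reference):
    """Compare the numeric final answers of a completion and a reference."""
    pred = extract_answer(completion)
    gold = extract_answer(reference)
    return pred is not None and gold is not None and float(pred) == float(gold)

def accuracy(pairs):
    """Return (number correct, fraction correct) over (completion, reference) pairs."""
    pairs = list(pairs)
    correct = sum(is_correct(c, r) for c, r in pairs)
    return correct, correct / len(pairs)
```

A completion that never emits the #### marker counts as incorrect here, which is one common convention; the actual script may handle that case differently.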

Model | Scheme | Accuracy | Correct
llmfan46/gemma-4-31B-it-uncensored-heretic (BF16 baseline) | | not measured by this script |
This model (Quark W8A8 INT8) | per-channel weight + per-token act. | not measured by this script |

How to Use

With vLLM (Recommended)

bash

# Generic vLLM serve command. Use TP=4 for 4x24GB RDNA3 cards.
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --trust-remote-code

# On the local RDNA3/gfx1100 vLLM fork, use the optimized INT8 Triton path:
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
  --trust-remote-code \
  --linear-backend triton

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'
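The same request can be sent from Python using only the standard library. This is a sketch that assumes the server started above is listening on localhost:8000; the helper names build_request and chat are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8"

def build_request(prompt, max_tokens=256, temperature=0.7):
    """Build the POST request carrying the same JSON body as the curl example."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses place the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello! What is the capital of France?") returns the reply as a string.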

Hardware Requirements

  • Single GPU: roughly 48 GB of VRAM or more for typical short-context serving.
  • 4x24 GB consumer RDNA3: use tensor parallelism, for example --tensor-parallel-size 4, and keep multimodal inputs disabled (limits set to 0) for text-only serving, as in the commands above.
  • For longer context or larger batches, increase tensor parallelism or reduce --max-model-len / batch settings.
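These guidelines can be sanity-checked with rough arithmetic. The sketch below assumes the ~33.3 GB checkpoint is split evenly across tensor-parallel ranks and ignores runtime overhead; in practice some tensors may be replicated rather than sharded, so treat the headroom as an upper bound. The function name is illustrative.

```python
def per_gpu_budget_gb(weights_gb, gpu_vram_gb, tensor_parallel, utilization=0.9):
    """Rough per-GPU split: usable memory, weight shard, and what is left over.

    Assumes weights shard evenly across ranks (an approximation).
    """
    usable = gpu_vram_gb * utilization
    weight_shard = weights_gb / tensor_parallel
    return usable, weight_shard, usable - weight_shard

for vram, tp in [(24, 4), (48, 1)]:
    usable, shard, headroom = per_gpu_budget_gb(33.3, vram, tp)
    print(f"{tp} x {vram} GB: usable {usable:.1f} GB, "
          f"weights {shard:.1f} GB, headroom {headroom:.1f} GB per GPU")
```

The headroom is what remains for the KV cache and activations, which is why longer contexts or larger batches call for more tensor parallelism or a smaller --max-model-len.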

Quantization Details

This model was quantized using AMD Quark's per-token per-channel INT8 scheme:

  • Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static.
  • Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time).
  • Excluded layers: lm_head, token embeddings, the full vision_tower, and embed_vision remain BF16.
  • Export: pack_method='order' and weight_format='real_quantized', with Quark metadata (quant_method='quark'). The checkpoint stores real INT8 weights with BF16 or FP32 scales, depending on the loader/export path; no fake-quant weights are stored.
  • Quantization path: CPU file-to-file quantization with no calibration data and no full model graph load.
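The scheme described above can be sketched in a few lines of plain Python. This is a toy illustration of symmetric per-channel weight scales and dynamic per-token activation scales, not Quark's implementation; the function names and the tiny matrices are made up for the example.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    Each value becomes round(x / scale), clamped to [-127, 127].
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def w8a8_linear(x, w_q, w_scales):
    """Compute y = x @ W^T from INT8 weights and dynamically quantized inputs."""
    x_q, x_scales = quantize_rows(x)  # dynamic: one scale per token (row of x)
    out = []
    for xi, sx in zip(x_q, x_scales):
        row = []
        for wj, sw in zip(w_q, w_scales):
            acc = sum(a * b for a, b in zip(xi, wj))  # integer accumulation
            row.append(acc * sx * sw)                 # rescale to real values
        out.append(row)
    return out

# Weight has shape [out_features, in_features]; it is quantized once, offline
# (static), with one scale per output channel (row).
W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
W_q, W_scales = quantize_rows(W)

X = [[1.0, 2.0, 3.0], [-0.5, 0.0, 0.5]]  # two tokens
Y = w8a8_linear(X, W_q, W_scales)
Y_ref = [[sum(a * b for a, b in zip(x, w)) for w in W] for x in X]
```

Comparing Y with Y_ref shows the quantized result tracking the full-precision product closely on this small example. Weight scales need no calibration data because they depend only on the weights, and activation scales are computed per token at inference time.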

Reproduce Quantization

bash

pip install amd-quark==0.11.1 huggingface_hub safetensors
HIP_VISIBLE_DEVICES="" ROCR_VISIBLE_DEVICES="" CUDA_VISIBLE_DEVICES="" \
MODEL_IN=llmfan46/gemma-4-31B-it-uncensored-heretic \
MODEL_OUT=./Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
HF_REPO_ID=valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
python quantize_gemma4_heretic_quark_w8a8_cpu.py

The script builds the Quark config explicitly:

python

weight: INT8, per_channel, ch_axis=0, symmetric=True, is_dynamic=False
input: INT8, per_channel, ch_axis=1, symmetric=True, is_dynamic=True

and then runs:

python

quantizer.direct_quantize_checkpoint(
    pretrained_model_path=MODEL_IN,
    save_path=MODEL_OUT,
    device="cpu",
)
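After the script finishes, one way to sanity-check the output is to confirm that config.json carries the quantization_config block with quant_method set to 'quark'. The helper below is a hypothetical sketch, not part of the quantization script.

```python
import json
from pathlib import Path

def read_quant_method(model_dir):
    """Return quantization_config['quant_method'] from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization_config", {}).get("quant_method")
```

Pointing it at MODEL_OUT, for example read_quant_method("./Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8"), should return "quark" if the export completed as described.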

Citation

If you use this model, please cite the original Gemma 4 release and the upstream source model:

bibtex

@misc{google2026gemma4,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://huggingface.co/google/gemma-4-31B-it}
}

Upstream model: llmfan46/gemma-4-31B-it-uncensored-heretic.

License

This quantized derivative follows the license of the upstream model, whose Hugging Face metadata lists Apache 2.0 alongside the Gemma 4 license. Check the upstream model card and any included LICENSE / NOTICE files for complete terms.

Modified files include the INT8-quantized safetensors checkpoint and the appended Quark quantization_config block in config.json. No warranty of any kind is provided.

Model provider

valoomba

valoomba

Model tree

Base: llmfan46/gemma-4-31B-it-uncensored-heretic
Quantized: this model

Modalities

Input: Text, Image
Output: Text
