valoomba

Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8

Model Details

| Field | Value |
|---|---|
| Base Model | `llmfan46/gemma-4-31B-it-uncensored-heretic` |
| Architecture | `Gemma4ForConditionalGeneration` (multimodal: text + vision) |
| Parameters | 31 B text decoder (quantized) + vision tower & embeddings kept in BF16 |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark `0.11.2` (`ptpc_int8` scheme, `pack_method='order'`) |
| Model Size | ~33.3 GB (2 safetensors shards) |
| Original Size | ~62.5 GB (BF16 source checkpoint) |
| Compression | ~1.9x size reduction |
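The sizes in the table can be roughly sanity-checked from the parameter count. The sketch below assumes about 31 billion parameters stored at 2 bytes each (BF16) or 1 byte each (INT8), and it ignores the parts kept in BF16 and the quantization scales, which is why the INT8 estimate lands below the reported checkpoint size. The helper name is illustrative only.

```python
PARAMS = 31e9  # approximate parameter count from the table (assumption: all quantized)

def checkpoint_gb(params: float, bytes_per_param: float) -> float:
    """Estimated checkpoint size in decimal gigabytes."""
    return params * bytes_per_param / 1e9

bf16_gb = checkpoint_gb(PARAMS, 2)   # BF16 stores 2 bytes per parameter
int8_gb = checkpoint_gb(PARAMS, 1)   # INT8 stores 1 byte per parameter

# Ratio of the sizes actually reported in the table.
reported_ratio = 62.5 / 33.3

print(f"BF16 estimate: {bf16_gb:.1f} GB, INT8 estimate: {int8_gb:.1f} GB")
print(f"Reported compression: {reported_ratio:.2f}x")
```

The reported ratio stays a little under 2x because the vision tower, embeddings, and `lm_head` remain in BF16.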

Quantization Scheme

| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Weight | INT8 | per-channel (`ch_axis=0`) | symmetric, static |
| Activation | INT8 | per-token (`ch_axis=1`) | symmetric, dynamic |
| `lm_head` | BF16 | n/a | unquantized |

Accuracy

GSM8K is evaluated 8-shot on the full 1,319-question test split through vLLM's OpenAI-compatible chat API (`temperature=0`, `concurrency=16`, `max_tokens=512`, the standard chat template, and the final-answer format `#### <answer>`). The table is populated only when this card is regenerated with `EVAL_BASELINE_JSON` and/or `EVAL_QUANT_JSON` from `eval_gsm8k_chat_vllm.py`; until then, both rows read "not measured".

| Model | Scheme | Accuracy | Correct |
|---|---|---|---|
| `llmfan46/gemma-4-31B-it-uncensored-heretic` (BF16 baseline) | n/a | not measured by this script | n/a |
| This model (Quark W8A8 INT8) | per-channel weight + per-token act. | not measured by this script | n/a |
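For readers who want to fill in the table themselves, the scoring step for the `#### <answer>` format might look like the sketch below. The regular expression and function names are hypothetical and are not taken from `eval_gsm8k_chat_vllm.py`; the sketch assumes the last `####` marker in a completion carries the final numeric answer.

```python
import re
from typing import Optional, Sequence

# Matches "#### 1,234", "#### -3.5", "####18" and similar.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")

def extract_answer(text: str) -> Optional[str]:
    """Return the last '#### <number>' in the text with thousands separators removed."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def is_correct(prediction: str, reference: str) -> bool:
    """Compare the final answers numerically; a missing answer counts as wrong."""
    pred, gold = extract_answer(prediction), extract_answer(reference)
    if pred is None or gold is None:
        return False
    return float(pred) == float(gold)

def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of questions whose final answer matches the reference."""
    correct = sum(is_correct(p, g) for p, g in zip(predictions, references))
    return correct / len(references)
```

On the full test split, `accuracy` would be the number of correct answers divided by 1,319.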

How to Use

With vLLM (Recommended)

```bash
# Generic vLLM serve command. Use TP=4 for 4x24GB RDNA3 cards.
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
    --trust-remote-code

# On the local RDNA3/gfx1100 vLLM fork, use the optimized INT8 Triton path:
vllm serve valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
    --trust-remote-code \
    --linear-backend triton
```

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8",
    "messages": [{"role": "user", "content": "Hello! What is the capital of France?"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
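The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started by `vllm serve` is listening on `localhost:8000` as in the curl example, and the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8"

def build_chat_request(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Hello! What is the capital of France?")` returns the text of the first choice.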

Hardware Requirements

- Single GPU: roughly 48 GB of VRAM or more for typical short-context serving.
- 4x24 GB consumer RDNA3: use tensor parallelism, for example `--tensor-parallel-size 4`, and keep the multimodal limits at 0 for text-only serving, as in the serve commands above.
- Longer context or larger batches: increase tensor parallelism, or reduce `--max-model-len` and the batch settings.
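A back-of-the-envelope budget shows why these configurations fit. The sketch below assumes the ~33.3 GB of weights shard evenly across tensor-parallel ranks and that `--gpu-memory-utilization` caps the usable VRAM per GPU; it ignores runtime overhead, so the headroom figure is an upper bound on what remains for KV cache and activations. The function name is hypothetical.

```python
def per_gpu_budget_gb(weights_gb: float, tp_size: int,
                      vram_gb: float, utilization: float) -> tuple[float, float]:
    """Return (weights per GPU, headroom per GPU) in GB under even sharding."""
    weights_per_gpu = weights_gb / tp_size
    usable = vram_gb * utilization
    return weights_per_gpu, usable - weights_per_gpu

# 4x24 GB cards with --tensor-parallel-size 4 and --gpu-memory-utilization 0.9
weights, headroom = per_gpu_budget_gb(33.3, tp_size=4, vram_gb=24, utilization=0.9)
print(f"{weights:.1f} GB weights per GPU, {headroom:.1f} GB headroom")

# Single 48 GB card at the same utilization
weights_1, headroom_1 = per_gpu_budget_gb(33.3, tp_size=1, vram_gb=48, utilization=0.9)
print(f"{weights_1:.1f} GB weights, {headroom_1:.1f} GB headroom")
```

The single-GPU case leaves much less headroom, which is consistent with the recommendation to shorten the context or add tensor parallelism for heavier workloads.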

Quantization Details

This model was quantized with AMD Quark's per-token, per-channel (PTPC) INT8 scheme:

- Weight quantization: INT8 per-channel (one scale per output channel), symmetric, static.
- Activation quantization: INT8 per-token (one scale per token), symmetric, dynamic (computed at inference time).
- Excluded layers: `lm_head`, the token embeddings, the full `vision_tower`, and `embed_vision` remain in BF16.
- Export: `pack_method='order'`, `weight_format='real_quantized'`, and Quark metadata (`quant_method='quark'`). The checkpoint stores real INT8 weights, with scales in BF16 or FP32 depending on the loader/export path; no fake-quantized weights are stored.
- Quantization path: CPU file-to-file quantization, with no calibration data and without loading the full model graph.
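To make the scheme concrete, here is a small pure-Python sketch of symmetric max-abs INT8 quantization with one scale per weight row (output channel) and one scale per activation row (token), followed by an integer matmul and dequantization. It illustrates the general technique only; it is not Quark's implementation, and the function names are hypothetical.

```python
import math
import random

def quantize_rows(matrix):
    """Symmetric INT8: one scale per row (max-abs / 127), values rounded into [-127, 127]."""
    q_rows, scales = [], []
    for row in matrix:
        scale = max(max(abs(v) for v in row), 1e-8) / 127.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def w8a8_linear(x, w_q, w_scales):
    """y = x @ w.T with activations quantized per token at call time (dynamic)."""
    x_q, x_scales = quantize_rows(x)
    out = []
    for x_row, x_scale in zip(x_q, x_scales):
        # Integer dot product, then rescale by the token scale and channel scale.
        out.append([sum(a * b for a, b in zip(x_row, w_row)) * x_scale * w_scale
                    for w_row, w_scale in zip(w_q, w_scales)])
    return out

random.seed(0)
w = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]  # (out_features, in_features)
x = [[random.gauss(0, 1) for _ in range(32)] for _ in range(4)]  # (tokens, in_features)

w_q, w_scales = quantize_rows(w)        # static: computed once, stored with the checkpoint
y_q = w8a8_linear(x, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w] for x_row in x]

err = math.sqrt(sum((q - r) ** 2 for yq, yr in zip(y_q, y_ref) for q, r in zip(yq, yr)))
ref = math.sqrt(sum(r ** 2 for yr in y_ref for r in yr))
rel_err = err / ref
print(f"relative error vs. float matmul: {rel_err:.4f}")
```

Because the weight scales depend only on the weights, they can be computed without calibration data, while the activation scales are recomputed for every token at inference time.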

Reproduce Quantization

```bash
pip install amd-quark==0.11.1 huggingface_hub safetensors

HIP_VISIBLE_DEVICES="" ROCR_VISIBLE_DEVICES="" CUDA_VISIBLE_DEVICES="" \
MODEL_IN=llmfan46/gemma-4-31B-it-uncensored-heretic \
MODEL_OUT=./Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
HF_REPO_ID=valoomba/Gemma-4-31B-it-uncensored-heretic-Quark-W8A8-INT8 \
python quantize_gemma4_heretic_quark_w8a8_cpu.py
```

The script builds the Quark config explicitly:

```text
weight: INT8, per_channel, ch_axis=0, symmetric=True, is_dynamic=False
input:  INT8, per_channel, ch_axis=1, symmetric=True, is_dynamic=True
```

and then runs:

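```python
# `quantizer` is the Quark quantizer built from the config above; MODEL_IN and
# MODEL_OUT hold the values of the environment variables set in the command above.
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=MODEL_IN,
    save_path=MODEL_OUT,
    device="cpu",  # CPU file-to-file quantization
)
```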
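After the script finishes, one way to confirm that the output really stores INT8 weights is to read the JSON header of a safetensors shard and count tensors by dtype. This is a sketch based on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header); the function names are hypothetical and the shard path is whatever file lands in `MODEL_OUT`.

```python
import json
import struct
from collections import Counter

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a safetensors file; no tensor data is loaded."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header

def dtype_counts(path: str) -> Counter:
    """Count tensors per dtype string, e.g. 'I8' for INT8 and 'BF16' for bfloat16."""
    return Counter(info["dtype"] for info in read_safetensors_header(path).values())
```

For a real-quantized export, the quantized linear weights would be expected to show up as `I8`, with the excluded layers and scales appearing as `BF16` or `F32`.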

Citation

If you use this model, please cite the original Gemma 4 release and the upstream source model:

```bibtex
@misc{google2026gemma4,
  title  = {Gemma 4},
  author = {Google DeepMind},
  year   = {2026},
  url    = {https://huggingface.co/google/gemma-4-31B-it}
}
```

Upstream model: llmfan46/gemma-4-31B-it-uncensored-heretic.

License

This quantized derivative follows the license of the upstream model, whose Hugging Face metadata lists Apache 2.0 / Gemma 4 license terms. Check the upstream model card and any included LICENSE / NOTICE files for the complete terms.

Modified files include the INT8-quantized safetensors checkpoint and the appended Quark quantization_config block in config.json. No warranty of any kind is provided.

Model provider: valoomba

Model tree: base `llmfan46/gemma-4-31B-it-uncensored-heretic`; quantized: this model

Modalities: input Text, Image; output Text
