ForeverBlue

Qwen3-VL-2B-GRACE-W4G128-AWQ

License: apache-2.0

Model Summary

This checkpoint is intended for deployment and inference. Unlike the QAT training checkpoint, this version stores the model in an AWQ-packed INT4 format, suitable for efficient inference with AWQ-compatible runtimes.

Item	Description
Model	Qwen3-VL-2B-GRACE-W4G128-AWQ
Base architecture	Qwen3-VL-2B-Instruct
Task	Image-text-to-text / multimodal instruction following
Quantization	INT4 weight-only AWQ
Group size	128
Format	AWQ-packed Hugging Face checkpoint
Training framework	GRACE
Dataset	ShareGPT4V
License	Apache-2.0
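The quantization rows above imply a small per-group overhead on top of the 4-bit codes. As a rough sketch, assuming each group of 128 weights stores one FP16 scale and one 4-bit zero point (a common AWQ layout, not confirmed for this checkpoint), the effective storage cost per quantized weight can be estimated as follows. `effective_bits_per_weight` is a hypothetical helper name used only for illustration.

```python
def effective_bits_per_weight(weight_bits=4, group_size=128,
                              scale_bits=16, zero_point_bits=4):
    """Average bits stored per weight, including per-group metadata.

    The scale and zero-point widths are assumptions, not values read
    from this checkpoint.
    """
    overhead = (scale_bits + zero_point_bits) / group_size
    return weight_bits + overhead


bits = effective_bits_per_weight()
print(f"{bits:.5f} bits per weight")          # 4.15625
print(f"{16 / bits:.2f}x smaller than FP16")  # 3.85
```

The estimate covers quantized layers only; any module kept in higher precision adds to the total checkpoint size.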

About GRACE

GRACE is designed to compress vision-language models while preserving multimodal reasoning ability. The framework introduces three main components:

Confidence-Gated Decoupled Knowledge Distillation: GRACE uses teacher confidence to adaptively control the strength of distillation, reducing the influence of uncertain or noisy teacher predictions.
Relational CKA Alignment: GRACE aligns relational structures between teacher and student visual-token representations using centered kernel alignment.
Adaptive Information Bottleneck Controller: GRACE regulates representation compression through an adaptive information bottleneck objective, improving the balance between compactness and task performance.
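To make the centered kernel alignment component more concrete, here is a minimal pure-Python sketch of standard linear CKA between two sets of token features. It illustrates the general similarity measure only; it is not the GRACE loss, and the function names and toy inputs are assumptions made for this example.

```python
import math


def _center_columns(matrix):
    """Subtract the per-feature mean so every column sums to zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def _gram(matrix):
    """Token-by-token Gram matrix with a linear kernel."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def _frobenius_inner(a, b):
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def linear_cka(x, y):
    """Linear CKA between x (n tokens x d1) and y (n tokens x d2)."""
    k = _gram(_center_columns(x))
    l = _gram(_center_columns(y))
    return _frobenius_inner(k, l) / math.sqrt(
        _frobenius_inner(k, k) * _frobenius_inner(l, l)
    )


# Toy "teacher" features and a rotated, rescaled "student" copy.
teacher = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
student = [[-3.0 * b, 3.0 * a] for a, b in teacher]
print(linear_cka(teacher, student))
```

Because CKA compares Gram matrices rather than raw features, it is unchanged by rotations and uniform rescaling, and the teacher and student may have different feature widths.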

For full technical details, please refer to the paper:

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
arXiv:2601.22709

Checkpoint Format

This repository contains a packed AWQ deployment checkpoint.

Expected files include:

text
awq_quantized_modules.json
chat_template.jinja
config.json
generation_config.json
model.safetensors
processor_config.json
tokenizer.json
tokenizer_config.json
README.md

The important files are:

model.safetensors: AWQ-packed INT4 model weights
awq_quantized_modules.json: metadata for quantized modules
config.json: model architecture and quantization configuration
processor_config.json: multimodal processor configuration
tokenizer.json and tokenizer_config.json: tokenizer files
chat_template.jinja: chat template
generation_config.json: default generation configuration
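After downloading the repository, a quick sanity check against the file list above can catch incomplete downloads before loading. This is a small sketch using only the standard library; the helper name `missing_checkpoint_files` and the local directory path are hypothetical.

```python
from pathlib import Path

# File names taken from the "Expected files" list in this README.
EXPECTED_FILES = [
    "awq_quantized_modules.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_checkpoint_files(checkpoint_dir):
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Hypothetical local path; replace with wherever the repository was downloaded.
missing = missing_checkpoint_files("./Qwen3-VL-2B-GRACE-W4G128-AWQ")
print("missing files:", missing or "none")
```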

Difference from the QAT Checkpoint

We release two related W4G128 versions:

Repository	Format	Intended Use
`Qwen3-VL-2B-GRACE-W4G128`	GRACE QAT checkpoint	Research, further training, analysis
`Qwen3-VL-2B-GRACE-W4G128-AWQ`	AWQ-packed INT4 checkpoint	Deployment and efficient inference

The QAT checkpoint stores the quantization-aware trained weights together with the learned quantization information.
This AWQ version packs the weights into actual INT4 tensors for deployment.
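To illustrate what packing into INT4 tensors means, the sketch below quantizes one group of 128 weights to 4-bit codes with a shared scale and zero point, then packs eight codes into each 32-bit word. It is a simplified illustration under stated assumptions: real AWQ kernels use their own nibble ordering and tensor layout, and all function names here are hypothetical.

```python
GROUP_SIZE = 128  # matches the W4G128 naming


def quantize_group(weights):
    """Asymmetric 4-bit quantization of one group: (codes, scale, zero)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 15 or 1.0  # fall back to 1.0 for a constant group
    zero = max(0, min(15, round(-w_min / scale)))
    codes = [max(0, min(15, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def pack_int4(codes):
    """Pack eight 4-bit codes per 32-bit word, lowest nibble first (assumed order)."""
    assert len(codes) % 8 == 0
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            word |= (code & 0xF) << (4 * j)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4."""
    return [(word >> (4 * j)) & 0xF for word in words for j in range(8)]


def dequantize_group(codes, scale, zero):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(code - zero) * scale for code in codes]


weights = [(i - 64) / 64 for i in range(GROUP_SIZE)]
codes, scale, zero = quantize_group(weights)
packed = pack_int4(codes)
restored = dequantize_group(unpack_int4(packed), scale, zero)
print(len(packed), "words for", len(weights), "weights")
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

A QAT checkpoint would typically keep the weights in floating point alongside the quantization parameters, while a packed checkpoint stores only the integer words plus per-group scales and zero points.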

Installation

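Please use recent versions of transformers, accelerate, and safetensors. Qwen3-VL support was added in transformers 4.57.0, so older releases cannot load this architecture.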

bash
pip install -U transformers accelerate safetensors

For AWQ-compatible inference, install AutoAWQ:

bash
pip install autoawq

Depending on your CUDA / PyTorch environment, you may need to install a compatible AutoAWQ version manually.
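Since compatibility depends on the exact combination of installed packages, it can help to print their versions before debugging a loading error. The sketch below uses only the standard library; the package list is an assumption based on the install commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(
    packages=("torch", "transformers", "accelerate", "safetensors", "autoawq"),
):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name:>12}: {ver or 'not installed'}")
```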

Usage

Basic Loading

python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W4G128-AWQ"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

model.eval()

Image-Text Inference Example

python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W4G128-AWQ"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

image_path = "example.jpg"
image = Image.open(image_path).convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

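# generate() returns the prompt tokens followed by the new tokens;
# drop the prompt so only the model's answer is decoded.
prompt_length = inputs["input_ids"].shape[1]
generated_text = processor.batch_decode(
    generated_ids[:, prompt_length:],
    skip_special_tokens=True,
)

print(generated_text[0])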

Recommended Inference Settings

For deterministic evaluation:

python
generation_kwargs = {
    "max_new_tokens": 256,
    "do_sample": False,
}

For open-ended generation:

python
generation_kwargs = {
    "max_new_tokens": 512,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}
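The two settings above can be wrapped in a small selector and unpacked into `generate`. This is a minimal sketch: `build_generation_kwargs` is a hypothetical helper, and the commented call assumes `model` and `inputs` were prepared as in the inference example.

```python
def build_generation_kwargs(deterministic=True):
    """Return the recommended settings for evaluation or open-ended use."""
    if deterministic:
        # Greedy decoding gives repeatable outputs for benchmarking.
        return {"max_new_tokens": 256, "do_sample": False}
    # Nucleus sampling for more varied, open-ended responses.
    return {
        "max_new_tokens": 512,
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.9,
    }


generation_kwargs = build_generation_kwargs(deterministic=True)
print(generation_kwargs)

# With `model` and `inputs` from the inference example:
# generated_ids = model.generate(**inputs, **generation_kwargs)
```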

Evaluation

This checkpoint is part of the GRACE model family. The GRACE paper evaluates compressed VLMs on a broad set of multimodal benchmarks, including:

MMBench
SEED-Bench
ScienceQA
HallusionBench
AI2D
MMMU
MMStar

Please refer to the paper and the project repository for the full experimental setup and benchmark results.

Intended Use

This model is intended for research and deployment experiments in:

efficient vision-language models
multimodal instruction following
VLM quantization
knowledge distillation
low-bit inference
edge or resource-constrained multimodal AI systems

Limitations

This checkpoint inherits the limitations of the base Qwen3-VL model and the limitations of low-bit quantized VLMs.

Potential limitations include:

reduced accuracy compared with full-precision models on difficult reasoning tasks
sensitivity to prompt formatting and generation settings
possible hallucinations in visual question answering
possible OCR or fine-grained perception errors
possible bias inherited from the base model and training data
hardware and kernel compatibility issues depending on the AWQ runtime

Users should evaluate the model carefully before using it in high-stakes settings.

Ethical Considerations

This model should not be used for harmful, deceptive, or privacy-invasive applications.
The model may generate inaccurate or biased content, especially when used outside its intended research and deployment context.

For applications involving medical, legal, financial, or safety-critical decisions, human expert review is required.

License

This repository is released under the Apache-2.0 license.

Please also check the licenses and terms of use of the base model and datasets before downstream use.

Citation

If you find this model useful, please cite our paper:

bibtex
@article{chen2026grace,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}

You may also cite the AWQ paper if you use or discuss the AWQ deployment format:

bibtex
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2306.00978},
  year={2023}
}

Acknowledgements

This release builds on the Qwen-VL model family, the Hugging Face Transformers ecosystem, ShareGPT4V, and AWQ-compatible low-bit inference tooling.

We thank the open-source community for making efficient multimodal model research and deployment possible.
