License: apache-2.0

Model Summary
This checkpoint is intended for deployment and inference. Unlike the QAT training checkpoint, this version stores the model in an AWQ-packed INT4 format, suitable for efficient inference with AWQ-compatible runtimes.
| Item | Description |
|---|---|
| Model | Qwen3-VL-2B-GRACE-W4G128-AWQ |
| Base architecture | Qwen3-VL-2B-Instruct |
| Task | Image-text-to-text / multimodal instruction following |
| Quantization | INT4 weight-only AWQ |
| Group size | 128 |
| Format | AWQ-packed Hugging Face checkpoint |
| Training framework | GRACE |
| Dataset | ShareGPT4V |
| License | Apache-2.0 |
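To make the "INT4 weight-only" and "Group size 128" rows concrete, the following is a minimal sketch of group-wise asymmetric 4-bit quantization in plain Python. It is an illustration only, not the GRACE or AWQ implementation: real AWQ additionally applies activation-aware scaling before quantizing, and the storage cost assumed in `effective_bits_per_weight` (one 16-bit scale and one 4-bit zero point per group) is an assumption rather than something stated in this card. All function names are hypothetical.

```python
from typing import List, Tuple

GROUP_SIZE = 128  # group size listed in the table above
QMAX = 15         # largest unsigned 4-bit code


def quantize_group(weights: List[float]) -> Tuple[List[int], float, int]:
    """Quantize one group of weights to 4-bit codes with a shared scale and zero point."""
    # Extend the range to include zero so the zero point always fits in [0, 15].
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / QMAX
    if scale == 0.0:
        scale = 1.0  # all-zero group: every code is 0 regardless of scale
    zero_point = min(QMAX, max(0, round(-w_min / scale)))
    codes = [min(QMAX, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def effective_bits_per_weight(group_size: int = GROUP_SIZE,
                              scale_bits: int = 16,
                              zero_bits: int = 4) -> float:
    """Average storage per weight, assuming one scale and one zero point per group."""
    return 4 + (scale_bits + zero_bits) / group_size
```

Under these assumptions, each weight costs slightly more than 4 bits on average because every group of 128 weights carries its own scale and zero point.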
About GRACE
GRACE is designed to compress vision-language models while preserving multimodal reasoning ability. The framework introduces three main components:
- Confidence-Gated Decoupled Knowledge Distillation: GRACE uses teacher confidence to adaptively control the strength of distillation, reducing the influence of uncertain or noisy teacher predictions.
- Relational CKA Alignment: GRACE aligns relational structures between teacher and student visual-token representations using centered kernel alignment.
- Adaptive Information Bottleneck Controller: GRACE regulates representation compression through an adaptive information bottleneck objective, improving the balance between compactness and task performance.
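For readers unfamiliar with centered kernel alignment, the sketch below computes standard linear CKA between two sets of token features in plain Python. It shows only the generic similarity measure; how GRACE turns it into a relational alignment loss is described in the paper and is not reproduced here. The function names and the list-of-rows matrix layout (one row per token) are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are feature dimensions


def _center_columns(x: Matrix) -> Matrix:
    """Subtract the per-column mean so each feature has zero mean over tokens."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def _gram(x: Matrix) -> Matrix:
    """Token-by-token Gram matrix X X^T (linear kernel)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]


def _frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb))


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between features x and y computed over the same tokens.

    x and y must have the same number of rows, but may have different widths,
    which is the usual situation for teacher and student hidden sizes.
    """
    k = _gram(_center_columns(x))
    l = _gram(_center_columns(y))
    return _frobenius_inner(k, l) / math.sqrt(
        _frobenius_inner(k, k) * _frobenius_inner(l, l)
    )
```

Linear CKA lies between 0 and 1 and is unchanged by uniform rescaling of either feature set, which is one reason it is commonly used to compare representations of different widths.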
For full technical details, please refer to the paper:
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
arXiv:2601.22709
Checkpoint Format
This repository contains a packed AWQ deployment checkpoint.
Expected files include:
```text
awq_quantized_modules.json
chat_template.jinja
config.json
generation_config.json
model.safetensors
processor_config.json
tokenizer.json
tokenizer_config.json
README.md
```
The important files are:
- model.safetensors: AWQ-packed INT4 model weights
- awq_quantized_modules.json: metadata for quantized modules
- config.json: model architecture and quantization configuration
- processor_config.json: multimodal processor configuration
- tokenizer.json and tokenizer_config.json: tokenizer files
- chat_template.jinja: chat template
- generation_config.json: default generation configuration
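After downloading the repository, it can be useful to confirm that every expected file is present before loading the model. The following is a small sketch using only the standard library; the helper name `missing_files` and the idea of checking a local directory are illustrative assumptions, not part of the release.

```python
from pathlib import Path
from typing import List

# File names copied from the expected-files list in this section.
EXPECTED_FILES = [
    "awq_quantized_modules.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "README.md",
]


def missing_files(checkpoint_dir: str) -> List[str]:
    """Return the expected files that are absent from a local checkpoint directory."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means the local copy contains every file listed above; it does not validate the file contents.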
Difference from the QAT Checkpoint
We release two related W4G128 versions:
| Repository | Format | Intended Use |
|---|---|---|
| Qwen3-VL-2B-GRACE-W4G128 | GRACE QAT checkpoint | Research, further training, analysis |
| Qwen3-VL-2B-GRACE-W4G128-AWQ | AWQ-packed INT4 checkpoint | Deployment and efficient inference |
The QAT checkpoint stores quantization-aware trained weights and learned quantization information.
This AWQ version is packed into real INT4 deployment tensors for inference.
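As a rough picture of what "packed into real INT4 deployment tensors" means, the sketch below stores eight 4-bit codes in one 32-bit word and recovers them again. The sequential nibble order used here is an assumption chosen for readability; AWQ runtimes define their own packing layout, so this code is not compatible with the tensors in `model.safetensors`.

```python
from typing import List


def pack_int4(codes: List[int]) -> List[int]:
    """Pack 4-bit codes (0..15) into 32-bit words, eight codes per word."""
    if len(codes) % 8 != 0:
        raise ValueError("number of codes must be a multiple of 8")
    packed = []
    for start in range(0, len(codes), 8):
        word = 0
        for position, code in enumerate(codes[start:start + 8]):
            if not 0 <= code <= 15:
                raise ValueError(f"code {code} does not fit in 4 bits")
            # Code at position j occupies bits 4*j .. 4*j + 3 of the word.
            word |= code << (4 * position)
        packed.append(word)
    return packed


def unpack_int4(words: List[int]) -> List[int]:
    """Inverse of pack_int4: extract eight 4-bit codes from each 32-bit word."""
    return [(word >> (4 * position)) & 0xF for word in words for position in range(8)]
```

Packing is lossless with respect to the 4-bit codes; the accuracy cost of INT4 comes from the quantization step, not from the packing.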
Installation
Please use recent versions of transformers, accelerate, and safetensors.
```bash
pip install -U transformers accelerate safetensors
```
For AWQ-compatible inference, install AutoAWQ:
```bash
pip install autoawq
```
Depending on your CUDA / PyTorch environment, you may need to install a compatible AutoAWQ version manually.
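Because compatibility depends on the installed package versions, a quick way to see what is in the current environment is to query package metadata. This sketch uses only the standard library; the helper name is hypothetical, and the distribution names are taken from the install commands above (plus `torch`, assumed to be the PyTorch distribution name).

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["torch", "transformers", "accelerate", "safetensors", "autoawq"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```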
Usage
Basic Loading
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W4G128-AWQ"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.eval()
```
Image-Text Inference Example
```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W4G128-AWQ"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Load the input image as RGB.
image_path = "example.jpg"
image = Image.open(image_path).convert("RGB")

# A single-turn conversation with one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Render the conversation into the prompt string expected by the model.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize the prompt, preprocess the image, and move tensors to the model device.
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)

# Greedy decoding without gradient tracking.
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# The decoded string contains the prompt text followed by the model's answer.
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_text[0])
```
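In the example above, the sequences returned by `generate` start with the prompt tokens, so the decoded text repeats the prompt before the answer. If only the newly generated text is wanted, one common approach is to drop the prompt tokens before decoding. The helper below is a sketch of that idea on plain Python lists; its name is hypothetical and it is not part of the original example.

```python
from typing import List, Sequence


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    generated_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Remove the leading prompt tokens from each generated sequence.

    Assumes each generated sequence begins with its full (padded) prompt,
    which is how decoder-only generation returns its output.
    """
    return [
        list(output[len(prompt):])
        for prompt, output in zip(input_ids, generated_ids)
    ]
```

With the variables from the inference example, this could be applied as `trim_prompt_tokens(inputs["input_ids"].tolist(), generated_ids.tolist())`, and the result passed to `processor.batch_decode(..., skip_special_tokens=True)`.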
Recommended Inference Settings
For deterministic evaluation:
```python
generation_kwargs = {
    "max_new_tokens": 256,
    "do_sample": False,
}
```
For open-ended generation:
```python
generation_kwargs = {
    "max_new_tokens": 512,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}
```
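One way to keep both presets in a single place is a small selector function, sketched below. The function name is hypothetical; the values are exactly the two recommended settings from this section.

```python
from typing import Any, Dict


def select_generation_kwargs(deterministic: bool = True) -> Dict[str, Any]:
    """Return the recommended generation settings for evaluation or open-ended use."""
    if deterministic:
        # Greedy decoding: the same inputs give the same output tokens.
        return {"max_new_tokens": 256, "do_sample": False}
    # Nucleus sampling for more varied, open-ended responses.
    return {
        "max_new_tokens": 512,
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.9,
    }
```

The returned dictionary can be unpacked into the generation call from the inference example, for instance `model.generate(**inputs, **select_generation_kwargs(deterministic=False))`.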
Evaluation
This checkpoint is part of the GRACE model family. The GRACE paper evaluates compressed VLMs on a broad set of multimodal benchmarks, including:
- MMBench
- SEED-Bench
- ScienceQA
- HallusionBench
- AI2D
- MMMU
- MMStar
Please refer to the paper and the project repository for the full experimental setup and benchmark results.
Intended Use
This model is intended for research and deployment experiments in:
- efficient vision-language models
- multimodal instruction following
- VLM quantization
- knowledge distillation
- low-bit inference
- edge or resource-constrained multimodal AI systems
Limitations
This checkpoint inherits the limitations of the base Qwen3-VL model and the limitations of low-bit quantized VLMs.
Potential limitations include:
- reduced accuracy compared with full-precision models on difficult reasoning tasks
- sensitivity to prompt formatting and generation settings
- possible hallucinations in visual question answering
- possible OCR or fine-grained perception errors
- possible bias inherited from the base model and training data
- hardware and kernel compatibility issues depending on the AWQ runtime
Users should evaluate the model carefully before using it in high-stakes settings.
Ethical Considerations
This model should not be used for harmful, deceptive, or privacy-invasive applications.
The model may generate inaccurate or biased content, especially when used outside its intended research and deployment context.
For applications involving medical, legal, financial, or safety-critical decisions, human expert review is required.
License
This repository is released under the Apache-2.0 license.
Please also check the licenses and terms of use of the base model and datasets before downstream use.
Citation
If you find this model useful, please cite our paper:
```bibtex
@article{chen2026grace,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}
```
You may also cite the AWQ paper if you use or discuss the AWQ deployment format:
```bibtex
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2306.00978},
  year={2023}
}
```
Acknowledgements
This release builds on the Qwen-VL model family, the Hugging Face Transformers ecosystem, ShareGPT4V, and AWQ-compatible low-bit inference tooling.
We thank the open-source community for making efficient multimodal model research and deployment possible.