
License: apache-2.0

Model Summary

This checkpoint is intended for deployment and inference. Unlike the QAT training checkpoint, this version stores the model in an AWQ-packed INT4 format, suitable for efficient inference with AWQ-compatible runtimes.

| Item | Description |
| --- | --- |
| Model | Qwen3-VL-2B-GRACE-W4G128-AWQ |
| Base architecture | Qwen3-VL-2B-Instruct |
| Task | Image-text-to-text / multimodal instruction following |
| Quantization | INT4 weight-only AWQ |
| Group size | 128 |
| Format | AWQ-packed Hugging Face checkpoint |
| Training framework | GRACE |
| Dataset | ShareGPT4V |
| License | Apache-2.0 |
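To make "INT4 weight-only, group size 128" concrete, the sketch below quantizes a row of weights in consecutive groups of 128 values, each group with its own scale and zero point. It is a generic asymmetric round-to-nearest scheme for illustration only: it does not include AWQ's activation-aware scaling or anything GRACE-specific, and the function names and the FP16-scale storage assumption in `bits_per_weight` are hypothetical.

```python
def quantize_group(weights, n_bits=4):
    """Asymmetric uniform quantization of one group of floats.

    Returns (codes, scale, zero_point) with codes in [0, 2**n_bits - 1].
    """
    q_max = (1 << n_bits) - 1
    w_min, w_max = min(weights), max(weights)
    # A constant group has zero range; fall back to a scale of 1.0.
    scale = (w_max - w_min) / q_max or 1.0
    # The zero point is itself stored in n_bits, so clamp it to the code range.
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=128, n_bits=4):
    """Quantize a weight row in consecutive groups of `group_size` values."""
    return [
        quantize_group(row[i:i + group_size], n_bits)
        for i in range(0, len(row), group_size)
    ]


def bits_per_weight(n_bits=4, group_size=128, scale_bits=16):
    """Average storage per weight, ASSUMING one FP16 scale and one
    n_bits zero point per group (an illustrative storage model)."""
    return (n_bits * group_size + scale_bits + n_bits) / group_size
```

Under that storage assumption, the per-group metadata adds (16 + 4) / 128 bits to each 4-bit weight, so `bits_per_weight()` evaluates to 4.15625. Smaller groups would track the weight distribution more closely at the cost of more metadata.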

About GRACE

GRACE is designed to compress vision-language models while preserving multimodal reasoning ability. The framework introduces three main components:

  1. Confidence-Gated Decoupled Knowledge Distillation
    GRACE uses teacher confidence to adaptively control the strength of distillation, reducing the influence of uncertain or noisy teacher predictions.

  2. Relational CKA Alignment
    GRACE aligns relational structures between teacher and student visual-token representations using centered kernel alignment.

  3. Adaptive Information Bottleneck Controller
    GRACE regulates representation compression through an adaptive information bottleneck objective, improving the balance between compactness and task performance.
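As a reference point for the second component, the following is a minimal sketch of linear centered kernel alignment (CKA) between two sets of token representations, written in plain Python. It shows the standard linear CKA similarity only; how GRACE builds its relational alignment loss from CKA is described in the paper and may differ from this sketch. The function names are hypothetical, and the input layout (rows are tokens, columns are features) is an assumption.

```python
import math


def center_columns(matrix):
    """Subtract the column mean from every entry (rows = tokens)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA in [0, 1]; x and y need the same number of rows (tokens)
    but may have different feature widths, e.g. teacher vs. student."""
    if len(x) != len(y):
        raise ValueError("x and y must describe the same tokens")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denom == 0.0:
        raise ValueError("CKA is undefined for constant representations")
    return cross_gram_sq_norm(xc, yc) / denom
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, which is why it can compare teacher and student representations of different widths. A training objective could, for example, minimize `1 - linear_cka(teacher, student)`, though that particular form is an assumption here.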

For full technical details, please refer to the paper:

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li
arXiv:2601.22709

Checkpoint Format

This repository contains a packed AWQ deployment checkpoint.

Expected files include:

```text
awq_quantized_modules.json
chat_template.jinja
config.json
generation_config.json
model.safetensors
processor_config.json
tokenizer.json
tokenizer_config.json
README.md
```
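After downloading the repository to a local directory, a quick check like the one below can confirm that nothing from the list above is missing before loading the model. This is a small convenience sketch; the helper name `missing_checkpoint_files` is hypothetical, and `README.md` is left out because it is not needed for inference.

```python
from pathlib import Path

# File names taken from the expected-files list above (README.md omitted).
EXPECTED_FILES = (
    "awq_quantized_modules.json",
    "chat_template.jinja",
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_checkpoint_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means all expected files are present; any returned names point to an incomplete download.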

The important files are:

  • model.safetensors: AWQ-packed INT4 model weights
  • awq_quantized_modules.json: metadata for quantized modules
  • config.json: model architecture and quantization configuration
  • processor_config.json: multimodal processor configuration
  • tokenizer.json and tokenizer_config.json: tokenizer files
  • chat_template.jinja: chat template
  • generation_config.json: default generation configuration
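To see how the weights are actually stored, the header of `model.safetensors` can be read without loading any tensor data: a safetensors file starts with an 8-byte little-endian header length followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. The sketch below counts tensors per dtype. The helper names are hypothetical, and which tensor names and dtypes appear in this particular checkpoint has not been verified here; AWQ-style checkpoints commonly store packed weights as 32-bit integers alongside FP16 scales.

```python
import json
import struct


def read_safetensors_header(path):
    """Parse only the JSON header of a safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header size.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize_dtypes(header):
    """Count tensors per dtype string, skipping the optional metadata entry."""
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts
```

For example, `summarize_dtypes(read_safetensors_header("model.safetensors"))` returns a dictionary mapping dtype names to tensor counts, which gives a quick view of how much of the model is stored in packed form.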

Difference from the QAT Checkpoint

We release two related W4G128 versions:

| Repository | Format | Intended Use |
| --- | --- | --- |
| Qwen3-VL-2B-GRACE-W4G128 | GRACE QAT checkpoint | Research, further training, analysis |
| Qwen3-VL-2B-GRACE-W4G128-AWQ | AWQ-packed INT4 checkpoint | Deployment and efficient inference |

The QAT checkpoint stores quantization-aware trained weights and learned quantization information.
This AWQ version is packed into real INT4 deployment tensors for inference.
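"Packed" here means that several 4-bit codes share one machine word instead of each occupying its own byte or float. The sketch below packs eight 4-bit codes into one 32-bit word, lowest nibble first, and unpacks them again. The sequential nibble order and the function names are assumptions chosen for clarity; real AWQ kernels may use a different, kernel-specific order, so this is not a decoder for the tensors in this repository.

```python
def pack_int4(codes):
    """Pack 4-bit codes (0..15) into 32-bit words, eight codes per word."""
    if len(codes) % 8:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for slot, code in enumerate(codes[i:i + 8]):
            if not 0 <= code <= 15:
                raise ValueError(f"code {code} does not fit in 4 bits")
            # Slot 0 occupies the lowest nibble, slot 7 the highest.
            word |= code << (4 * slot)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover eight 4-bit codes from each word."""
    return [(word >> (4 * slot)) & 0xF for word in words for slot in range(8)]


# Eight codes become a single word: 0x87654321.
packed = pack_int4([1, 2, 3, 4, 5, 6, 7, 8])
```

A group of 128 weights therefore occupies 16 such words, before counting the per-group scale and zero point.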

Installation

Please use recent versions of transformers, accelerate, and safetensors.

```bash
pip install -U transformers accelerate safetensors
```

For AWQ-compatible inference, install AutoAWQ:

```bash
pip install autoawq
```

Depending on your CUDA / PyTorch environment, you may need to install a compatible AutoAWQ version manually.
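When debugging compatibility problems, it helps to record which versions are actually installed. The sketch below uses only the standard library and reports `None` for packages that are not installed. The helper name and the default package list (the distribution names used in the install commands above, plus `torch`) are illustrative choices.

```python
from importlib import metadata


def installed_versions(
    packages=("torch", "transformers", "accelerate", "safetensors", "autoawq"),
):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes CUDA, PyTorch, and AutoAWQ mismatches easier to spot.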

Usage

Basic Loading

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W4G128-AWQ"

# The processor bundles the tokenizer and the image preprocessor.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# device_map="auto" lets accelerate place the weights on the available device(s).
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.eval()
```

Image-Text Inference Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W4G128-AWQ"

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

image_path = "example.jpg"
image = Image.open(image_path).convert("RGB")

# One user turn containing an image followed by a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Render the chat template to a prompt string; the image is passed separately.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

# generate() returns prompt + completion; drop the prompt tokens so that
# only the newly generated answer is decoded.
prompt_length = inputs["input_ids"].shape[1]
generated_text = processor.batch_decode(
    generated_ids[:, prompt_length:],
    skip_special_tokens=True,
)
print(generated_text[0])
```

Recommended Inference Settings

For deterministic evaluation:

```python
generation_kwargs = {
    "max_new_tokens": 256,
    "do_sample": False,
}
```

For open-ended generation:

```python
generation_kwargs = {
    "max_new_tokens": 512,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}
```
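If both settings are used in the same script, one option is to keep them in a single place and select by name. The sketch below wraps the two recommended configurations in a small helper; the function name and the mode names `"deterministic"` and `"open_ended"` are hypothetical.

```python
def generation_kwargs_for(mode):
    """Return a fresh copy of the recommended generation settings for a mode."""
    presets = {
        # Greedy decoding: repeatable outputs for benchmark-style evaluation.
        "deterministic": {"max_new_tokens": 256, "do_sample": False},
        # Nucleus sampling: more varied outputs for open-ended generation.
        "open_ended": {
            "max_new_tokens": 512,
            "do_sample": True,
            "temperature": 0.7,
            "top_p": 0.9,
        },
    }
    if mode not in presets:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(presets)}")
    return dict(presets[mode])
```

The result can be passed straight to generation, for example `model.generate(**inputs, **generation_kwargs_for("deterministic"))`. Returning a copy means callers can adjust values such as `max_new_tokens` without changing the presets.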

Evaluation

This checkpoint is part of the GRACE model family. The GRACE paper evaluates compressed VLMs on a broad set of multimodal benchmarks, including:

  • MMBench
  • SEED-Bench
  • ScienceQA
  • HallusionBench
  • AI2D
  • MMMU
  • MMStar

Please refer to the paper and the project repository for the full experimental setup and benchmark results.

Intended Use

This model is intended for research and deployment experiments in:

  • efficient vision-language models
  • multimodal instruction following
  • VLM quantization
  • knowledge distillation
  • low-bit inference
  • edge or resource-constrained multimodal AI systems

Limitations

This checkpoint inherits the limitations of the base Qwen3-VL model and the limitations of low-bit quantized VLMs.

Potential limitations include:

  • reduced accuracy compared with full-precision models on difficult reasoning tasks
  • sensitivity to prompt formatting and generation settings
  • possible hallucinations in visual question answering
  • possible OCR or fine-grained perception errors
  • possible bias inherited from the base model and training data
  • hardware and kernel compatibility issues depending on the AWQ runtime

Users should evaluate the model carefully before using it in high-stakes settings.

Ethical Considerations

This model should not be used for harmful, deceptive, or privacy-invasive applications.
The model may generate inaccurate or biased content, especially when used outside its intended research and deployment context.

For applications involving medical, legal, financial, or safety-critical decisions, human expert review is required.

License

This repository is released under the Apache-2.0 license.

Please also check the licenses and terms of use of the base model and datasets before downstream use.

Citation

If you find this model useful, please cite our paper:

```bibtex
@article{chen2026grace,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}
```

You may also cite the AWQ paper if you use or discuss the AWQ deployment format:

```bibtex
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2306.00978},
  year={2023}
}
```

Acknowledgements

This release builds on the Qwen-VL model family, the Hugging Face Transformers ecosystem, ShareGPT4V, and AWQ-compatible low-bit inference tooling.

We thank the open-source community for making efficient multimodal model research and deployment possible.

Model provider

ForeverBlue

Model tree

Base: Qwen/Qwen3-VL-2B-Instruct
Fine-tuned: this model

Modalities

Input: Text, Image
Output: Text
