Model Details
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Method: GRACE
- Quantization: W8G128 group-wise INT8 QAT
- Training data: ShareGPT4V
- Training / evaluation protocol: LLaVA-style multimodal evaluation
- Library: Hugging Face Transformers
- Repository: ForeverBlue/Qwen3-VL-2B-GRACE-W8G128
📊 Results
Comparison on 7 VLM benchmarks. The 8B model is the distillation teacher
(reference upper bound); all GRACE-Qwen3 variants are 2B students. Best result
among the 2B Qwen3-VL models is in bold.
We release GRACE on Qwen3-VL here because it is the most current backbone and
gives a fairer, up-to-date point of comparison, with the vanilla
Qwen3-VL-2B-Instruct as the baseline. The paper itself reports GRACE on
LLaVA-1.5 and Qwen2-VL; we additionally release the LLaVA-1.5 W4G128 INT4
checkpoint from the paper in the model zoo below.
| Model | Params | Precision | HallB | MMBench | ScienceQA | AI2D | MMMU | SEED | MMStar | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B (teacher, ref.) | 8B | BF16 | 61.1 | 84.5 | 85.0 | 85.7 | 69.6 | 77.5 | 70.9 | 76.3 |
| Qwen3-VL-2B (baseline) | 2B | BF16 | 51.4 | 78.4 | 81.4 | 76.9 | 53.4 | 71.2 | 58.3 | 67.3 |
GRACE lifts the Qwen3-VL-2B baseline by 9.4 points on average (67.3 to 76.7)
and matches or slightly exceeds the 8B teacher's average (76.7 vs. 76.3) with
roughly a quarter of the parameters. The W8G128 INT8 model retains 99% of the
BF16 average.
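The Avg column appears to be the unweighted mean of the seven benchmark scores. That definition is an assumption, but it reproduces both rows shown in the table. A minimal Python sketch that recomputes it (the dictionary and helper names are illustrative and not part of the GRACE codebase):

```python
# Scores copied from the results table, in column order:
# HallB, MMBench, ScienceQA, AI2D, MMMU, SEED, MMStar.
scores = {
    "Qwen3-VL-8B (teacher)": [61.1, 84.5, 85.0, 85.7, 69.6, 77.5, 70.9],
    "Qwen3-VL-2B (baseline)": [51.4, 78.4, 81.4, 76.9, 53.4, 71.2, 58.3],
}


def benchmark_average(values):
    """Unweighted mean rounded to one decimal (assumed definition of Avg)."""
    return round(sum(values) / len(values), 1)


averages = {name: benchmark_average(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")

# The GRACE average (76.7) is quoted from the text, not recomputed here.
grace_avg = 76.7
gain = round(grace_avg - averages["Qwen3-VL-2B (baseline)"], 1)
print(f"gain over baseline: {gain}")
```

The same helper can be applied to any other row once its per-benchmark scores are available.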
🤗 Model Zoo
| Model | Backbone | Bits | Group | Checkpoint description | HF Hub |
|---|---|---|---|---|---|
| Qwen3-VL-2B-GRACE-BF16 | Qwen3-VL-2B | bf16 | — | Full-precision GRACE checkpoint; used as the student initialization for the W8/W4 Qwen3-VL runs. | FoeverBLUE/Qwen3-VL-2B-GRACE-BF16 |
| Qwen3-VL-2B-GRACE-W8G128 | Qwen3-VL-2B | int8 | 128 | INT8 QAT checkpoint with group size 128; high-retention quantized Qwen3-VL student. |
The BF16 Qwen3-VL checkpoint is the full-precision GRACE student used as the
initial student weights for the W8 and W4 Qwen3-VL runs. The LLaVA-1.5 W4G128
checkpoint corresponds to the paper setting and includes GRACE-specific QAT
quantized weights for reproducing the INT4 LLaVA experiments.
Intended Use
This model is intended for research purposes, including:
- Efficient vision-language models
- Quantization-aware training
- Low-bit multimodal model deployment
- Knowledge distillation for VLM compression
- Multimodal model efficiency studies
Out-of-Scope Use
This checkpoint is not intended for:
- Safety-critical deployment
- Medical / legal / financial decision-making
- Production systems requiring reliability guarantees
Like other VLMs, the model may generate hallucinated, biased, or incorrect outputs.
Training Data
The model was trained using ShareGPT4V multimodal instruction data under a LLaVA-style multimodal fine-tuning pipeline.
Dataset: ShareGPT4V
Quantization Details
This checkpoint uses quantization-aware training (QAT) with group-wise W8G128 quantization.
Configuration:
- Weight precision: INT8
- Group size: 128
- Quantization scheme: Group-wise QAT
- Method: GRACE
- Backbone: Qwen3-VL-2B-Instruct
Depending on the inference backend, specialized quantized kernels or custom loading logic may be required to obtain real INT8 deployment benefits.
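To make the W8G128 notation concrete: every run of 128 consecutive weights shares one scale, and each weight is stored as an 8-bit integer code. The sketch below assumes symmetric round-to-nearest quantization with a max-abs scale per group. The actual GRACE quantizer, and how QAT updates the scales during training, may differ. The function names are illustrative.

```python
import math


def quantize_groupwise(weights, bits=8, group_size=128):
    """Quantize a flat list of floats to signed integer codes, one scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError("number of weights must be a multiple of group_size")
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(x) for x in group)
        # An all-zero group gets a dummy scale to avoid division by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(x / scale))) for x in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=128):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Toy row of 256 weights, i.e. two groups with different magnitudes.
weights = [math.sin(i) * (1 + i // 128) for i in range(256)]
codes, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(scales)}, max abs error: {max_error:.6f}")
```

Because each group has its own scale, a group with small weights is not forced onto the coarse grid of a group with large weights. The rounding error of any weight is bounded by half of its group's scale.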
Repository Files
This repository may contain:
- model.safetensors / model-*.safetensors: model weights
- qat_quantized_weights.bin: QAT quantized weight artifact
- config.json: model configuration
- generation_config.json: generation configuration
- tokenizer files
- processor / preprocessing configuration files
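After downloading the repository, it can help to confirm which of the files listed above are actually present before attempting to load anything. The following is a small sketch using only the Python standard library. The label names and the helper are hypothetical, and the patterns simply mirror the list above.

```python
from pathlib import Path

# Glob patterns mirroring the file list above; the labels are illustrative.
EXPECTED_PATTERNS = {
    "weights": ["model.safetensors", "model-*.safetensors"],
    "qat_weights": ["qat_quantized_weights.bin"],
    "config": ["config.json"],
    "generation_config": ["generation_config.json"],
}


def inspect_checkpoint_dir(directory):
    """Map each expected artifact to the matching file names found on disk."""
    root = Path(directory)
    found = {}
    for label, patterns in EXPECTED_PATTERNS.items():
        matches = []
        for pattern in patterns:
            matches.extend(sorted(p.name for p in root.glob(pattern)))
        found[label] = matches
    return found


if __name__ == "__main__":
    # Inspect the current directory; point this at the local checkpoint folder.
    for label, files in inspect_checkpoint_dir(".").items():
        print(label, files or "missing")
```

An empty list for qat_weights means the checkpoint directory lacks the QAT artifact described in the notes on custom loading.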
Loading
Please use a Qwen3-VL-compatible Transformers environment or the official Qwen3-VL codebase.
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

repo_id = "ForeverBlue/Qwen3-VL-2B-GRACE-W8G128"

# Load the processor (tokenizer plus image preprocessing) from the Hub.
processor = AutoProcessor.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

# device_map="auto" places weights on the available devices
# (requires the accelerate package).
model = AutoModelForImageTextToText.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto",
)
```
Recommended:
- Recent transformers version
- Qwen3-VL compatible environment
- CUDA GPU inference backend for large-scale evaluation
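What counts as a "recent" transformers version is not stated here. The check below assumes 4.57.0 as the minimum, on the understanding that this is the first release with Qwen3-VL support. Treat that number as an assumption and adjust it if the GRACE repository pins a different version. The helper names are illustrative.

```python
import re
from importlib import metadata

# Assumed minimum version; not taken from this model card.
MIN_TRANSFORMERS = "4.57.0"


def version_tuple(version):
    """Turn '4.57.0.dev0' into (4, 57, 0); trailing non-numeric parts are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_at_least(package, minimum):
    """Return True if the package is installed at or above the minimum version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    ok = installed_at_least("transformers", MIN_TRANSFORMERS)
    print("transformers is recent enough" if ok else "upgrade transformers")
```

The comparison deliberately ignores pre-release suffixes, so a development build such as 4.57.0.dev0 is treated like 4.57.0.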
Evaluation
The checkpoint follows a LLaVA-style multimodal evaluation protocol.
Representative evaluation may include benchmarks such as:
- HallusionBench
- MMBench
- ScienceQA
- AI2D
- MMMU
- SEED-Bench
- MMStar
Please refer to the associated GRACE paper and the results table above for detailed evaluation settings and results.
Important Notes
This checkpoint includes QAT-specific quantized weights in qat_quantized_weights.bin. Depending on the inference codebase, additional GRACE-specific quantization-aware loading logic may be required.
The standard from_pretrained call may load the model configuration and checkpoint files, but fully reproducing the intended INT8 QAT behavior may require the GRACE repository:
https://github.com/ForeverBlue816/GRACE
Limitations
- This model is released for research purposes.
- The quantized checkpoint may require custom loading logic for QAT-specific weights.
- Performance may vary depending on the evaluation codebase, preprocessing, generation parameters, and multimodal benchmark implementation.
- Users should follow the license and usage restrictions of the original Qwen3-VL-2B-Instruct base model.
- Specialized kernels or custom loading code may be required to realize practical INT8 speed or memory benefits.
Citation
If you use this model, please cite:
```bibtex
@article{chen2026gated,
  title={Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs},
  author={Chen, Yanlong and Habibian, Amirhossein and Benini, Luca and Li, Yawei},
  journal={arXiv preprint arXiv:2601.22709},
  year={2026}
}
```
Please also cite the original Qwen3-VL work when using this model.
License
Released under the MIT license.
Users should additionally comply with:
- Qwen3-VL base model license
- ShareGPT4V dataset terms
- applicable downstream usage restrictions