q-future
Q-ReAlign-Lite-4B
License: apache-2.0
What this is
Q-ReAlign scores the perceptual quality / aesthetic appeal of an image or video the way Q-Align does: the model is asked to rate quality, and the probability mass it places on the discrete words excellent / good / fair / poor / bad is collapsed — via a fixed weighting [1.0, 0.75, 0.5, 0.25, 0.0] — into a single scalar in [0, 1].
Lite (4B) is the middle of three sizes (Mini 0.8B · Lite 4B · Pro 9B) and the recommended default: it matches or beats the original Q-Align in SRCC on all seven QA benchmarks (PLCC is slightly lower on KonIQ and LIVE) while staying easy to serve.
- Backbone: Qwen3.5-VL (`model_type: qwen3_5`), hybrid linear/full attention text tower + SigLIP-style vision encoder
- Tasks: IQA (image quality) · IAA (image aesthetics) · VQA (video quality), i.e. the unified ONE-ALIGN setting
- Training: full-parameter SFT in bf16 via ms-swift, vision tower + projector trainable
- Precision: bfloat16 · dtype `auto`
Results
Per-dataset SRCC / PLCC on seven QA benchmarks. Lite (4B) reaches avg SRCC 0.889 vs. Q-Align's 0.869.
| Model | KonIQ | SPAQ | KADID | AGI | LIVE | AVA | LSVQ | Avg. |
|---|---|---|---|---|---|---|---|---|
| Q-Align | 0.942 / 0.944 | 0.932 / 0.933 | 0.912 / 0.920 | 0.738 / 0.781 | 0.897 / 0.870 | 0.798 / 0.796 | 0.867 / 0.866 | 0.869 / 0.873 |
| Lite (4B) | 0.943 / 0.941 | 0.932 / 0.934 | 0.928 / 0.931 | 0.829 / 0.871 | 0.899 / 0.862 | 0.814 / 0.804 | 0.880 / 0.879 | 0.889 / 0.889 |
Each cell is SRCC / PLCC, on the full evaluation sets (KonIQ, SPAQ, KADID, AGI, LIVE, AVA, LSVQ).
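To compute the same two metrics on your own predictions, the following is a minimal stdlib-only sketch. It assumes plain lists of predicted scores and ground-truth MOS values; the function names (`pearson`, `average_ranks`, `spearman`) and the toy data are illustrative and not part of this repository. Some IQA evaluation protocols fit a nonlinear mapping before computing PLCC; this sketch does not, and the card does not state which protocol produced the table.

```python
import math

def pearson(x, y):
    """PLCC: linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """SRCC: Pearson correlation of the rank-transformed sequences."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data, made up for illustration (not taken from any benchmark).
predicted = [0.82, 0.41, 0.65, 0.12, 0.93]
mos = [4.1, 2.0, 3.6, 1.2, 4.0]
print(f"SRCC {spearman(predicted, mos):.3f} / PLCC {pearson(predicted, mos):.3f}")
```

SRCC only depends on the ordering of the scores, so it is unaffected by any monotonic rescaling of the model output; PLCC is sensitive to the shape of the mapping.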
Quick start
```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor  # transformers >= 5.2.0 for Qwen3.5 support

CKPT, IMAGE = "q-future/Q-ReAlign-Lite-4B", "photo.jpg"
LEVELS = ["excellent", "good", "fair", "poor", "bad"]
WEIGHTS = [1.0, 0.75, 0.5, 0.25, 0.0]

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(CKPT)
model = AutoModelForImageTextToText.from_pretrained(CKPT, dtype="auto").to(device).eval()

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "How would you rate the quality of this image?"},
]}]
# Append the answer stem so the next token is one of the level words.
text = processor.apply_chat_template(messages, add_generation_prompt=True) + "The quality of the image is"
inputs = processor(text=[text], images=[Image.open(IMAGE).convert("RGB")], return_tensors="pt").to(device)

# First token id of each level word, encoded with its leading space.
ids = [processor.tokenizer(" " + w, add_special_tokens=False).input_ids[0] for w in LEVELS]
with torch.no_grad():  # inference only, no gradients needed
    probs = model(**inputs).logits[0, -1, ids].softmax(-1)
score = (probs * torch.tensor(WEIGHTS, device=device)).sum().item()
print(f"quality score: {score:.4f}")  # 0 (worst) .. 1 (best)
```
The score is the expected value of the level weights under the model's next-token distribution over the five level words — no sampling, one forward pass.
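The collapse from five logits to one scalar is small enough to check by hand. The sketch below reproduces it in plain Python, assuming you already have the five raw logits for the level tokens in the order excellent, good, fair, poor, bad. The function name `score_from_logits` is illustrative and the logits are made up.

```python
import math

WEIGHTS = [1.0, 0.75, 0.5, 0.25, 0.0]  # excellent, good, fair, poor, bad

def score_from_logits(level_logits, weights=WEIGHTS):
    """Softmax over the five level logits, then the weighted sum."""
    peak = max(level_logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in level_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(p * w for p, w in zip(probs, weights))

# Made-up logits: equal mass on all five levels lands on the midpoint of the scale.
uniform = score_from_logits([0.0, 0.0, 0.0, 0.0, 0.0])
# Made-up logits: nearly all mass on "excellent" pushes the score towards 1.
confident = score_from_logits([12.0, 2.0, 0.0, -1.0, -3.0])
print(uniform, confident)
```

Because the softmax is taken over the five level tokens only, probability mass the model places on any other token is ignored rather than counted against the score.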
Aesthetics or video
Swap the prompt for the task:
- Aesthetics (IAA): "How would you rate the aesthetics of this image?" → stem "The aesthetics of the image is"
- Video (VQA): sample N frames (default 8) and pass them as the image sequence; prompt "How would you rate the quality of this video?" → stem "The quality of the video is"
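The card does not specify how the N frames are chosen or how the message is laid out for several images. The sketch below assumes evenly spaced sampling and one `{"type": "image"}` placeholder per frame; both are assumptions, and the helper names (`sample_frame_indices`, `build_messages`) are hypothetical. Decoding the frames themselves is left out.

```python
PROMPTS = {
    # task: (user prompt, answer stem), copied from this card
    "iqa": ("How would you rate the quality of this image?", "The quality of the image is"),
    "iaa": ("How would you rate the aesthetics of this image?", "The aesthetics of the image is"),
    "vqa": ("How would you rate the quality of this video?", "The quality of the video is"),
}

def sample_frame_indices(total_frames, n=8):
    """Evenly spaced frame indices (assumption: the card does not say how frames are picked)."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    n = min(n, total_frames)
    # Take the centre of each of n equal-length segments.
    return [int((i + 0.5) * total_frames / n) for i in range(n)]

def build_messages(task, num_images=1):
    """Chat messages with one image placeholder per frame (assumed layout), plus the answer stem."""
    prompt, stem = PROMPTS[task]
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}], stem

indices = sample_frame_indices(total_frames=240, n=8)
messages, stem = build_messages("vqa", num_images=len(indices))
print(indices)
```

The decoded frames at those indices would then be passed to the processor as `images=[...]` in the same order, with `stem` appended after the chat template as in the quick start.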
Model details
| Property | Lite (4B) |
|---|---|
| Architecture | Qwen3_5ForConditionalGeneration |
| Text hidden size | 2560 |
| Text layers | 32 (linear attention with full-attention every 4th layer) |
| Vision encoder depth | 24, hidden 1024, patch 16, spatial merge 2 |
| Vocab | 248320 |
| Context length | up to 262144 |
| Tied embeddings | yes |
| Tensor dtype | bfloat16 |
| Shards | 3 × safetensors (~10.3 GB total) |
Scoring contract
- Level vocabulary: `excellent, good, fair, poor, bad`
- Weights: `[1.0, 0.75, 0.5, 0.25, 0.0]`
- Output: scalar in `[0, 1]`, higher = better
- The five level tokens are matched with a leading space (`" excellent"`, …); keep that when porting to other tokenizers.
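When porting the contract to another tokenizer, it is worth checking that each level word, with its leading space, starts with a distinct token, since the quick start reads only the first token id of each word. The sketch below takes any `encode` callable that maps a string to a list of token ids. The helper name `level_token_ids` is hypothetical, and `toy_encode` is a made-up stand-in, not a real tokenizer.

```python
LEVELS = ["excellent", "good", "fair", "poor", "bad"]

def level_token_ids(encode, levels=LEVELS):
    """First token id of each ' <level>' string; fails if two levels collide."""
    ids = []
    for word in levels:
        tokens = encode(" " + word)  # keep the leading space
        if not tokens:
            raise ValueError(f"tokenizer produced no tokens for {word!r}")
        if len(tokens) > 1:
            print(f"note: {word!r} spans {len(tokens)} tokens; only the first is scored")
        ids.append(tokens[0])
    if len(set(ids)) != len(ids):
        raise ValueError(f"level words share a first token: {ids}")
    return ids

# Toy stand-in for a tokenizer (hypothetical vocabulary, for illustration only).
TOY_VOCAB = {" excellent": 11, " good": 12, " fair": 13, " poor": 14, " bad": 15}

def toy_encode(text):
    return [TOY_VOCAB[text]]

toy_ids = level_token_ids(toy_encode)
print(toy_ids)
```

With a Hugging Face tokenizer, `encode` could be `lambda s: tokenizer(s, add_special_tokens=False).input_ids`, which mirrors the call used in the quick start.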
Intended use & limitations
- Use: no-reference image/video quality assessment, aesthetic scoring, dataset curation, ranking and filtering generated media, reward signals for generative pipelines.
- Out of scope: safety/content moderation, factual or identity judgments, medical/forensic grading. Quality is perceptual and dataset-conditioned.
- Scores are calibrated to the training MOS distribution; absolute values are most meaningful relative to one another. Re-calibrate before mixing with other scales.
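One simple way to re-calibrate is a least-squares linear fit from model scores to the target scale on a small labeled calibration set. The card does not prescribe a method; the linear form, the names `fit_linear_map` and `recalibrate`, and the numbers below are all assumptions for illustration.

```python
def fit_linear_map(scores, targets):
    """Least-squares slope and intercept mapping model scores onto a target scale."""
    n = len(scores)
    if n < 2 or n != len(targets):
        raise ValueError("need at least two paired samples")
    ms, mt = sum(scores) / n, sum(targets) / n
    var = sum((s - ms) ** 2 for s in scores)
    if var == 0:
        raise ValueError("scores are constant; cannot fit a slope")
    slope = sum((s - ms) * (t - mt) for s, t in zip(scores, targets)) / var
    return slope, mt - slope * ms

# Made-up calibration pairs: model scores in [0, 1] vs. MOS on a 1-5 scale.
model_scores = [0.10, 0.35, 0.50, 0.72, 0.90]
target_mos = [1.4, 2.4, 3.0, 3.9, 4.6]
slope, intercept = fit_linear_map(model_scores, target_mos)

def recalibrate(score):
    """Map a raw model score onto the target scale using the fitted line."""
    return slope * score + intercept

print(f"slope={slope:.3f} intercept={intercept:.3f} -> {recalibrate(0.6):.2f}")
```

A linear map with positive slope preserves the ranking of items, so it changes absolute values without affecting rank-based comparisons.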
Acknowledgements & citation
Built on the shoulders of Q-Align (the discrete text-defined-levels method and ONE-ALIGN), ms-swift (training/inference backbone), and Qwen3.5-VL (the vision-language backbone). If you use this model, please also cite the originals:
```bibtex
@inproceedings{wu2024qalign,
  title     = {Q-Align: Teaching {LMM}s for Visual Scoring via Discrete Text-Defined Levels},
  author    = {Wu, Haoning and Zhang, Zicheng and Zhang, Weixia and Chen, Chaofeng and
               Liao, Liang and Li, Chunyi and Gao, Yixuan and Wang, Annan and Zhang, Erli and
               Sun, Wenxiu and Yan, Qiong and Min, Xiongkuo and Zhai, Guangtao and Lin, Weisi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning (ICML)},
  year      = {2024}
}

@inproceedings{swift2025,
  title     = {{SWIFT}: A Scalable Lightweight Infrastructure for Fine-Tuning},
  author    = {ModelScope Team},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  year      = {2025},
  note      = {\url{https://github.com/modelscope/ms-swift}}
}

@misc{qwen3_5,
  title        = {Qwen3.5: Towards Native Multimodal Agents},
  author       = {Qwen Team},
  year         = {2025},
  howpublished = {\url{https://github.com/QwenLM/Qwen3-VL}}
}
```