ByteDance

EvoQuality

License: apache-2.0

1. Model Overview

Model Name: EvoQuality (Self-Evolving VLM for Image Quality Assessment)
Task: No-Reference Image Quality Assessment (NR-IQA), supporting both single-image quality scoring and pairwise quality comparison (ranking)
Core Idea: EvoQuality uses no human-annotated quality scores or distortion-type labels. It generates pseudo-ranking labels via pairwise majority voting, converts them into a reward signal, and optimizes that reward with GRPO to iteratively self-evolve its quality perception capability
Paper: Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking (ICLR 2026, arXiv:2509.25787)

2. Model and Framework Details

Backbone Model (paper setting): Qwen2.5-VL-7B (used as the baseline policy)
Training Paradigm: Two-stage cycle, supports multi-round iteration (T=2 in the paper)
- Offline Stage (Pseudo-label): Perform K comparisons on randomly sampled image pairs, then derive pseudo-preferences p*(xi, xj) via majority voting
- Online Stage (RL): Convert pseudo-preferences into a fidelity reward and update the policy via Group Relative Policy Optimization (GRPO) (full fine-tuning of the VLM)
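The conversion from pseudo-preference to fidelity reward can be sketched as follows. This card does not spell out the reward formula, so the sketch assumes the standard fidelity measure between two Bernoulli preferences and a Thurstone-style Gaussian comparison of two predicted scores; `thurstone_prob`, `fidelity_reward`, and the use of per-image score mean and variance are illustrative assumptions, not the paper's confirmed implementation.

```python
import math


def thurstone_prob(mu_i, var_i, mu_j, var_j, eps=1e-6):
    """Assumed model: probability that image i beats image j, given the
    mean and variance of the scores sampled for each image."""
    z = (mu_i - mu_j) / math.sqrt(var_i + var_j + eps)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fidelity_reward(p_pseudo, p_model):
    """Fidelity between the pseudo-preference p*(xi, xj) and the
    model-implied preference. Both inputs are probabilities that xi is
    better than xj; the result lies in [0, 1]."""
    p_pseudo = min(max(p_pseudo, 0.0), 1.0)
    p_model = min(max(p_model, 0.0), 1.0)
    return (math.sqrt(p_pseudo * p_model)
            + math.sqrt((1.0 - p_pseudo) * (1.0 - p_model)))


# Hypothetical numbers: the pseudo-label says xi is better (p* = 1),
# and the policy scored xi around 4.1 and xj around 2.9.
p_hat = thurstone_prob(4.1, 0.05, 2.9, 0.08)
reward = fidelity_reward(1.0, p_hat)
```

The reward approaches 1 when the scores the policy assigns agree with the voted ranking and approaches 0 when they contradict it, which is what makes a ranking label usable as a signal for a scoring prompt.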

3. Prompts

Offline Comparison c_compare:
- <image><image> You are performing an image quality assessment task. Compare the two images and decide which one has better perceptual quality. Answer strictly with the index of the better image: 0 if the first image is better, or 1 if the second image is better.
Online Scoring c_score:
- <image> You are doing the image quality assessment task. Here is the question: What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality.
Reasoning Suffix (for self-consistency sampling):
- You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in boxed{}.

4. Training

Number of Iterations: T = 1 (the open-sourced model weights are the result of the first round of self-evolution)
Training Data: No additional synthetic distortion data and no extra annotated labels were added when producing the released weights
Offline Stage:
- Sample K=32 responses per pair, then derive pseudo-labels via majority voting
- Randomly swap image order to mitigate positional bias
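A minimal sketch of the offline voting step, under stated assumptions: `compare_fn` is a hypothetical stand-in for one sampled VLM response to the comparison prompt (returning 0 if the first image shown is better, 1 if the second), and the order swap is applied independently per response. The card does not say whether swapping happens per response or per pair, so that choice is illustrative.

```python
import random
from collections import Counter


def pseudo_preference(compare_fn, img_a, img_b, k=32, rng=None):
    """Majority vote over k sampled comparisons.

    Returns (winner, agreement): winner is 0 for img_a, 1 for img_b, or
    None on a tie or when no response could be parsed; agreement is the
    fraction of valid votes that went to the winner.
    """
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(k):
        swapped = rng.random() < 0.5
        first, second = (img_b, img_a) if swapped else (img_a, img_b)
        answer = compare_fn(first, second)
        if answer not in (0, 1):
            continue  # skip responses that are not a valid index
        # Map the answer back to the original (img_a, img_b) order.
        winner = 1 - answer if swapped else answer
        votes[winner] += 1
    total = votes[0] + votes[1]
    if total == 0 or votes[0] == votes[1]:
        return None, 0.5
    winner = 0 if votes[0] > votes[1] else 1
    return winner, votes[winner] / total


# Toy comparator for illustration: "images" are numbers, larger is better.
def toy_compare(first, second):
    return 0 if first >= second else 1


winner, agreement = pseudo_preference(toy_compare, 5, 3, rng=random.Random(0))
```

Because each vote is mapped back to the original order, a comparator that always answers with the same position splits its votes roughly evenly instead of producing a confident but spurious label.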
Online Stage (GRPO):
- Sample K=32 responses per sample (c_score)
- Optimizer: AdamW, initial learning rate 3e-7, with linear decay
- KL coefficient:
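For orientation, the two numeric pieces of the online stage can be sketched in plain Python: the group-relative advantage that GRPO computes over the K responses sampled for one input, and a linear learning-rate decay starting from 3e-7. The function names, the use of the population standard deviation, the absence of warmup, and decay all the way to zero are assumptions made for illustration; the card does not state them.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and standard deviation of
    its own sampling group, as in Group Relative Policy Optimization."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def linear_decay_lr(step, total_steps, base_lr=3e-7):
    """Learning rate that falls linearly from base_lr to 0 (assumed
    endpoint) over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - frac)


# Hypothetical fidelity rewards for a small group of sampled responses.
advantages = group_relative_advantages([0.2, 0.8, 0.5, 0.9])
```

Responses with above-average reward in their group get positive advantages and are reinforced; a group in which every response earns the same reward yields all-zero advantages and contributes no gradient.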

5. Evaluation Metrics

Evaluation Setting: zero-shot (no training on the target test sets)
Metrics: PLCC, SRCC (consistency with human subjective quality)
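The two metrics can be computed as below. This is a dependency-free sketch: PLCC is the Pearson correlation between predicted scores and mean opinion scores, and SRCC is the Pearson correlation of their ranks (ties receive average ranks). IQA evaluations often fit a nonlinear logistic mapping before computing PLCC; that step is omitted here, and the card does not state whether the paper applies it.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predictions and mean opinion scores for four images.
predicted = [1.0, 2.0, 3.0, 4.0]
mos = [1.0, 4.0, 9.0, 16.0]
plcc, srcc = pearson(predicted, mos), spearman(predicted, mos)
```

In this toy case the relation is monotonic but not linear, so SRCC is 1 while PLCC is below 1, which illustrates why the two metrics are reported together.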

6. Main Results

Improvement over the Backbone Model (Qwen2.5-VL-7B), weighted average (WAVG) over multiple benchmarks:
- PLCC: 0.615 -> 0.770 (+31.8%)
- SRCC: 0.570 -> 0.726 (+33.7%)
Generalization: Achieves significant improvements across diverse distortion types and AI-generated content, matching or surpassing several supervised VLM-IQA approaches on multiple benchmarks (see the paper for detailed tables)

7. Intended Use and Usage Guidelines

Recommended Use
- Research and evaluation: NR-IQA, cross-dataset generalization comparison, quality ranking/filtering, auxiliary signals for data cleaning
- Pre-production assessment: usable as a perceptual quality proxy, provided it is combined with business data and validated by manual spot checks
Not Recommended Use
- As the sole quality criterion for high-stakes decisions (content moderation, medical imaging diagnostic conclusions, legal evidence adjudication, etc.)
- Treating model outputs as "absolute objective ground truth" (IQA is inherently subjective and correlated with population preferences)
Output Notes
- The paper's prompts require outputs of the form <think>...</think> followed by boxed{score}. For integration, parse only the value inside boxed{} and account for how temperature and sampling strategy affect score consistency
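One way to extract the score from a response is sketched below. The helper name `parse_boxed_score`, the choice to take the last boxed value, and the rejection of values outside the 1 to 5 range are illustrative assumptions; the pattern accepts both `boxed{...}` and `\boxed{...}`, since models may emit either.

```python
import re

# Matches boxed{3.75} or \boxed{3.75}, allowing spaces inside the braces.
_BOXED = re.compile(r"\\?boxed\{\s*([-+]?\d+(?:\.\d+)?)\s*\}")


def parse_boxed_score(text, lo=1.0, hi=5.0):
    """Return the last boxed number in text as a float, or None if there
    is no boxed number or it falls outside [lo, hi]."""
    matches = _BOXED.findall(text)
    if not matches:
        return None
    value = float(matches[-1])
    return value if lo <= value <= hi else None


# Hypothetical model response.
response = "<think>Edges are sharp, mild noise in shadows.</think> \\boxed{3.75}"
score = parse_boxed_score(response)
```

Returning None for malformed or out-of-range outputs lets the caller decide whether to resample or discard the response rather than silently accepting a bad value.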

8. Limitations and Known Risks

Self-supervised Pseudo-label Bias: Pseudo-rankings are derived from the model's own votes, which may amplify the systematic preferences or blind spots of the backbone model
Domain Shift: Scores may be unreliable on images from specialized domains (medical, remote sensing, industrial inspection)
Subjectivity and Population Differences: Different cultural/aesthetic preferences and task objectives (aesthetics vs. clarity) can change the definition of "quality"
Prompt Sensitivity: Variations in prompts, sampling count K, and decoding strategies can affect self-consistency voting and final performance

9. Citation

BibTeX:
@article{wen2025selfevolving,
  title={Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking},
  author={Wen, Wen and Zhi, Tianwu and Fan, Kanglong and Li, Yang and Peng, Xinge and Zhang, Yabin and Liao, Yiting and Li, Junlin and Zhang, Li},
  journal={arXiv preprint arXiv:2509.25787},
  year={2025}
}

