License: apache-2.0
Training
- Base: Qwen/Qwen2.5-VL-7B-Instruct
- Method: DoRA (r=32, alpha=64, dropout=0.05) on attn+MLP proj; merger/projector full fine-tuned; vision tower frozen
- Data: RefCOCO (lmms-lab/refcoco), 50k ShareGPT-format samples (see data/)
- Schedule: 2 epochs, lr=1e-5 cosine, effective batch 96, 1042 steps
- Final train loss: 0.865
- Hardware: 6x Ascend 910B2 (64GB), torch_npu 2.10, transformers 4.57.6, peft 0.19.1
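The step count in the schedule can be cross-checked from the other figures in the list. The sketch below assumes the last partial batch of each epoch is kept (rounding up per epoch). The per-device batch size and gradient-accumulation values are hypothetical, chosen only to show one split that reaches an effective batch of 96 on 6 devices; the card does not state the actual split.

```python
import math


def optimizer_steps(num_samples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs


# Figures taken from the training list above.
steps = optimizer_steps(num_samples=50_000, effective_batch=96, epochs=2)
print(steps)  # 1042

# Hypothetical split (an assumption, not stated in the card):
# per-device batch 4 x gradient accumulation 4 x 6 devices.
per_device, grad_accum, devices = 4, 4, 6
print(per_device * grad_accum * devices)  # 96
```

With 50,000 samples and an effective batch of 96, each epoch takes 521 steps, which matches the reported 1042 steps over 2 epochs.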
Evaluation (RefCOCO val, 500 queries)
| Metric | Base Qwen2.5-VL-7B | This LoRA |
|---|---|---|
| acc@0.5 | 78.8% (*) | 91.2% |
| acc@0.75 | — | 76.8% |
| mIoU | — | 80.5% |
| box-format validity | 0% (**) | 100% |
| latency / query (Ascend 910B2) | — | ~1.62 s |
(*) The base model's output was parsed from its native bbox_2d JSON format and rescaled to [0,1000], so it is scored on equal terms.
(**) The base model does not emit the target [0,1000] (x1,y1),(x2,y2) format; the fine-tuned adapter did so for 100% of the evaluated queries.
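The metrics in the table can be computed along the following lines. This is a minimal sketch, not the evaluation script used for the card: the regular expression, the function names, the choice to score unparseable outputs as IoU 0, and the inclusive thresholds are all assumptions. `rescale_pixels` assumes the base model's box is given in pixel coordinates of an image with known width and height.

```python
import re
from typing import Dict, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) on a [0, 1000] grid

# Matches the target text format "(x1,y1),(x2,y2)" with integer coordinates.
BOX_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_box(text: str) -> Optional[Box]:
    """Return the first well-formed box in text, or None if there is none."""
    m = BOX_RE.search(text)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups())
    if not (0 <= x1 < x2 <= 1000 and 0 <= y1 < y2 <= 1000):
        return None
    return (x1, y1, x2, y2)


def rescale_pixels(box: Box, width: int, height: int) -> Box:
    """Map a pixel-coordinate box onto the [0, 1000] grid."""
    x1, y1, x2, y2 = box
    return (x1 * 1000 / width, y1 * 1000 / height,
            x2 * 1000 / width, y2 * 1000 / height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def summarize(outputs: Sequence[str], targets: Sequence[Box]) -> Dict[str, float]:
    """Aggregate acc@0.5, acc@0.75, mIoU and box-format validity."""
    ious: List[float] = []
    valid = 0
    for text, gt in zip(outputs, targets):
        box = parse_box(text)
        if box is None:
            ious.append(0.0)  # assumption: unparseable output scores IoU 0
        else:
            valid += 1
            ious.append(iou(box, gt))
    n = len(ious)
    return {
        "acc@0.5": sum(v >= 0.5 for v in ious) / n,
        "acc@0.75": sum(v >= 0.75 for v in ious) / n,
        "mIoU": sum(ious) / n,
        "validity": valid / n,
    }
```

Keeping format validity separate from accuracy makes it possible to tell a model that localizes poorly from one that localizes well but answers in a different box format, which is the distinction the two footnotes draw.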
Temporal references (synthetic benchmark, 120 scenes / 360 queries)
A constructed object-presence-history benchmark that tests temporal referring expressions (for example, "the object that was there earlier"):
| condition | acc@0.5 |
|---|---|
| with Temporal Memory log | 86.1% |
| without memory (single frame) | 20.3% |
The roughly 66-point gap (86.1% vs. 20.3%) suggests that temporal references require an explicit memory module. This benchmark is synthetic: it demonstrates the mechanism, not natural-video performance.
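The card does not describe the format of the Temporal Memory log, so the following is only a hypothetical illustration of the idea of an object-presence history. The class name `PresenceLog`, its methods, and the text layout produced by `as_prompt` are all invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) on a [0, 1000] grid


@dataclass
class PresenceLog:
    """Hypothetical object-presence history: one {label: box} dict per frame."""

    frames: List[Dict[str, Box]] = field(default_factory=list)

    def record(self, detections: Dict[str, Box]) -> None:
        """Append a copy of the detections for the next frame."""
        self.frames.append(dict(detections))

    def seen_earlier_but_gone(self) -> Dict[str, Box]:
        """Objects seen in an earlier frame but absent from the latest one,
        mapped to their most recently known box."""
        if not self.frames:
            return {}
        current = self.frames[-1]
        gone: Dict[str, Box] = {}
        for frame in self.frames[:-1]:
            for label, box in frame.items():
                if label not in current:
                    gone[label] = box  # later frames overwrite earlier ones
        return gone

    def as_prompt(self) -> str:
        """Render the history as plain text that could be added to a query."""
        lines = []
        for t, frame in enumerate(self.frames):
            items = ", ".join(f"{label} at {box}" for label, box in sorted(frame.items()))
            lines.append(f"frame {t}: {items or 'nothing detected'}")
        return "\n".join(lines)


log = PresenceLog()
log.record({"cup": (100, 200, 300, 400), "book": (500, 500, 700, 800)})
log.record({"book": (500, 500, 700, 800)})
print(log.seen_earlier_but_gone())  # {'cup': (100, 200, 300, 400)}
```

With only one recorded frame, `seen_earlier_but_gone` has nothing to return, which mirrors why the single-frame condition cannot resolve a phrase such as "the object that was there earlier".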
Detection latency (YOLOv8n)
- ~8 ms median on NVIDIA H20 GPU
- ~305 ms median on CPU (Ascend training server has no GPU)
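Median latencies like those above are typically measured with a few warm-up calls followed by repeated timed calls. The helper below is a generic sketch, not the harness used for these numbers. `median_latency_ms` and its defaults are assumptions, and the workload passed in at the end is a stand-in rather than a real detector.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialization)
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace with a call that runs the detector on one image.
print(median_latency_ms(lambda: sum(range(10_000))))
```

On an accelerator, the timed callable should block until the result is actually available (for example, by copying the output back to the host); otherwise asynchronous execution can make the measured latency look smaller than it is.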
Data
- data/refcoco_en_sft.jsonl — 50k training samples (image filename + referring expr + bbox)
- data/merged_refs.jsonl — merged RefCOCO refs
- Images: COCO train2014 (public). Each sample references a COCO filename.
Usage
```python
from transformers import AutoModelForImageTextToText
from peft import PeftModel

# Load the base model in bfloat16, then attach the adapter weights.
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", dtype="bfloat16")
model = PeftModel.from_pretrained(base, "Steven668866/URIS-Qwen2.5-VL-7B-RefCOCO-LoRA")
```
Model provider: Steven668866
Model tree: base Qwen/Qwen2.5-VL-7B-Instruct; adapter: this model
Modalities: input Text, Image; output Text