Steven668866/URIS-Qwen2.5-VL-7B-RefCOCO-LoRA API & Inference Endpoint

Training

Base: Qwen/Qwen2.5-VL-7B-Instruct
Method: DoRA (r=32, alpha=64, dropout=0.05) on attn+MLP proj; merger/projector full fine-tuned; vision tower frozen
Data: RefCOCO (lmms-lab/refcoco), 50k ShareGPT-format samples (see data/)
Schedule: 2 epochs, lr=1e-5 cosine, effective batch 96, 1042 steps
Final train loss: 0.865
Hardware: 6x Ascend 910B2 (64GB), torch_npu 2.10, transformers 4.57.6, peft 0.19.1
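
The step count in the schedule can be cross-checked against the dataset size and batch size. The sketch below is only arithmetic on the numbers listed above; it assumes the last partial batch of each epoch counts as a step, and the split of the per-device share into micro-batch size and gradient accumulation is not stated in the source.

```python
import math

num_samples = 50_000      # training samples (from the Data line)
epochs = 2
effective_batch = 96      # samples per optimizer step across all devices
num_devices = 6           # 6x Ascend 910B2

# Assumption: the final, partial batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Per-device share of the effective batch (micro-batch x gradient accumulation).
per_device = effective_batch // num_devices

print(steps_per_epoch, total_steps, per_device)  # 521 1042 16
```

The result matches the 1042 steps reported in the schedule.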

Evaluation (RefCOCO val, 500 queries)

Metric	Base Qwen2.5-VL-7B	This LoRA
acc@0.5	78.8% (*)	91.2%
acc@0.75	—	76.8%
mIoU	—	80.5%
box-format validity	0% (**)	100%
latency / query (Ascend 910B2)	—	~1.62 s

(*) For a fair comparison, the base model's native bbox_2d JSON output was parsed and rescaled to [0,1000].
(**) The base model does not emit the target [0,1000] (x1,y1),(x2,y2) format; the LoRA model does so on 100% of the evaluated queries.
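
For readers who want to reproduce numbers of this kind, the metrics in the table can be sketched as follows. This is an illustrative implementation, not the evaluation script used for the table: the function names are hypothetical, the base model's boxes are assumed to be in pixel coordinates before rescaling, an unparseable prediction is assumed to score IoU 0, and the thresholds are assumed to be inclusive.

```python
def rescale_to_1000(box, width, height):
    """Map a pixel-space (x1, y1, x2, y2) box onto the [0, 1000] grid."""
    x1, y1, x2, y2 = box
    return (x1 / width * 1000, y1 / height * 1000,
            x2 / width * 1000, y2 / height * 1000)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def summarize(preds, gts):
    """Compute acc@0.5, acc@0.75 and mIoU; a prediction of None scores 0 (assumption)."""
    ious = [iou(p, g) if p is not None else 0.0 for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "acc@0.5": sum(v >= 0.5 for v in ious) / n,
        "acc@0.75": sum(v >= 0.75 for v in ious) / n,
        "mIoU": sum(ious) / n,
    }
```

Predictions and ground truth must be in the same coordinate space before calling `summarize`, which is why the base model's boxes are rescaled first.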

Temporal references (synthetic benchmark, 120 scenes / 360 queries)

A constructed object-presence-history benchmark that tests temporal referring expressions (e.g. "the object that was there earlier"):

condition	acc@0.5
with Temporal Memory log	86.1%
without memory (single frame)	20.3%

The roughly 66-point gap (86.1% vs. 20.3%) indicates that, on this benchmark, temporal references cannot be resolved from a single frame and need an explicit memory module. The benchmark is synthetic: it demonstrates the mechanism, not natural-video performance.
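
The source does not describe how the Temporal Memory log is implemented. As a rough illustration of the idea only, an object-presence history could be kept as a per-label list of sightings, so that a query about an object that "was there earlier" can be answered from its last known box. Every name below (`PresenceLog`, `record`, `missing_at`) is hypothetical.

```python
from dataclasses import dataclass, field

Box = tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class PresenceLog:
    """Hypothetical per-object history: label -> list of (frame index, box)."""
    history: dict[str, list[tuple[int, Box]]] = field(default_factory=dict)

    def record(self, frame: int, detections: dict[str, Box]) -> None:
        """Store every object detected in the given frame."""
        for label, box in detections.items():
            self.history.setdefault(label, []).append((frame, box))

    def missing_at(self, frame: int) -> dict[str, Box]:
        """Objects seen before `frame` but not detected in it, with their last known box."""
        out = {}
        for label, seen in self.history.items():
            earlier = [(f, b) for f, b in seen if f < frame]
            present_now = any(f == frame for f, _ in seen)
            if earlier and not present_now:
                out[label] = max(earlier, key=lambda fb: fb[0])[1]
        return out
```

A single-frame model has no access to this history, which is consistent with the low accuracy in the "without memory" row.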

Detection latency (YOLOv8n)

~8 ms median on NVIDIA H20 GPU
~305 ms median on CPU (Ascend training server has no GPU)
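
Median latencies like these are typically obtained by timing repeated calls after a few warm-up runs. The harness below is a generic sketch, not the script behind the numbers above; `median_latency_ms` and its parameters are hypothetical, and `fn` stands for any zero-argument callable that runs one detection.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew the median.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

On a GPU, `fn` would also need to wait for the device to finish (for example by synchronizing inside the callable), otherwise only the kernel launch time is measured.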

Data

data/refcoco_en_sft.jsonl — 50k training samples (image filename + referring expression + bbox)
data/merged_refs.jsonl — merged RefCOCO refs
Images: COCO train2014 (public). Each sample references a COCO filename.
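
The JSONL files hold one JSON object per line, so they can be inspected with the standard library alone. The helper below is a minimal sketch; `read_jsonl` is a hypothetical name, and it makes no assumption about the field names inside each record, since the source does not list them.

```python
import json
from pathlib import Path


def read_jsonl(path, limit=None):
    """Read up to `limit` JSON records from a JSONL file (all records if limit is None)."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if limit is not None and len(records) >= limit:
                break
    return records
```

Calling `read_jsonl("data/refcoco_en_sft.jsonl", limit=3)` and printing the keys of each record is a quick way to see the actual schema before writing a data loader.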

Usage

from transformers import AutoModelForImageTextToText
from peft import PeftModel

# Load the base model in bfloat16.
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", dtype="bfloat16")
# Attach the adapter weights from this repository on top of the base model.
model = PeftModel.from_pretrained(
    base, "Steven668866/URIS-Qwen2.5-VL-7B-RefCOCO-LoRA")
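
Since the adapter emits boxes as (x1,y1),(x2,y2) on a [0,1000] grid, generated text has to be mapped back to pixel coordinates before drawing or cropping. The parser below is a sketch under the assumption that the coordinates appear as plain integers in exactly that layout; `parse_box` is a hypothetical helper, not part of this repository.

```python
import re

# Assumed output layout: "(x1,y1),(x2,y2)" with integer coordinates in [0, 1000].
_BOX_RE = re.compile(
    r"\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_box(text, width, height):
    """Return the first box in `text` as pixel (x1, y1, x2, y2), or None if none is found."""
    match = _BOX_RE.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(v) for v in match.groups())
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)
```

Returning None for unparseable text makes it easy to count format failures separately from localization errors.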
