Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Training

  • Base: Qwen/Qwen2.5-VL-7B-Instruct
  • Method: DoRA (r=32, alpha=64, dropout=0.05) on attn+MLP proj; merger/projector full fine-tuned; vision tower frozen
  • Data: RefCOCO (lmms-lab/refcoco), 50k ShareGPT-format samples (see data/)
  • Schedule: 2 epochs, lr=1e-5 cosine, effective batch 96, 1042 steps
  • Final train loss: 0.865
  • Hardware: 6x Ascend 910B2 (64GB), torch_npu 2.10, transformers 4.57.6, peft 0.19.1

Evaluation (RefCOCO val, 500 queries)

MetricBase Qwen2.5-VL-7BThis LoRA
acc@0.578.8% (*)91.2%
acc@0.7576.8%
mIoU80.5%
box-format validity0% (**)100%
latency / query (Ascend 910B2)~1.62 s

(*) base parsed fairly from its native bbox_2d JSON output and rescaled to [0,1000]. (**) base does not emit the target [0,1000] (x1,y1),(x2,y2) format; LoRA does, 100% of the time.

Temporal references (synthetic benchmark, 120 scenes / 360 queries)

Constructed object-presence-history benchmark testing temporal referring expressions ("the object that was there earlier"):

conditionacc@0.5
with Temporal Memory log86.1%
without memory (single frame)20.3%

The 66-point gap shows temporal references require an explicit memory module. This benchmark is synthetic — it demonstrates the mechanism, not natural-video performance.

Detection latency (YOLOv8n)

  • ~8 ms median on NVIDIA H20 GPU
  • ~305 ms median on CPU (Ascend training server has no GPU)

Data

  • data/refcoco_en_sft.jsonl — 50k training samples (image filename + referring expr + bbox)
  • data/merged_refs.jsonl — merged RefCOCO refs
  • Images: COCO train2014 (public). Each sample references a COCO filename.

Usage

markdown

from transformers import AutoModelForImageTextToText
from peft import PeftModel
base = AutoModelForImageTextToText.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct", dtype="bfloat16")
model = PeftModel.from_pretrained(
base, "Steven668866/URIS-Qwen2.5-VL-7B-RefCOCO-LoRA")

Model provider

Steven668866

Model tree

Base

Qwen/Qwen2.5-VL-7B-Instruct

Adapter

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today