chiawen0104/VLMPed-wo-CoT API & Inference Endpoint

Model Details

Developed by: chiawen0104
Model type: Vision-Language Model (LoRA fine-tuned)
Finetuned from: Qwen/Qwen2.5-VL-3B-Instruct
Task: Pedestrian crossing intention prediction (binary: cross / not cross)
Training datasets: JAAD, PIE
Framework: PEFT 0.15.1

How to Get Started

python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct"
)
model = PeftModel.from_pretrained(base_model, "chiawen0104/VLMPed-wo-CoT")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

Training Details

Base model: Qwen2.5-VL-3B-Instruct
Fine-tuning method: LoRA (via PEFT)
Training regime: bf16 mixed precision
Training data: JAAD and PIE pedestrian crossing intention datasets
CoT supervision: None (direct prediction without chain-of-thought)

Intended Use

This model takes multi-frame pedestrian images as input and predicts whether a pedestrian intends to cross the street. It is intended for research purposes in autonomous driving and pedestrian behavior analysis.

Differences from VLMPed-CoT

	VLMPed-CoT	VLMPed-wo-CoT
CoT supervision	✅	❌
Direct prediction	✅	✅

Reference

Original Paper: VLMPed-CoT: A large vision-language model with a chain-of-thought mechanism for pedestrian crossing intention prediction
Original implementation: lyc2121/VLMPed-CoT-for-Pedestrian-Crossing-Intention-Prediction
Companion model: chiawen0104/VLMPed-CoT

Framework versions

PEFT 0.15.1

VLMPed-wo-CoT

Get help setting up a custom Dedicated Endpoints.

README