namansmishaps

PedestrianQA-TITAN-Qwen2.5-VL-3B-Instruct


README

Model Details

This checkpoint is based on the instruction-tuned 3B Qwen2.5-VL model. Qwen2.5-VL supports image and video-style multimodal inputs and can produce structured text outputs, including coordinates and natural-language explanations.

The PedestrianQA fine-tuning adapts the base model to short-horizon pedestrian prediction in structured and unstructured traffic scenes. Training samples contain observation frames centered on a target pedestrian and QA supervision derived from IDD-PeD, JAAD, PIE, and TITAN.

Intended Use

This model is intended for research on:

  • pedestrian intention prediction,
  • pedestrian trajectory prediction,
  • explainable vision-language models for autonomous driving,
  • video question answering in traffic scenes,
  • rationale generation for safety-critical scene understanding.

It is not intended for direct use in deployed autonomous driving, ADAS, traffic enforcement, or other safety-critical production systems.

Input Format

The model expects visual observations of an ego-vehicle scene and a text prompt. In the PedestrianQA setup:

  • JAAD, PIE, and IDD-PeD use 15 observation frames.
  • TITAN uses 10 observation frames.
  • The target pedestrian is identified by their bounding box in the first frame.
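
The frame counts above can be kept in a small lookup when assembling inputs. The sketch below is illustrative only: the `OBS_FRAMES` table mirrors the counts stated in this card, while the `frame_uris` helper and the `frame_0001.jpg` naming pattern are assumptions, not part of the PedestrianQA release.

```python
# Observation-window lengths as stated in this model card.
OBS_FRAMES = {"JAAD": 15, "PIE": 15, "IDD-PeD": 15, "TITAN": 10}


def frame_uris(frame_dir, first_index, dataset):
    """Return file:// URIs for one observation window.

    Assumes frames are stored as frame_0001.jpg, frame_0002.jpg, ...
    under an absolute directory (a hypothetical layout).
    """
    count = OBS_FRAMES[dataset]
    directory = frame_dir.rstrip("/")
    return [
        f"file://{directory}/frame_{i:04d}.jpg"
        for i in range(first_index, first_index + count)
    ]


uris = frame_uris("/path/to/clip", 1, "TITAN")
print(len(uris), uris[0])
```

The resulting list can be passed as the `video` entry of a chat message, in the same way as the hand-written list in the usage example.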

Example pedestrian intention prediction (PIP) prompt:

text

Will the pedestrian located at [x1, y1, x2, y2] in frame 1 cross the road?
Justify your answer with spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning.
Conclude the pedestrian's motives.

Example pedestrian trajectory prediction (PTP) prompt:

text

Given the trajectory of the pedestrian located at [x1, y1, x2, y2]:
[[x1, y1, x2, y2], ...],
predict their trajectory for the next N frames.
Justify your answer with spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning.
Predict the pedestrian's final destination and conclude their trajectory.
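
Because the model depends on the box given in the prompt, it can help to build both prompts programmatically and reject malformed boxes early. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the `x1 < x2, y1 < y2` check assumes a top-left/bottom-right corner convention, and the template lines are joined with single spaces, which may differ from the whitespace used during fine-tuning.

```python
REASONING = (
    "Justify your answer with spatial, temporal, mathematical, "
    "ego-vehicle, and scene-context reasoning."
)


def _check_box(box):
    """Validate an [x1, y1, x2, y2] box (assumed corner convention)."""
    if len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(box)}")
    x1, y1, x2, y2 = box
    if not (x1 < x2 and y1 < y2):
        raise ValueError(f"degenerate box: {box}")
    return [x1, y1, x2, y2]


def build_pip_prompt(box):
    """Intention prompt for the pedestrian at `box` in frame 1."""
    box = _check_box(box)
    return (
        f"Will the pedestrian located at {box} in frame 1 cross the road? "
        f"{REASONING} Conclude the pedestrian's motives."
    )


def build_ptp_prompt(history, n_future):
    """Trajectory prompt from observed boxes; the first box is the target."""
    boxes = [_check_box(b) for b in history]
    return (
        f"Given the trajectory of the pedestrian located at {boxes[0]}: "
        f"{boxes}, predict their trajectory for the next {n_future} frames. "
        f"{REASONING} Predict the pedestrian's final destination and "
        "conclude their trajectory."
    )


print(build_pip_prompt([1103, 908, 1138, 980]))
```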

Output Format

For PIP, the model should produce:

  • Answer: Yes or No
  • Spatial_Reason
  • Temporal_Reason
  • Mathematical_Reason
  • Ego_Vehicle_Reason
  • Scene_Context_Reason
  • Conclusion

For PTP, the model should produce:

  • Answer: a list of future bounding boxes in [x1, y1, x2, y2] format
  • Spatial_Reason
  • Temporal_Reason
  • Mathematical_Reason
  • Ego_Vehicle_Reason
  • Scene_Context_Reason
  • Final_Destination
  • Conclusion
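
For downstream evaluation it is often convenient to split the generated text into these fields. The parser below is a sketch, assuming each field begins a line in the form `Name: value`; this card does not specify the exact serialization, so the pattern may need adjusting to real outputs. The `parse_output` name is hypothetical.

```python
import re

FIELDS = [
    "Answer", "Spatial_Reason", "Temporal_Reason", "Mathematical_Reason",
    "Ego_Vehicle_Reason", "Scene_Context_Reason", "Final_Destination",
    "Conclusion",
]
# Matches a known field name at the start of a line, followed by a colon.
_FIELD_RE = re.compile(r"^\s*(%s)\s*:\s*" % "|".join(FIELDS), re.MULTILINE)


def parse_output(text):
    """Split generated text into a {field: value} dict.

    Fields the model omitted are simply absent from the result.
    """
    matches = list(_FIELD_RE.finditer(text))
    parsed = {}
    for i, m in enumerate(matches):
        # A field's value runs until the next field header or end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parsed[m.group(1)] = text[m.end():end].strip()
    return parsed


sample = (
    "Answer: Yes\n"
    "Spatial_Reason: The pedestrian is near the curb.\n"
    "Conclusion: Likely to cross."
)
print(parse_output(sample))
```

Checking that every expected key is present before scoring gives a cheap guard against truncated or malformed generations.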

Usage

Install dependencies:

bash

pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]"

Example inference code:

python

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "namansmishaps/PedestrianQA-TITAN-Qwen2.5-VL-3B-Instruct"

# Load the fine-tuned checkpoint and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Observation frames in temporal order. TITAN samples use 10 frames;
# only three are listed here for brevity.
frame_paths = [
    "file:///path/to/frame_0001.jpg",
    "file:///path/to/frame_0002.jpg",
    "file:///path/to/frame_0003.jpg",
]

# PIP prompt; the box identifies the target pedestrian in frame 1.
prompt = (
    "Will the pedestrian located at [1103, 908, 1138, 980] in frame 1 cross the road? "
    "Justify your answer with spatial, temporal, mathematical, ego-vehicle, and "
    "scene-context reasoning. Conclude the pedestrian's motives."
)

# The frame list is passed as a single video input.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frame_paths},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output)

Replace the frame paths and target pedestrian box with the sample you want to evaluate.

Training Data

The model is fine-tuned on PedestrianQA, which is derived from four pedestrian datasets:

  • IDD-PeD
  • JAAD
  • PIE
  • TITAN

PedestrianQA contains 10,717 intention QA samples and 3,594 trajectory QA samples. Each sample includes structured rationales grounded in visual motion, pedestrian pose, ego-vehicle behavior, scene context, and geometric trajectory cues.

Users must acquire all necessary licenses and permissions for the source datasets before reconstructing or using the corresponding video inputs.

Evaluation

The paper evaluates PedestrianQA models on PIP, PTP, and rationale quality.

For the model fine-tuned on all PedestrianQA datasets, the reported overall results are:

| Task | Metric   | Result |
|------|----------|--------|
| PIP  | Accuracy | 0.783  |
| PIP  | F1       | 0.542  |
| PTP  | ADE      | 37 px  |
| PTP  | FDE      | 68 px  |
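
ADE and FDE are conventionally the average and final-step displacement errors between predicted and ground-truth positions. The sketch below shows one common way to compute them in pixels, assuming box centers are the compared points and Euclidean distance is used; the paper's exact protocol is not given in this card and may differ. The helper names are hypothetical.

```python
import math


def box_center(box):
    """Center (cx, cy) of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def ade_fde(pred_boxes, gt_boxes):
    """Return (ADE, FDE) over two equally long box sequences."""
    if not pred_boxes or len(pred_boxes) != len(gt_boxes):
        raise ValueError("sequences must be non-empty and of equal length")
    dists = []
    for pred, gt in zip(pred_boxes, gt_boxes):
        (px, py), (gx, gy) = box_center(pred), box_center(gt)
        dists.append(math.hypot(px - gx, py - gy))
    # ADE averages over all future steps; FDE uses only the last step.
    return sum(dists) / len(dists), dists[-1]


pred = [[0, 0, 2, 2], [0, 0, 2, 2]]
gt = [[0, 0, 2, 2], [3, 4, 5, 6]]
print(ade_fde(pred, gt))  # (2.5, 5.0)
```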

Rationale-generation scores for the same model:

| Rationale category           | Score |
|------------------------------|-------|
| Spatial Reasoning            | 58.36 |
| Temporal Reasoning           | 54.84 |
| Mathematical Reasoning       | 51.68 |
| Ego-Vehicle Reasoning        | 61.51 |
| Scene-Context Reasoning      | 59.72 |
| Final Destination Prediction | 32.24 |
| Conclusion                   | 60.25 |

Limitations

  • The model is trained for short-horizon pedestrian behavior prediction and should not be treated as a complete autonomous-driving stack.
  • Predictions may be wrong in rare, occluded, low-visibility, out-of-domain, or distribution-shifted scenarios.
  • Generated rationales can be incomplete or hallucinated, especially when the visual evidence is ambiguous.
  • The model depends on accurate target-pedestrian localization in the prompt.
  • The model should not be used as the sole basis for safety-critical decisions.

License and Data Terms

This model is based on Qwen/Qwen2.5-VL-3B-Instruct; users must comply with the base model's license and terms.

The fine-tuning data is derived from IDD-PeD, JAAD, PIE, and TITAN. Users must also comply with all upstream dataset licenses and access requirements. Because some source datasets impose non-commercial or access-controlled terms, this model should be treated as a research-use model unless a separate license explicitly states otherwise.

Citation

If you use this model or the PedestrianQA dataset, please cite:

bibtex

@inproceedings{mishra2026pedestrianqa,
title = {PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction},
author = {Mishra, Naman and Gangisetty, Shankar and Jawahar, C. V.},
booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
year = {2026},
url = {https://github.com/botmahn/PedestrianQA}
}

Base model:

bibtex

@article{Qwen2.5-VL,
title={Qwen2.5-VL Technical Report},
author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
journal={arXiv preprint arXiv:2502.13923},
year={2025}
}

Model provider

namansmishaps

Model tree

Base

Qwen/Qwen2.5-VL-3B-Instruct

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text
