namansmishaps
PedestrianQA-JAAD-Qwen2.5-VL-3B-Instruct
Links
- Project webpage: https://botmahn.github.io/pqa_pw/
- Dataset/code: https://github.com/botmahn/PedestrianQA
- Model collection: https://huggingface.co/collections/namansmishaps/pedestrianqa
- Base model: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
Model Details
This checkpoint is based on the instruction-tuned 3B Qwen2.5-VL model. Qwen2.5-VL supports image and video-style multimodal inputs and can produce structured text outputs, including coordinates and natural-language explanations.
The PedestrianQA fine-tuning adapts the base model to short-horizon pedestrian prediction in structured and unstructured traffic scenes. Training samples contain observation frames centered on a target pedestrian and QA supervision derived from IDD-PeD, JAAD, PIE, and TITAN.
Intended Use
This model is intended for research on:
- pedestrian intention prediction,
- pedestrian trajectory prediction,
- explainable vision-language models for autonomous driving,
- video question answering in traffic scenes,
- rationale generation for safety-critical scene understanding.
It is not intended for direct use in deployed autonomous driving, ADAS, traffic enforcement, or other safety-critical production systems.
Input Format
The model expects visual observations of an ego-vehicle scene and a text prompt. In the PedestrianQA setup:
- JAAD, PIE, and IDD-PeD use 15 observation frames.
- TITAN uses 10 observation frames.
- The target pedestrian is identified by their bounding box in the first frame.
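The frame counts above can be captured in a small lookup. The following is a minimal sketch, not part of the released code: the helper name `select_observation_frames` is hypothetical, and taking the first N frames of a clip is an assumption made so that the target box in frame 1 stays aligned with the first selected frame.

```python
# Observation-window lengths as listed above; the helper itself is illustrative.
OBSERVATION_FRAMES = {"JAAD": 15, "PIE": 15, "IDD-PeD": 15, "TITAN": 10}


def select_observation_frames(frame_paths, dataset):
    """Return the first N frames of a clip, where N depends on the dataset.

    Assumption: the clip starts at the frame that carries the target
    pedestrian's bounding box, so slicing from index 0 keeps "frame 1" first.
    """
    n = OBSERVATION_FRAMES[dataset]
    if len(frame_paths) < n:
        raise ValueError(
            f"{dataset} expects {n} observation frames, got {len(frame_paths)}"
        )
    return list(frame_paths[:n])
```

A clip shorter than the expected window raises an error rather than being padded, since the card does not describe how short clips should be handled.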
Example PIP prompt:
```text
Will the pedestrian located at [x1, y1, x2, y2] in frame 1 cross the road?
Justify your answer with spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning.
Conclude the pedestrian's motives.
```
Example PTP prompt:
```text
Given the trajectory of the pedestrian located at [x1, y1, x2, y2]:
[[x1, y1, x2, y2], ...],
predict their trajectory for the next N frames.
Justify your answer with spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning.
Predict the pedestrian's final destination and conclude their trajectory.
```
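The two templates can be filled in programmatically. This is a sketch under stated assumptions: the function names `build_pip_prompt` and `build_ptp_prompt` are hypothetical, and the exact whitespace between sentences (single spaces here) is an assumption based on the inference example in this card, not a confirmed training format.

```python
RATIONALE_REQUEST = (
    "Justify your answer with spatial, temporal, mathematical, ego-vehicle, "
    "and scene-context reasoning."
)


def format_box(box):
    """Render a box as "[x1, y1, x2, y2]"."""
    x1, y1, x2, y2 = box
    return f"[{x1}, {y1}, {x2}, {y2}]"


def build_pip_prompt(box):
    """Intention (PIP) prompt for the pedestrian at `box` in frame 1."""
    return (
        f"Will the pedestrian located at {format_box(box)} in frame 1 "
        f"cross the road? {RATIONALE_REQUEST} "
        "Conclude the pedestrian's motives."
    )


def build_ptp_prompt(box, observed_boxes, num_future_frames):
    """Trajectory (PTP) prompt from the observed boxes of one pedestrian."""
    trajectory = "[" + ", ".join(format_box(b) for b in observed_boxes) + "]"
    return (
        f"Given the trajectory of the pedestrian located at {format_box(box)}: "
        f"{trajectory}, predict their trajectory for the next "
        f"{num_future_frames} frames. {RATIONALE_REQUEST} "
        "Predict the pedestrian's final destination and conclude their trajectory."
    )
```

If the separators used during fine-tuning differ from the ones assumed here, adjusting the f-strings to match them is likely to matter more than any other detail of the helper.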
Output Format
For PIP, the model should produce:
- `Answer`: `Yes` or `No`
- `Spatial_Reason`
- `Temporal_Reason`
- `Mathematical_Reason`
- `Ego_Vehicle_Reason`
- `Scene_Context_Reason`
- `Conclusion`
For PTP, the model should produce:
- `Answer`: a list of future bounding boxes in `[x1, y1, x2, y2]` format
- `Spatial_Reason`
- `Temporal_Reason`
- `Mathematical_Reason`
- `Ego_Vehicle_Reason`
- `Scene_Context_Reason`
- `Final_Destination`
- `Conclusion`
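The field names above suggest a simple way to split a response into its parts. The sketch below assumes each field appears at the start of a line as `Field_Name: text`; the card does not specify the exact serialization, so treat the pattern as a starting point and the names `parse_structured_output` and `parse_boxes` as hypothetical.

```python
import re

PIP_FIELDS = [
    "Answer", "Spatial_Reason", "Temporal_Reason", "Mathematical_Reason",
    "Ego_Vehicle_Reason", "Scene_Context_Reason", "Conclusion",
]
PTP_FIELDS = PIP_FIELDS[:-1] + ["Final_Destination", "Conclusion"]

_NUM = r"(-?\d+(?:\.\d+)?)"
BOX_RE = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)


def parse_structured_output(text, fields):
    """Split model output into {field: text}, assuming "Field: ..." lines."""
    names = "|".join(re.escape(f) for f in fields)
    pattern = re.compile(r"^[ \t]*(" + names + r")[ \t]*:[ \t]*", re.MULTILINE)
    matches = list(pattern.finditer(text))
    parsed = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Keep the first occurrence if a field name is repeated.
        parsed.setdefault(match.group(1), text[match.end():end].strip())
    return parsed


def parse_boxes(answer):
    """Extract every [x1, y1, x2, y2] box from a PTP answer string."""
    return [tuple(float(v) for v in m.groups()) for m in BOX_RE.finditer(answer)]
```

Fields the model omits are simply absent from the returned dictionary, which makes incomplete rationales easy to detect downstream.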
Usage
Install dependencies:
```bash
pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]
```
Example inference code:
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "namansmishaps/PedestrianQA-JAAD-Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Observation frames, passed to the model as a single video input.
frame_paths = [
    "file:///path/to/frame_0001.jpg",
    "file:///path/to/frame_0002.jpg",
    "file:///path/to/frame_0003.jpg",
]

prompt = (
    "Will the pedestrian located at [1103, 908, 1138, 980] in frame 1 cross the road? "
    "Justify your answer with spatial, temporal, mathematical, ego-vehicle, and "
    "scene-context reasoning. Conclude the pedestrian's motives."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frame_paths},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output)
```
Replace the frame paths and target pedestrian box with the sample you want to evaluate.
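When the frames of a sample sit in one directory, the `file://` list can be built rather than typed by hand. This is a minimal sketch: `frame_uris` is a hypothetical helper, and it assumes zero-padded file names (such as `frame_0001.jpg`) so that lexicographic order matches temporal order.

```python
from pathlib import Path


def frame_uris(frame_dir, pattern="*.jpg"):
    """Return sorted file:// URIs for the frames found in `frame_dir`.

    Assumption: file names are zero-padded, so sorting by name yields
    the frames in temporal order.
    """
    paths = sorted(Path(frame_dir).resolve().glob(pattern))
    return [p.as_uri() for p in paths]
```

The returned list can be used in place of a hand-written list of frame paths, after trimming it to the number of observation frames expected for the dataset.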
Training Data
The model is fine-tuned on PedestrianQA, which is derived from four pedestrian datasets:
- IDD-PeD
- JAAD
- PIE
- TITAN
PedestrianQA contains 10,717 intention QA samples and 3,594 trajectory QA samples. Each sample includes structured rationales grounded in visual motion, pedestrian pose, ego-vehicle behavior, scene context, and geometric trajectory cues.
Users must acquire all necessary licenses and permissions for the source datasets before reconstructing or using the corresponding video inputs.
Evaluation
The paper evaluates PedestrianQA models on PIP, PTP, and rationale quality.
For the model fine-tuned on all PedestrianQA datasets, the reported overall results are:
| Task | Metric | Result |
|---|---|---|
| PIP | Accuracy | 0.783 |
| PIP | F1 | 0.542 |
| PTP | ADE | 37 px |
| PTP | FDE | 68 px |
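For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute them. It is not the paper's evaluation code: measuring displacement between bounding-box centers, scoring a single trajectory at a time, and treating `Yes` as the positive class for F1 are all assumptions.

```python
import math


def box_center(box):
    """Center (cx, cy) of an [x1, y1, x2, y2] box in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def displacement_errors(pred_boxes, gt_boxes):
    """Return (ADE, FDE) in pixels for one predicted trajectory.

    ADE averages the center distance over all future frames;
    FDE is the center distance at the last future frame.
    """
    if not pred_boxes or len(pred_boxes) != len(gt_boxes):
        raise ValueError("trajectories must be non-empty and of equal length")
    dists = [
        math.dist(box_center(p), box_center(g))
        for p, g in zip(pred_boxes, gt_boxes)
    ]
    return sum(dists) / len(dists), dists[-1]


def accuracy_and_f1(predictions, labels, positive="Yes"):
    """Accuracy and positive-class F1 for Yes/No intention answers."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    accuracy = sum(p == g for p, g in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, f1
```

Accuracy can sit well above F1 when one class dominates the labels, which is worth keeping in mind when reading the two PIP rows together.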
Rationale-generation scores for the same model:
| Rationale category | Score |
|---|---|
| Spatial Reasoning | 58.36 |
| Temporal Reasoning | 54.84 |
| Mathematical Reasoning | 51.68 |
| Ego-Vehicle Reasoning | 61.51 |
| Scene-Context Reasoning | 59.72 |
| Final Destination Prediction | 32.24 |
| Conclusion | 60.25 |
Limitations
- The model is trained for short-horizon pedestrian behavior prediction and should not be treated as a complete autonomous-driving stack.
- Predictions may be wrong in rare, occluded, low-visibility, out-of-domain, or distribution-shifted scenarios.
- Generated rationales can be incomplete or hallucinated, especially when the visual evidence is ambiguous.
- The model depends on accurate target-pedestrian localization in the prompt.
- The model should not be used as the sole basis for safety-critical decisions.
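Because the prompt carries the target box as plain numbers, a basic sanity check before inference can catch malformed boxes early. The following sketch assumes pixel coordinates in the frame's own resolution with the origin at the top-left corner; `validate_box` is a hypothetical helper, and passing the check does not show that the box actually covers the intended pedestrian.

```python
def validate_box(box, image_width, image_height):
    """Return a list of problems with an [x1, y1, x2, y2] pixel box.

    An empty list means the box is well-formed and lies inside the image.
    """
    if len(box) != 4:
        return ["box must have exactly four values"]
    x1, y1, x2, y2 = box
    problems = []
    if x1 >= x2 or y1 >= y2:
        problems.append("box has non-positive width or height")
    if x1 < 0 or y1 < 0 or x2 > image_width or y2 > image_height:
        problems.append("box extends outside the image")
    return problems
```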
License and Data Terms
This model is based on Qwen/Qwen2.5-VL-3B-Instruct; users must comply with the base model's license and terms.
The fine-tuning data is derived from IDD-PeD, JAAD, PIE, and TITAN. Users must also comply with all upstream dataset licenses and access requirements. Because some source datasets impose non-commercial or access-controlled terms, this model should be treated as a research-use model unless a separate license explicitly states otherwise.
Citation
If you use this model or the PedestrianQA dataset, please cite:
```bibtex
@inproceedings{mishra2026pedestrianqa,
  title     = {PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction},
  author    = {Mishra, Naman and Gangisetty, Shankar and Jawahar, C. V.},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026},
  url       = {https://github.com/botmahn/PedestrianQA}
}
```
Base model:
```bibtex
@article{Qwen2.5-VL,
  title   = {Qwen2.5-VL Technical Report},
  author  = {Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal = {arXiv preprint arXiv:2502.13923},
  year    = {2025}
}
```
Model provider
namansmishaps
Model tree
Base
Qwen/Qwen2.5-VL-3B-Instruct
Fine-tuned
this model
Modalities
Input
Text, Image
Output
Text