namansmishaps

PedestrianQA-JAAD-Qwen2.5-VL-3B-Instruct

Model Details

This checkpoint is based on the instruction-tuned 3B Qwen2.5-VL model. Qwen2.5-VL supports image and video-style multimodal inputs and can produce structured text outputs, including coordinates and natural-language explanations.

The PedestrianQA fine-tuning adapts the base model to short-horizon pedestrian prediction in structured and unstructured traffic scenes. Training samples contain observation frames centered on a target pedestrian and QA supervision derived from IDD-PeD, JAAD, PIE, and TITAN.

Intended Use

This model is intended for research on:

pedestrian intention prediction,
pedestrian trajectory prediction,
explainable vision-language models for autonomous driving,
video question answering in traffic scenes,
rationale generation for safety-critical scene understanding.

It is not intended for direct use in deployed autonomous driving, ADAS, traffic enforcement, or other safety-critical production systems.

Input Format

The model expects visual observations of an ego-vehicle scene and a text prompt. In the PedestrianQA setup:

JAAD, PIE, and IDD-PeD use 15 observation frames.
TITAN uses 10 observation frames.
The target pedestrian is identified by their bounding box in the first frame.

Example pedestrian intention prediction (PIP) prompt:

text
Will the pedestrian located at [x1, y1, x2, y2] in frame 1 cross the road?
Justify your answer with spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning.
Conclude the pedestrian's motives.

Example pedestrian trajectory prediction (PTP) prompt:

text
Given the trajectory of the pedestrian located at [x1, y1, x2, y2]:
[[x1, y1, x2, y2], ...],
predict their trajectory for the next N frames.
Justify your answer with spatial, temporal, mathematical, ego-vehicle, and scene-context reasoning.
Predict the pedestrian's final destination and conclude their trajectory.
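The two templates above can be filled in programmatically. The following is a minimal sketch, not the authors' tooling: the helper names (`format_box`, `build_pip_prompt`, `build_ptp_prompt`) are hypothetical, and joining the template lines with single spaces is an assumption, since the card does not say whether the fine-tuned model is sensitive to line breaks in the prompt.

```python
from typing import Sequence

Box = Sequence[float]  # [x1, y1, x2, y2] in pixel coordinates

# Shared rationale request, copied from the prompt templates in this card.
RATIONALE_REQUEST = (
    "Justify your answer with spatial, temporal, mathematical, ego-vehicle, "
    "and scene-context reasoning."
)


def format_box(box: Box) -> str:
    """Render a box as "[x1, y1, x2, y2]" after a basic sanity check."""
    if len(box) != 4:
        raise ValueError("a box needs exactly four coordinates")
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate box: {list(box)}")
    return "[" + ", ".join(str(v) for v in box) + "]"


def build_pip_prompt(box: Box) -> str:
    """Intention prompt for the pedestrian at `box` in frame 1."""
    return (
        f"Will the pedestrian located at {format_box(box)} in frame 1 cross the road? "
        f"{RATIONALE_REQUEST} Conclude the pedestrian's motives."
    )


def build_ptp_prompt(observed: Sequence[Box], horizon: int) -> str:
    """Trajectory prompt from the observed boxes; `horizon` is N future frames."""
    if not observed:
        raise ValueError("at least one observed box is required")
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    track = "[" + ", ".join(format_box(b) for b in observed) + "]"
    return (
        f"Given the trajectory of the pedestrian located at {format_box(observed[0])}: "
        f"{track}, predict their trajectory for the next {horizon} frames. "
        f"{RATIONALE_REQUEST} "
        "Predict the pedestrian's final destination and conclude their trajectory."
    )
```

For example, `build_pip_prompt([1103, 908, 1138, 980])` reproduces the prompt string used in the inference example in the Usage section.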

Output Format

For PIP, the model should produce:

Answer: Yes or No
Spatial_Reason
Temporal_Reason
Mathematical_Reason
Ego_Vehicle_Reason
Scene_Context_Reason
Conclusion

For PTP, the model should produce:

Answer: a list of future bounding boxes in [x1, y1, x2, y2] format
Spatial_Reason
Temporal_Reason
Mathematical_Reason
Ego_Vehicle_Reason
Scene_Context_Reason
Final_Destination
Conclusion
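Downstream evaluation usually needs these fields as data rather than free text. The sketch below assumes the model emits each field as `Field_Name: value` in plain text; the card lists the field names but not the exact serialization, so that layout and the helper names (`split_fields`, `parse_pip_answer`, `parse_ptp_boxes`) are assumptions to adjust against real outputs.

```python
import re
from typing import Dict, List, Optional

# Field names as listed in this card.
PIP_FIELDS = [
    "Answer", "Spatial_Reason", "Temporal_Reason", "Mathematical_Reason",
    "Ego_Vehicle_Reason", "Scene_Context_Reason", "Conclusion",
]
PTP_FIELDS = PIP_FIELDS[:-1] + ["Final_Destination", "Conclusion"]

# Matches one "[x1, y1, x2, y2]" group of integers or decimals.
_NUM = r"-?\d+(?:\.\d+)?"
_BOX_RE = re.compile(r"\[\s*" + r"\s*,\s*".join([f"({_NUM})"] * 4) + r"\s*\]")


def split_fields(output: str, fields: List[str]) -> Dict[str, str]:
    """Split "Field: value" text into a dict; fields that are absent are omitted."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, fields)) + r")\s*:")
    matches = list(pattern.finditer(output))
    parsed: Dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        # Keep the first occurrence if a field name repeats.
        parsed.setdefault(match.group(1), output[match.end():end].strip())
    return parsed


def parse_pip_answer(output: str) -> Optional[bool]:
    """Return True for Yes, False for No, None if no answer is recognised."""
    answer = split_fields(output, PIP_FIELDS).get("Answer", "").lower()
    if answer.startswith("yes"):
        return True
    if answer.startswith("no"):
        return False
    return None


def parse_ptp_boxes(output: str) -> List[List[float]]:
    """Extract the predicted future boxes from the Answer field."""
    answer = split_fields(output, PTP_FIELDS).get("Answer", "")
    return [[float(v) for v in m.groups()] for m in _BOX_RE.finditer(answer)]
```

Returning `None` instead of guessing keeps malformed generations visible, which matters because the rationales and answers can be incomplete.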

Usage

Install dependencies:

bash
pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]"

Example inference code:

python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "namansmishaps/PedestrianQA-JAAD-Qwen2.5-VL-3B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

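# Ordered observation frames (the PedestrianQA setup uses 15 for JAAD); three are listed here for brevity.
frame_paths = [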
    "file:///path/to/frame_0001.jpg",
    "file:///path/to/frame_0002.jpg",
    "file:///path/to/frame_0003.jpg",
]

prompt = (
    "Will the pedestrian located at [1103, 908, 1138, 980] in frame 1 cross the road? "
    "Justify your answer with spatial, temporal, mathematical, ego-vehicle, and "
    "scene-context reasoning. Conclude the pedestrian's motives."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frame_paths},
            {"type": "text", "text": prompt},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(output)

Replace the frame paths and target pedestrian box with the sample you want to evaluate.
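Because the observation window length depends on the source dataset, it can help to build the frame list from a directory rather than by hand. This is a rough sketch under stated assumptions: the `frame_{:04d}.jpg` naming simply mirrors the placeholder paths in the example above, and `observation_frame_uris` is a hypothetical helper, not part of `qwen_vl_utils`.

```python
from pathlib import Path

# Observation window lengths stated in the Input Format section.
OBSERVATION_FRAMES = {"JAAD": 15, "PIE": 15, "IDD-PeD": 15, "TITAN": 10}


def observation_frame_uris(frame_dir, dataset, start=1, pattern="frame_{:04d}.jpg"):
    """Return file:// URIs for one observation window, in temporal order.

    `pattern` is an assumed file-naming scheme; adapt it to how your frames
    were extracted.
    """
    count = OBSERVATION_FRAMES[dataset]
    root = Path(frame_dir).resolve()
    paths = [root / pattern.format(i) for i in range(start, start + count)]
    missing = [p.name for p in paths if not p.is_file()]
    if missing:
        # Fail early rather than pass a short window to the model.
        raise FileNotFoundError(f"missing frames in {root}: {missing}")
    return [p.as_uri() for p in paths]
```

The returned list can be assigned to `frame_paths` in the inference example, for instance `frame_paths = observation_frame_uris("/path/to/frames", "JAAD")`.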

Training Data

The model is fine-tuned on PedestrianQA, which is derived from four pedestrian datasets:

IDD-PeD
JAAD
PIE
TITAN

PedestrianQA contains 10,717 intention QA samples and 3,594 trajectory QA samples. Each sample includes structured rationales grounded in visual motion, pedestrian pose, ego-vehicle behavior, scene context, and geometric trajectory cues.

Users must acquire all necessary licenses and permissions for the source datasets before reconstructing or using the corresponding video inputs.

Evaluation

The paper evaluates PedestrianQA models on PIP, PTP, and rationale quality.

For the model fine-tuned on all PedestrianQA datasets, the reported overall results are:

| Task | Metric | Result |
| --- | --- | --- |
| PIP | Accuracy | 0.783 |
| PIP | F1 | 0.542 |
| PTP | ADE | 37 px |
| PTP | FDE | 68 px |
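ADE (average displacement error) and FDE (final displacement error) are reported in pixels. As a hedged illustration of how such numbers can be computed from predicted and ground-truth boxes, the sketch below measures Euclidean distance between box centers; the card does not state whether the paper uses centers or raw box coordinates, so that choice and the function names are assumptions.

```python
import math


def box_center(box):
    """Center (cx, cy) of a box given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) in pixels for one trajectory.

    ADE is the mean center distance over all future frames; FDE is the
    center distance at the last future frame.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [
        math.dist(box_center(p), box_center(g))
        for p, g in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances), distances[-1]


# Two future frames with center offsets of 5 px and 10 px.
ade, fde = displacement_errors(
    [[0, 0, 10, 10], [0, 0, 10, 10]],
    [[3, 4, 13, 14], [6, 8, 16, 18]],
)
print(ade, fde)  # 7.5 10.0
```

Dataset-level scores would then average these per-trajectory values over all evaluation samples.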

Rationale-generation scores for the same model:

| Rationale category | Score |
| --- | --- |
| Spatial Reasoning | 58.36 |
| Temporal Reasoning | 54.84 |
| Mathematical Reasoning | 51.68 |
| Ego-Vehicle Reasoning | 61.51 |
| Scene-Context Reasoning | 59.72 |
| Final Destination Prediction | 32.24 |
| Conclusion | 60.25 |

Limitations

The model is trained for short-horizon pedestrian behavior prediction and should not be treated as a complete autonomous-driving stack.
Predictions may be wrong in rare, occluded, low-visibility, out-of-domain, or distribution-shifted scenarios.
Generated rationales can be incomplete or hallucinated, especially when the visual evidence is ambiguous.
The model depends on accurate target-pedestrian localization in the prompt.
The model should not be used as the sole basis for safety-critical decisions.

License and Data Terms

This model is based on Qwen/Qwen2.5-VL-3B-Instruct; users must comply with the base model's license and terms.

The fine-tuning data is derived from IDD-PeD, JAAD, PIE, and TITAN. Users must also comply with all upstream dataset licenses and access requirements. Because some source datasets impose non-commercial or access-controlled terms, this model should be treated as a research-use model unless a separate license explicitly states otherwise.

Citation

If you use this model or the PedestrianQA dataset, please cite:

bibtex
@inproceedings{mishra2026pedestrianqa,
  title = {PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction},
  author = {Mishra, Naman and Gangisetty, Shankar and Jawahar, C. V.},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year = {2026},
  url = {https://github.com/botmahn/PedestrianQA}
}

Base model:

bibtex
@article{Qwen2.5-VL,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Zesen and Zhang, Hang and Yang, Zhibo and Xu, Haiyang and Lin, Junyang},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}



