Model details
- Base model: Qwen/Qwen3-VL-4B-Instruct (
Qwen3VLForConditionalGeneration)
- Architecture: hidden size 2560, 36 layers
- Modality: image/video + text → text
- Task: future trajectory (waypoint) prediction on nuScenes
- Output: predicted waypoints (no chain-of-thought)
Training
- Fine-tuned on a nuScenes VQA trajectory-only dataset
- Epochs: 30 (10,980 optimizer steps)
- Max sequence length: 2048
Usage
For dataset preparation, prompting, inference, and evaluation, follow the instructions in the project repository:
https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving
from transformers import AutoModelForImageTextToText, AutoProcessor
model_id = "stanfordasl/nuscenes-waypoints-model"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
Intended use & limitations
This model is a research artifact for autonomous-driving planning experiments. It was trained on nuScenes and is not intended for deployment in real vehicles or safety-critical settings. Outputs may be inaccurate or unsafe; always validate in simulation before any downstream use.
Citation
If you use this model, please cite the RnB-EnCoRe self-driving work: