Model details
- Base model: Qwen/Qwen3-VL-4B-Instruct (
Qwen3VLForConditionalGeneration)
- Architecture: hidden size 2560, 36 layers
- Modality: image/video + text → text
- Task: full driving-scene reasoning + future trajectory (waypoint) prediction on nuScenes
- Output: full chain-of-thought driving rationale followed by predicted waypoints
Training
- Fine-tuned on a nuScenes full-trace VQA-driver dataset (full reasoning + trajectory targets)
- Epochs: 30 (10,980 optimizer steps)
- Max sequence length: 6144
- Learning rate: 5e-5
Usage
For dataset preparation, prompting, inference, and evaluation, follow the instructions in the project repository:
https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving
from transformers import AutoModelForImageTextToText, AutoProcessor
model_id = "stanfordasl/nuscenes-full-reasoning-waypoints"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
Intended use & limitations
This model is a research artifact for autonomous-driving perception and planning experiments. It was trained on nuScenes and is not intended for deployment in real vehicles or safety-critical settings. Outputs may be inaccurate or unsafe; always validate in simulation before any downstream use.
Citation
If you use this model, please cite the RnB-EnCoRe self-driving work: