stanfordasl/nuscenes-full-reasoning-waypoints API & Inference Endpoint

Model details

Base model: Qwen/Qwen3-VL-4B-Instruct (Qwen3VLForConditionalGeneration)
Architecture: hidden size 2560, 36 layers
Modality: image/video + text → text
Task: full driving-scene reasoning + future trajectory (waypoint) prediction on nuScenes
Output: full chain-of-thought driving rationale followed by predicted waypoints

Training

Fine-tuned on a nuScenes full-trace VQA-driver dataset (full reasoning + trajectory targets)
Epochs: 30 (10,980 optimizer steps)
Max sequence length: 6144
Learning rate: 5e-5

Usage

For dataset preparation, prompting, inference, and evaluation, follow the instructions in the project repository: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "stanfordasl/nuscenes-full-reasoning-waypoints"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat message with the driving camera image(s) + prompt,
# then processor.apply_chat_template(...) and model.generate(...).
# See the GitHub repo for the exact prompt format and post-processing.

Intended use & limitations

This model is a research artifact for autonomous-driving perception and planning experiments. It was trained on nuScenes and is not intended for deployment in real vehicles or safety-critical settings. Outputs may be inaccurate or unsafe; always validate in simulation before any downstream use.

Citation

If you use this model, please cite the RnB-EnCoRe self-driving work:

Paper: https://arxiv.org/abs/2602.08167
Code: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

Model details

Base model: Qwen/Qwen3-VL-4B-Instruct (Qwen3VLForConditionalGeneration)
Architecture: hidden size 2560, 36 layers
Modality: image/video + text → text
Task: full driving-scene reasoning + future trajectory (waypoint) prediction on nuScenes
Output: full chain-of-thought driving rationale followed by predicted waypoints

Training

Fine-tuned on a nuScenes full-trace VQA-driver dataset (full reasoning + trajectory targets)
Epochs: 30 (10,980 optimizer steps)
Max sequence length: 6144
Learning rate: 5e-5

Usage

For dataset preparation, prompting, inference, and evaluation, follow the instructions in the project repository: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "stanfordasl/nuscenes-full-reasoning-waypoints"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat message with the driving camera image(s) + prompt,
# then processor.apply_chat_template(...) and model.generate(...).
# See the GitHub repo for the exact prompt format and post-processing.

Intended use & limitations

Citation

If you use this model, please cite the RnB-EnCoRe self-driving work:

Paper: https://arxiv.org/abs/2602.08167
Code: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

nuscenes-full-reasoning-waypoints

Get help setting up a custom Dedicated Endpoints.

README

Model details

Training

Usage

Intended use & limitations

Citation

Explore FriendliAI today

README

Model details

Training

Usage

Intended use & limitations

Citation