stanfordasl/nuscenes-waypoints-model API & Inference Endpoint

Model details

Base model: Qwen/Qwen3-VL-4B-Instruct (Qwen3VLForConditionalGeneration)
Architecture: hidden size 2560, 36 layers
Modality: image/video + text → text
Task: future trajectory (waypoint) prediction on nuScenes
Output: predicted waypoints (no chain-of-thought)

Training

Fine-tuned on a nuScenes VQA trajectory-only dataset
Epochs: 30 (10,980 optimizer steps)
Max sequence length: 2048

Usage

For dataset preparation, prompting, inference, and evaluation, follow the instructions in the project repository: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "stanfordasl/nuscenes-waypoints-model"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat message with the driving camera image(s) + prompt,
# then processor.apply_chat_template(...) and model.generate(...).
# See the GitHub repo for the exact prompt format and post-processing.

Intended use & limitations

This model is a research artifact for autonomous-driving planning experiments. It was trained on nuScenes and is not intended for deployment in real vehicles or safety-critical settings. Outputs may be inaccurate or unsafe; always validate in simulation before any downstream use.

Citation

If you use this model, please cite the RnB-EnCoRe self-driving work:

Paper: https://arxiv.org/abs/2602.08167
Code: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

Model details

Base model: Qwen/Qwen3-VL-4B-Instruct (Qwen3VLForConditionalGeneration)
Architecture: hidden size 2560, 36 layers
Modality: image/video + text → text
Task: future trajectory (waypoint) prediction on nuScenes
Output: predicted waypoints (no chain-of-thought)

Training

Fine-tuned on a nuScenes VQA trajectory-only dataset
Epochs: 30 (10,980 optimizer steps)
Max sequence length: 2048

Usage

For dataset preparation, prompting, inference, and evaluation, follow the instructions in the project repository: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "stanfordasl/nuscenes-waypoints-model"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat message with the driving camera image(s) + prompt,
# then processor.apply_chat_template(...) and model.generate(...).
# See the GitHub repo for the exact prompt format and post-processing.

Intended use & limitations

Citation

If you use this model, please cite the RnB-EnCoRe self-driving work:

Paper: https://arxiv.org/abs/2602.08167
Code: https://github.com/rnb-encore/RnB-EnCoRe-SelfDriving

nuscenes-waypoints-model

Get help setting up a custom Dedicated Endpoints.

README

Model details

Training

Usage

Intended use & limitations

Citation

Explore FriendliAI today

README

Model details

Training

Usage

Intended use & limitations

Citation