pranavthombare

qwen3.5-0.8b-drivelm-lora-proportional

License: apache-2.0

Eval results (3,770-sample DriveLM front-arc, vLLM)

Metric	Baseline	This adapter (prop + lr=1e-4)	Δ
ROUGE-1	0.166	0.627	+0.461
ROUGE-2	0.069	0.257	+0.188
ROUGE-L	0.157	0.621	+0.464
Token-F1	0.117	0.602	+0.485
Exact match	0.4%	47.4%	+47.0 pp
Mean per-request latency	1,420 ms	1,858 ms	+438 ms
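
The card does not say exactly how Token-F1 and exact match were computed. As a rough illustration only, the sketch below uses the common SQuAD-style definition over lowercased, whitespace-split tokens; the normalization and the function names are assumptions, not the evaluation code behind the table above.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(pred: str, ref: str) -> bool:
    # True when prediction and reference agree after normalization.
    return normalize(pred) == normalize(ref)


def token_f1(pred: str, ref: str) -> float:
    # Harmonic mean of token precision and recall (bag-of-tokens overlap).
    p, r = normalize(pred), normalize(ref)
    if not p or not r:
        # Two empty strings count as a match; one empty string does not.
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a prediction that gets four of five tokens right against a five-token reference scores 0.8, while exact match stays at zero, which is one reason the two metrics can diverge.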

Per question category (ROUGE-L)

Category	N	Baseline	This adapter
perception	1,738	0.217	0.625 ⭐
prediction	1,181	0.097	0.682
planning	813	0.107	0.543 ⭐
behavior	38	0.305	0.201

Best-of-series for three of four categories. Behavior is the trade-off (next section).
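
As a consistency check, the overall ROUGE-L figures can be recovered from the per-category table, assuming the overall score is a plain per-sample mean (so each category is weighted by its N). The per-category values are rounded to three decimals, so agreement is only expected to within rounding.

```python
# (N, baseline ROUGE-L, adapter ROUGE-L) copied from the per-category table.
rows = {
    "perception": (1738, 0.217, 0.625),
    "prediction": (1181, 0.097, 0.682),
    "planning": (813, 0.107, 0.543),
    "behavior": (38, 0.305, 0.201),
}

total = sum(n for n, _, _ in rows.values())
baseline = sum(n * b for n, b, _ in rows.values()) / total
adapter = sum(n * a for n, _, a in rows.values()) / total

print(total, round(baseline, 3), round(adapter, 3))
# prints: 3770 0.157 0.621
```

Behavior contributes only 38 of 3,770 samples, which is why its drop barely moves the overall number.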

Position in the ablation series

Config	Sampling	lr	Overall RL	Perception	Prediction	Planning	Behavior
nat 2e-4	natural	2e-4	0.541	0.489	0.659	0.502	0.036
nat 1e-4	natural	1e-4	0.581

Different configs suit different production targets:

For behavior-heavy use cases (ego-status, predictability) → use nat 1e-4
For overall quality + perception/prediction/planning → use this adapter (prop 1e-4)

The trade-off: why behavior is 0.201 here vs 0.877 in nat 1e-4

Proportional sampling injects all 38 behavior samples × 4 upsample = 152 instances into training — identical to the uniform-stratified variant. So the behavior gradient signal is the same.

The difference is in the competing other-category gradients. Proportional sampling preserves the natural answer-pattern distribution within perception/prediction/planning (e.g. prediction stays No-heavy at 85/15/40/110 instead of forced 50/50/50/100). This is harder to fit — the LoRA's r=8 capacity gets pulled toward the dominant patterns of the larger categories. The 152 behavior signals get partially crowded out.

A weighted variant with behavior upsample 8× or 12× would likely close the behavior gap while keeping the overall wins. That's the obvious next experiment.
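
The upsampling arithmetic can be sketched as follows. This is a toy illustration, not the sampler in src/data/pipeline.py: the `upsample` helper, the sample dictionaries, and the placeholder "other" category are all hypothetical, and only the count of 38 behavior samples comes from this card.

```python
from collections import Counter


def upsample(samples: list[dict], category: str, factor: int) -> list[dict]:
    # Repeat every sample of `category` `factor` times; keep the rest once.
    out = []
    for sample in samples:
        copies = factor if sample["category"] == category else 1
        out.extend([sample] * copies)
    return out


# Toy data: 38 behavior samples plus a few placeholder samples.
samples = [{"category": "behavior", "id": i} for i in range(38)]
samples += [{"category": "other", "id": i} for i in range(10)]

for factor in (4, 8, 12):
    counts = Counter(s["category"] for s in upsample(samples, "behavior", factor))
    print(factor, counts["behavior"])
# prints: 4 152, then 8 304, then 12 456
```

At 8× or 12× the behavior category would contribute 304 or 456 training instances instead of 152, while the other categories are left untouched.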

Training Details

Base model	`Qwen/Qwen3.5-0.8B`
Adapter type	QLoRA (NF4 4-bit base + LoRA r=8)
LoRA rank / alpha	8 / 16
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
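
For orientation, the settings in the table map onto PEFT and transformers configuration objects roughly as below. This is a sketch, not the training script: the compute dtype is an assumption, and anything the table does not list (dropout, bias handling, double quantization) is left at library defaults.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# NF4 4-bit quantization of the base model, as stated in the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in the card
)

# LoRA rank 8, alpha 16, applied to the attention and MLP projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```

With alpha 16 and rank 8, the LoRA update is scaled by alpha / r = 2.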

Reproducing this adapter

```bash
DRIVELM_TRAIN__SAMPLING=proportional \
DRIVELM_TRAIN__LR=1e-4 \
DRIVELM_TRAIN__OUTPUT_DIR=models/qwen-lora-prop-lr1e4 \
.venv/bin/python src/train/finetune.py
```

The proportional sampler is in src/data/pipeline.py::proportional_samples.

Usage

```python
from peft import PeftModel
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the base model and its processor.
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)

# Attach the LoRA adapter weights and switch to inference mode.
model = PeftModel.from_pretrained(base, "pranavthombare/qwen3.5-0.8b-drivelm-lora-proportional").eval()
```

Limitations

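Train/eval overlap. Training set is a subset of the eval set, so the reported scores include samples seen during training and likely overstate performance on unseen data.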
Behavior trade-off. This adapter scores 0.201 on behavior vs 0.877 for the lr=1e-4 natural sibling. Choose the right adapter for your use case.
No referent-token grounding (<c1,CAM_FRONT,x,y> ignored).
No CAN-bus signal access for behavior ego-velocity attributes.
nuScenes-mini scope — 38 frames, 6 scenes, daylight bias.
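
To make the referent-token limitation concrete, the sketch below shows one way to pull `<c1,CAM_FRONT,x,y>` tags out of a question string, for example to inspect or strip them before prompting. The regular expression, the helper name, and the sample question are assumptions for illustration; the adapter itself does not use these coordinates.

```python
import re

# Matches tags of the form <c1,CAM_FRONT,x,y> with numeric x and y.
REFERENT = re.compile(r"<(c\d+),(CAM_[A-Z_]+),([\d.]+),([\d.]+)>")


def extract_referents(text: str) -> list[tuple[str, str, float, float]]:
    # Return (object id, camera name, x, y) for every tag found in the text.
    return [
        (m[1], m[2], float(m[3]), float(m[4]))
        for m in REFERENT.finditer(text)
    ]


# Hypothetical question string, not taken from the dataset.
question = "What is the moving status of <c1,CAM_FRONT,714.3,503.6>?"
print(extract_referents(question))
# prints: [('c1', 'CAM_FRONT', 714.3, 503.6)]
```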

License

Apache-2.0.

Framework versions

PEFT 0.19.1
transformers (HuggingFace main as of training date)
bitsandbytes 0.49.2
