pranavthombare

qwen3.5-0.8b-drivelm-lora-stratified

License: apache-2.0

TL;DR

Behavior ROUGE-L: 0.911, the best of all variants (up from 0.036 for the natural-distribution baseline at lr=2e-4).
Perception ROUGE-L: 0.615, the best of all variants.
Prediction ROUGE-L: 0.368, the worst of all post-LoRA variants (vs 0.659 for natural-distribution at lr=2e-4). This is the cost of forcing uniform answer-pattern proportions.
Overall ROUGE-L: 0.518, a regression vs the simpler lr=1e-4 fix.

Eval results (3,770-sample DriveLM front-arc, vLLM)

Metric	Baseline	This adapter (stratified)
ROUGE-1	0.166	0.524
ROUGE-L	0.157	0.518
Token-F1	0.117	0.494
Exact match	0.4%	36.6%
Mean per-request latency	1,420 ms	1,811 ms
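
For readers unfamiliar with the metrics in this table, the sketch below shows how ROUGE-L, token-F1 and exact match are commonly computed for a single prediction/reference pair. It assumes lowercased whitespace tokenization and the F1 form of ROUGE-L. The evaluation script, tokenizer and normalization actually used for this adapter are not specified here, so the function names are illustrative only.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the real eval may normalize differently.
    return text.lower().split()


def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming longest common subsequence, keeping one row at a time.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    # ROUGE-L as the harmonic mean of LCS-based precision and recall.
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def token_f1(prediction: str, reference: str) -> float:
    # Bag-of-tokens overlap, ignoring word order.
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> bool:
    return tokenize(prediction) == tokenize(reference)
```

Short answers such as "Yes" or "No" score either 0 or 1 on all three metrics under this definition, which is why a shift in answer base rates can move a category's average so sharply.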

Per question category (ROUGE-L)

Category	N	Baseline	This adapter	Δ vs natural-lr2e4
perception	1,738	0.217	0.615	+0.127 ↑
prediction	1,181	0.097	0.368	−0.291 ↓
planning	813	0.107	0.507	+0.005
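
As a sanity check, the per-category scores can be combined into a sample-weighted average and compared with the overall ROUGE-L of 0.518. The sketch assumes that the 38 samples not listed in the table (3,770 minus 1,738 + 1,181 + 813) all belong to the behavior category, scored at the 0.911 quoted in the TL;DR. That count is inferred here, not reported.

```python
# Per-category (N, ROUGE-L) for this adapter, taken from the table above.
categories = {
    "perception": (1738, 0.615),
    "prediction": (1181, 0.368),
    "planning": (813, 0.507),
}

TOTAL_SAMPLES = 3770
# Assumption: the samples missing from the table are all "behavior".
behavior_n = TOTAL_SAMPLES - sum(n for n, _ in categories.values())
categories["behavior"] = (behavior_n, 0.911)

# Weight each category score by its sample count.
weighted_sum = sum(n * score for n, score in categories.values())
overall = weighted_sum / TOTAL_SAMPLES

print(f"inferred behavior N: {behavior_n}")
print(f"sample-weighted ROUGE-L: {overall:.3f}")
```

Under that assumption the weighted average comes out at about 0.517, within rounding of the reported 0.518. It also shows why the small behavior category barely moves the overall score, while the 1,181-sample prediction category dominates it.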

Why the prediction regression

DriveLM's prediction-category eval set has a natural answer distribution of roughly 2/38/16/44% across Yes/No/None-pattern/other. The natural-1024 training data matched that distribution almost exactly (7/119/49/136 samples). Uniform stratification instead forced 50/50/50/100, so the model never learned that "No" is far more likely than "Yes" for prediction (about 17× in the natural counts, versus 1× after stratification), and at inference it applies the wrong base rates. The 0.29 ROUGE-L drop on a 1,181-sample category is the cost.
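
The mismatch is easy to see by normalizing the two sets of counts quoted above. A minimal sketch, using only the numbers in this section (variable names are illustrative):

```python
PATTERNS = ["Yes", "No", "None-pattern", "other"]

# Prediction-category counts quoted in this section.
natural_counts = [7, 119, 49, 136]     # natural-1024 training data
stratified_counts = [50, 50, 50, 100]  # uniform stratification


def proportions(counts: list[int]) -> list[float]:
    total = sum(counts)
    return [c / total for c in counts]


natural = proportions(natural_counts)
stratified = proportions(stratified_counts)

for name, p_nat, p_strat in zip(PATTERNS, natural, stratified):
    print(f"{name:>12}: natural {p_nat:6.1%}  stratified {p_strat:6.1%}")

# Ratio of "No" to "Yes" answers seen during training.
no_yes_natural = natural_counts[1] / natural_counts[0]
no_yes_stratified = stratified_counts[1] / stratified_counts[0]
print(f"No:Yes ratio  natural {no_yes_natural:.1f}  stratified {no_yes_stratified:.1f}")
```

The natural counts normalize to roughly 2/38/16/44%, while the stratified set is 20/20/20/40%, so "Yes" answers are about nine times over-represented relative to what the model sees at evaluation time.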

The proportional sibling addresses this by preserving natural within-category proportions with a min-floor on rare classes.
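
One way to implement "natural proportions with a min-floor" is sketched below. This illustrates the idea and is not the sibling adapter's actual sampling code: the function name, the floor of 20 and the budget of 250 are hypothetical values chosen for the example.

```python
def proportional_with_floor(natural_counts: dict[str, int],
                            budget: int,
                            floor: int) -> dict[str, int]:
    """Split `budget` samples across classes in proportion to their natural
    counts, while guaranteeing every class at least `floor` samples."""
    if any(c <= 0 for c in natural_counts.values()):
        raise ValueError("natural counts must be positive")
    if floor * len(natural_counts) > budget:
        raise ValueError("budget too small to give every class the floor")

    # Pin classes whose proportional share falls below the floor, then
    # re-split what is left of the budget among the remaining classes.
    pinned: set[str] = set()
    while True:
        free = {k: c for k, c in natural_counts.items() if k not in pinned}
        remaining = budget - floor * len(pinned)
        free_total = sum(free.values())
        shares = {k: remaining * c / free_total for k, c in free.items()}
        too_small = {k for k, s in shares.items() if s < floor}
        if not too_small:
            break
        pinned |= too_small

    # Round down, then hand out leftover samples by largest fractional part.
    alloc = {k: floor for k in pinned}
    alloc.update({k: int(s) for k, s in shares.items()})
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda k: shares[k] - int(shares[k]),
                          reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Natural prediction counts from this README; budget and floor are hypothetical.
example = proportional_with_floor(
    {"Yes": 7, "No": 119, "None-pattern": 49, "other": 136},
    budget=250,
    floor=20,
)
print(example)
```

With these illustrative settings only the rare "Yes" class is lifted to the floor, and the other three classes keep their relative proportions, so the No:Yes imbalance is reduced without being flattened to 1:1.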

Training Details

Identical to the natural-1024 sibling except for the sample composition:

Category	Yes	No	None-ptn	other	Total
perception	50	50	—	150	250
prediction	50	50	50	100	250
planning	50	50
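
A fixed-quota composition like the one in this table can be drawn with a simple per-cell sampler. The sketch below is an assumption about how such sampling might be done, not the training pipeline used here: the record keys "category" and "pattern", the function name and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, quotas, seed=0):
    """Draw a fixed number of records per (category, pattern) cell.

    `records` is an iterable of dicts with hypothetical keys "category" and
    "pattern"; `quotas` maps (category, pattern) -> desired sample count.
    """
    rng = random.Random(seed)

    # Bucket every record by its (category, pattern) cell.
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["category"], rec["pattern"])].append(rec)

    sample = []
    for cell, quota in quotas.items():
        pool = cells.get(cell, [])
        # A cell with fewer records than its quota contributes everything it has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))

    # Shuffle so training batches are not ordered by cell.
    rng.shuffle(sample)
    return sample


# Quotas for the prediction row of the table above.
prediction_quotas = {
    ("prediction", "Yes"): 50,
    ("prediction", "No"): 50,
    ("prediction", "None-ptn"): 50,
    ("prediction", "other"): 100,
}
```

Because the quotas are set independently of how often each pattern occurs naturally, rare cells are heavily up-weighted. That is the property that helps the small behavior category and hurts prediction base rates.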

Knob	Value
Learning rate	2e-4
Epochs	1
Final epoch-avg loss	0.440
Training wall clock	~18 minutes

The lesson

Failure-mode analysis pointed at the data, but the actual fix was a hyperparameter. The behavior collapse we attributed to "only 10 behavior samples in training" was mostly caused by lr=2e-4 being slightly too aggressive: with lr=1e-4, those 10 samples are enough to keep behavior at 0.877. Data-side stratification pushed behavior even higher (0.911), but at the cost of prediction. Net: the LR fix is the better lever.

Limitations

Same as the series.

License

Apache-2.0.

Model Details

Model Provider

pranavthombare

Model Tree

Base

Qwen/Qwen3.5-0.8B

Adapter

this model

Input Modalities

Text

Image

Video

Output Modalities