TL;DR
- Behavior ROUGE-L 0.911 — best of all variants (was 0.036 in the natural-distribution baseline at lr=2e-4).
- Perception ROUGE-L 0.615 — best of all variants.
- Prediction ROUGE-L 0.368 — worst of all post-LoRA variants (vs 0.659 for natural-distribution at lr=2e-4). Cost of forcing uniform answer-pattern proportions.
- Overall ROUGE-L 0.518 — a regression vs the simpler lr=1e-4 fix.
Eval results (3,770-sample DriveLM front-arc, vLLM)
| Metric | Baseline | This adapter (stratified) |
|---|---|---|
| ROUGE-1 | 0.166 | 0.524 |
| ROUGE-L | 0.157 | 0.518 |
| Token-F1 | 0.117 | 0.494 |
| Exact match | 0.4% | 36.6% |
| Mean per-request latency | 1,420 ms | 1,811 ms |
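For readers unfamiliar with the metrics in this table, the following is a minimal sketch of how ROUGE-L and Token-F1 are commonly computed for a single prediction/reference pair. It assumes lowercased whitespace tokenization and the F-measure form of ROUGE-L; the evaluation script behind the numbers above may tokenize or aggregate differently, and the function names here are illustrative only.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def _f1(matches, n_pred, n_ref):
    """Harmonic mean of precision and recall given a match count."""
    if matches == 0:
        return 0.0
    precision = matches / n_pred
    recall = matches / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(prediction, reference):
    """ROUGE-L F-measure: matches are counted along the LCS, so order matters."""
    # ASSUMPTION: lowercased whitespace tokens, not the eval script's tokenizer.
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    return _f1(lcs_length(pred, ref), len(pred), len(ref))


def token_f1(prediction, reference):
    """Bag-of-tokens F1: matches are counted regardless of order."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return _f1(overlap, len(pred), len(ref))
```

The two metrics differ only in how matches are counted: reordering the tokens of a correct answer leaves Token-F1 unchanged but lowers ROUGE-L.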
Per question category (ROUGE-L)
| Category | N | Baseline | This adapter | Δ vs natural-lr2e4 |
|---|---|---|---|---|
| perception | 1,738 | 0.217 | 0.615 | +0.127 ↑ |
| prediction | 1,181 | 0.097 | 0.368 | −0.291 ↓ |
| planning | 813 | 0.107 | 0.507 | +0.005 |
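As a sanity check on how the per-category scores relate to the overall figure, the sketch below computes a sample-weighted average. It assumes that the samples not listed in the table (3,770 minus the three listed counts) are the behavior category scored at the 0.911 reported in the TL;DR, and that the overall ROUGE-L is a per-sample mean; neither is stated explicitly in this card.

```python
# Per-category (N, ROUGE-L) pairs taken from the table above.
categories = {
    "perception": (1738, 0.615),
    "prediction": (1181, 0.368),
    "planning": (813, 0.507),
}
TOTAL_SAMPLES = 3770

# ASSUMPTION: the remaining samples are the behavior category (0.911 in the TL;DR).
behavior_n = TOTAL_SAMPLES - sum(n for n, _ in categories.values())
categories["behavior"] = (behavior_n, 0.911)

# Sample-weighted mean of the per-category scores.
overall = sum(n * score for n, score in categories.values()) / TOTAL_SAMPLES

# How much the prediction delta (-0.291) moves the overall score on its own.
prediction_n = categories["prediction"][0]
prediction_drag = prediction_n / TOTAL_SAMPLES * 0.291

print(f"behavior N (inferred): {behavior_n}")
print(f"weighted overall ROUGE-L: {overall:.3f}")
print(f"overall points lost to the prediction delta: {prediction_drag:.3f}")
```

Under these assumptions the weighted mean lands within rounding distance of the reported 0.518, and the prediction delta alone accounts for roughly 0.09 of overall ROUGE-L because that category makes up almost a third of the eval set.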
Why the prediction regression
DriveLM's prediction-category eval set has a natural distribution of ~2/38/16/44% Yes/No/None-pattern/other. The natural-1024 training data matched that distribution almost exactly (7/119/49/136). Uniform stratification forced 50/50/50/100, so the model never learned that "No" is roughly 17× more likely than "Yes" for prediction (119 vs 7 in the natural sample), and at inference it answers with the wrong base rates. The 0.29 ROUGE-L drop on a 1,181-sample category is the cost.
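The shift in base rates can be made concrete with a few lines of arithmetic. This sketch uses only the counts quoted above for the prediction category; the variable names are illustrative.

```python
# Prediction-category answer-pattern counts quoted in this card.
natural_counts = {"Yes": 7, "No": 119, "None-pattern": 49, "other": 136}
stratified_counts = {"Yes": 50, "No": 50, "None-pattern": 50, "other": 100}


def proportions(counts):
    """Convert raw counts into fractions of the category total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


natural = proportions(natural_counts)
stratified = proportions(stratified_counts)

for label in natural_counts:
    print(f"{label:>12}: natural {natural[label]:6.1%}   stratified {stratified[label]:6.1%}")

# Ratio of "No" to "Yes" answers the model sees during training.
no_to_yes_natural = natural_counts["No"] / natural_counts["Yes"]
no_to_yes_stratified = stratified_counts["No"] / stratified_counts["Yes"]
print(f"No:Yes ratio, natural {no_to_yes_natural:.0f}:1, stratified {no_to_yes_stratified:.0f}:1")
```

Under uniform stratification "Yes" rises from about 2% to 20% of the prediction samples while "No" falls from about 38% to 20%, so the prior the model absorbs during training no longer matches the eval set.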
The proportional sibling addresses this by preserving natural within-category proportions with a min-floor on rare classes.
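One way such a scheme could work is sketched below. This is not the sibling adapter's actual sampler, which is not described here: the function name, the floor of 20, and the largest-remainder rounding are all assumptions chosen for illustration. Natural counts are assumed to be positive.

```python
def proportional_with_floor(natural_counts, budget, min_floor):
    """Split `budget` samples across classes in natural proportion,
    but give every class at least `min_floor` samples (hypothetical scheme)."""
    if min_floor * len(natural_counts) > budget:
        raise ValueError("budget too small to give every class the floor")

    alloc = {}
    remaining = dict(natural_counts)
    left = budget

    # Pin any class whose proportional share falls below the floor, then
    # recompute shares for the rest; repeat until no class is below the floor.
    while True:
        total = sum(remaining.values())
        low = [k for k, v in remaining.items() if left * v / total < min_floor]
        if not low:
            break
        for k in low:
            alloc[k] = min_floor
            left -= min_floor
            del remaining[k]

    # Split what is left proportionally, using largest-remainder rounding
    # so the allocations sum exactly to the budget.
    total = sum(remaining.values())
    exact = {k: left * v / total for k, v in remaining.items()}
    rounded = {k: int(x) for k, x in exact.items()}
    shortfall = left - sum(rounded.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - rounded[k], reverse=True)
    for k in by_remainder[:shortfall]:
        rounded[k] += 1

    alloc.update(rounded)
    return alloc


# Prediction-category counts from this card; budget matches the 250-per-category
# totals in the training table, and the floor of 20 is a made-up example value.
prediction_natural = {"Yes": 7, "No": 119, "None-pattern": 49, "other": 136}
allocation = proportional_with_floor(prediction_natural, budget=250, min_floor=20)
print(allocation)
```

With these example inputs the rare "Yes" class is lifted to the floor while "No" stays several times more frequent than "Yes", which keeps the base-rate signal that uniform stratification removed.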
Training Details
Identical to the natural-1024 sibling except for the sample composition:
| Category | Yes | No | None-ptn | other | Total |
|---|---|---|---|---|---|
| perception | 50 | 50 | — | 150 | 250 |
| prediction | 50 | 50 | 50 | 100 | 250 |
| planning | 50 | 50 |

| Knob | Value |
|---|---|
| Learning rate | 2e-4 |
| Epochs | 1 |
| Final epoch-avg loss | 0.440 |
| Training wall clock | ~18 minutes |
The lesson
Failure-mode analysis pointed at data, but the actual fix was a hyperparameter. The behavior collapse we attributed to "only 10 behavior samples in training" was mostly caused by lr=2e-4 being slightly too aggressive: with lr=1e-4, those 10 samples are enough to keep behavior at 0.877. The data-side stratification pushed behavior higher still (0.911) but at the cost of prediction. Net: the LR fix is the better lever.
Limitations
Same as the series.
License
Apache-2.0.