License: apache-2.0
TL;DR
- Behavior ROUGE-L 0.911 — best of all variants (was 0.036 in the natural-distribution baseline at lr=2e-4).
- Perception ROUGE-L 0.615 — best of all variants.
- Prediction ROUGE-L 0.368 — worst of all post-LoRA variants (vs 0.659 for natural-distribution at lr=2e-4). Cost of forcing uniform answer-pattern proportions.
- Overall ROUGE-L 0.518 — a regression vs the simpler lr=1e-4 fix.
Eval results (3,770-sample DriveLM front-arc, vLLM)
| Metric | Baseline | This adapter (stratified) |
|---|---|---|
| ROUGE-1 | 0.166 | 0.524 |
| ROUGE-L | 0.157 | 0.518 |
| Token-F1 | 0.117 | 0.494 |
| Exact match | 0.4% | 36.6% |
| Mean per-request latency | 1,420 ms | 1,811 ms |
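For readers unfamiliar with the metrics in the table, here is a minimal sketch of how ROUGE-L, Token-F1, and exact match are commonly computed per sample. It assumes lowercasing, whitespace tokenization, and the F1 form of ROUGE-L; the evaluation script actually used for this adapter is not shown in this card, so the function names below are hypothetical and the scores it produces may differ slightly from the reported ones.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the real eval may tokenize differently.
    return text.lower().split()


def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming longest common subsequence, keeping one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def _f1(overlap: int, n_pred: int, n_ref: int) -> float:
    # Harmonic mean of precision and recall; zero overlap scores 0.
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_l(prediction: str, reference: str) -> float:
    # ROUGE-L: F1 over the longest common subsequence of tokens (order matters).
    p, r = _tokens(prediction), _tokens(reference)
    return _f1(lcs_length(p, r), len(p), len(r))


def token_f1(prediction: str, reference: str) -> float:
    # Token-F1: F1 over the bag-of-tokens overlap (order ignored).
    p, r = _tokens(prediction), _tokens(reference)
    overlap = sum((Counter(p) & Counter(r)).values())
    return _f1(overlap, len(p), len(r))


def exact_match(prediction: str, reference: str) -> bool:
    # Assumption: case- and surrounding-whitespace-insensitive string equality.
    return prediction.strip().lower() == reference.strip().lower()
```

Because many DriveLM answers are short ("Yes", "No"), a wrong one-word answer scores 0 on all three metrics, which is why shifts in Yes/No base rates can move category-level ROUGE-L sharply.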
Per question category (ROUGE-L)
| Category | N | Baseline | This adapter | Δ vs natural-lr2e4 |
|---|---|---|---|---|
| perception | 1,738 | 0.217 | 0.615 | +0.127 ↑ |
| prediction | 1,181 | 0.097 | 0.368 | −0.291 ↓ |
| planning | 813 | 0.107 | 0.507 | +0.005 |
| behavior | 38 | 0.305 | 0.911 | +0.875 ↑↑ |
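As a sanity check, the per-category scores above can be combined into the overall ROUGE-L. The sketch below assumes the overall figure is a per-sample mean, so each category is weighted by its N; the variable names are illustrative only.

```python
# (N, ROUGE-L for this adapter), copied from the per-category table.
categories = {
    "perception": (1738, 0.615),
    "prediction": (1181, 0.368),
    "planning": (813, 0.507),
    "behavior": (38, 0.911),
}

total_n = sum(n for n, _ in categories.values())

# Sample-weighted mean: large categories dominate the overall score.
overall = sum(n * score for n, score in categories.values()) / total_n

# Unweighted mean across categories, for contrast.
macro = sum(score for _, score in categories.values()) / len(categories)

print(f"N = {total_n}, weighted = {overall:.3f}, macro = {macro:.3f}")
```

The weighted mean comes out at about 0.517, which agrees with the reported 0.518 up to rounding of the per-category scores. The unweighted mean is about 0.600, which shows how little the 38-sample behavior category contributes to the overall number compared with the 1,181-sample prediction category.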
Why the prediction regression
DriveLM's prediction-category eval set has a natural distribution of ~2/38/16/44% Yes/No/None-pattern/other. The natural-1024 training data matched that distribution almost exactly (7/119/49/136). Uniform stratification forced 50/50/50/100, so the model never learned that "No" is far more likely than "Yes" for prediction (119 vs 7 in the natural-1024 data, about 17×), and at inference it gives wrong base rates. The 0.29 ROUGE-L drop on a 1,181-sample category is the cost.
The proportional sibling addresses this by preserving natural within-category proportions with a min-floor on rare classes.
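The base-rate shift described above can be made concrete with a short sketch. The counts come from this card; `proportional_with_floor`, the budget of 250, and the floor of 20 are hypothetical choices for illustration and are not the proportional sibling's actual recipe.

```python
# Prediction-category answer-pattern counts from this card.
natural = {"Yes": 7, "No": 119, "None-ptn": 49, "other": 136}
stratified = {"Yes": 50, "No": 50, "None-ptn": 50, "other": 100}


def proportions(counts: dict[str, int]) -> dict[str, float]:
    # Share of each answer pattern within the category.
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def proportional_with_floor(counts: dict[str, int], budget: int, floor: int) -> dict[str, int]:
    # Hypothetical allocation: scale natural proportions to the budget,
    # then lift any class below `floor` up to `floor`.
    # Note: the floor can push the total slightly above `budget`.
    total = sum(counts.values())
    return {k: max(floor, round(budget * v / total)) for k, v in counts.items()}


for name, counts in [("natural", natural), ("stratified", stratified)]:
    shares = {k: f"{p:.0%}" for k, p in proportions(counts).items()}
    ratio = counts["No"] / counts["Yes"]
    print(f"{name}: {shares}, No/Yes ratio = {ratio:.1f}")

alloc = proportional_with_floor(natural, budget=250, floor=20)
print("proportional with floor:", alloc)
```

Under these assumed settings the rare "Yes" class is lifted to the floor while "No" remains several times more frequent, so the model still sees roughly the right ordering of base rates.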
Training Details
Identical to the natural-1024 sibling except for the sample composition:
| Category | Yes | No | None-ptn | other | Total |
|---|---|---|---|---|---|
| perception | 50 | 50 | — | 150 | 250 |
| prediction | 50 | 50 | 50 | 100 | 250 |
| planning | 50 | 50 | 25 | 125 | 250 |
| behavior | — | — | — | 38 × 4 = 152 | 152 |
| Total | | | | | 902 |

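The composition table can be checked with a few lines of arithmetic. This sketch treats the "—" cells as zero samples, which is an assumption, and the dictionary layout is illustrative only.

```python
# Sample composition from the table; "—" cells are assumed to mean 0 samples.
composition = {
    "perception": {"Yes": 50, "No": 50, "None-ptn": 0, "other": 150},
    "prediction": {"Yes": 50, "No": 50, "None-ptn": 50, "other": 100},
    "planning": {"Yes": 50, "No": 50, "None-ptn": 25, "other": 125},
    "behavior": {"Yes": 0, "No": 0, "None-ptn": 0, "other": 38 * 4},
}

# Per-category totals and the grand total.
row_totals = {cat: sum(cells.values()) for cat, cells in composition.items()}
grand_total = sum(row_totals.values())

print(row_totals)
print("total samples:", grand_total)
```

The row totals come to 250, 250, 250, and 152, and the grand total to 902, matching the table.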
| Knob | Value |
|---|---|
| Learning rate | 2e-4 |
| Epochs | 1 |
| Final epoch-avg loss | 0.440 |
| Training wall clock | ~18 minutes |
The lesson
Failure mode analysis pointed at data, but the actual fix was a hyperparameter. The behavior collapse we attributed to "only 10 behavior samples in training" was mostly the lr=2e-4 being slightly too aggressive — with lr=1e-4, those 10 samples are enough to keep behavior at 0.877. The data-side stratification fixed behavior even harder but at the cost of prediction. Net: the LR fix is the better lever.
Limitations
Same as the series.
License
Apache-2.0.
Model provider
pranavthombare
Model tree
Base: Qwen/Qwen3.5-0.8B
Adapter: this model
Modalities
Input: Video, Text, Image
Output: Text