Eval results (3,770-sample DriveLM front-arc, vLLM)
| Metric | Baseline | This adapter (prop + lr=1e-4) | Δ |
|---|---|---|---|
| ROUGE-1 | 0.166 | 0.627 | +0.461 |
| ROUGE-2 | 0.069 | 0.257 | +0.188 |
| ROUGE-L | 0.157 | 0.621 | +0.464 |
| Token-F1 | 0.117 | 0.602 | +0.485 |
| Exact match | 0.4% | 47.4% | +47.0 pp |
| Mean per-request latency | 1,420 ms | 1,858 ms | +438 ms |
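The card does not say how Token-F1 and exact match are computed. As a rough guide to reading those two rows, here is a minimal sketch that assumes the common SQuAD-style definitions (lowercased whitespace tokens, bag-of-tokens overlap). The function names `token_f1` and `exact_match` are illustrative and are not taken from the evaluation code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (assumed definition)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


if __name__ == "__main__":
    print(token_f1("the car is turning left", "the car turns left"))
    print(exact_match("Yes.", "yes."))
```

Under these definitions a near-miss answer still earns partial Token-F1 credit but scores zero on exact match, which is why the two metrics can diverge.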
Per question category (ROUGE-L)
| Category | N | Baseline | This adapter |
|---|---|---|---|
| perception | 1,738 | 0.217 | 0.625 ⭐ |
| prediction | 1,181 | 0.097 | 0.682 |
| planning | 813 | 0.107 | 0.543 ⭐ |
| behavior | 38 | 0.305 | 0.201 |
Best-of-series for three of four categories. Behavior is the trade-off (next section).
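The per-category scores can be cross-checked against the overall ROUGE-L figures. The sketch below assumes the overall score is the sample-weighted mean over categories (a micro-average); the card does not state this, and the inputs are rounded to three decimals, so agreement is only expected to the same precision.

```python
# (category, N, baseline ROUGE-L, adapter ROUGE-L), copied from the table above.
rows = [
    ("perception", 1738, 0.217, 0.625),
    ("prediction", 1181, 0.097, 0.682),
    ("planning", 813, 0.107, 0.543),
    ("behavior", 38, 0.305, 0.201),
]

total_n = sum(n for _, n, _, _ in rows)

# Weight each category score by its share of the evaluation samples.
baseline_overall = sum(n * base for _, n, base, _ in rows) / total_n
adapter_overall = sum(n * adapter for _, n, _, adapter in rows) / total_n

print(total_n)                     # 3770
print(round(baseline_overall, 3))  # 0.157
print(round(adapter_overall, 3))   # 0.621
```

Both values match the overall ROUGE-L row. Behavior contributes only 38 of 3,770 samples, so its regression barely moves the aggregate.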
Position in the ablation series
| Config | Sampling | lr | Overall ROUGE-L | Perception | Prediction | Planning | Behavior |
|---|---|---|---|---|---|---|---|
| nat 2e-4 | natural | 2e-4 | 0.541 | 0.489 | 0.659 | 0.502 | 0.036 |
| nat 1e-4 | natural | 1e-4 | 0.581 |
Different configs win different production targets:
- For behavior-heavy use cases (ego-status, predictability) → use `nat 1e-4`
- For overall quality + perception/prediction/planning → use this adapter (`prop 1e-4`)
The trade-off: why behavior is 0.201 here vs 0.877 in nat 1e-4
Proportional sampling injects all 38 behavior samples × 4 upsample = 152 instances into training — identical to the uniform-stratified variant. So the behavior gradient signal is the same.
The difference is in the competing other-category gradients. Proportional sampling preserves the natural answer-pattern distribution within perception/prediction/planning (e.g. prediction stays No-heavy at 85/15/40/110 instead of forced 50/50/50/100). This is harder to fit — the LoRA's r=8 capacity gets pulled toward the dominant patterns of the larger categories. The 152 behavior signals get partially crowded out.
A weighted variant with behavior upsample 8× or 12× would likely close the behavior gap while keeping the overall wins. That's the obvious next experiment.
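To make the upsampling arithmetic concrete, here is a minimal sketch of per-category upsampling. It is not the code in `src/data/pipeline.py`: the helper `upsample`, the sample dictionaries, and the 400 placeholder "other" samples are all hypothetical. Only the 38 behavior samples and the 4×, 8×, and 12× factors come from the text.

```python
from collections import Counter


def upsample(samples, factors):
    """Repeat each sample factors.get(category, 1) times (hypothetical helper)."""
    out = []
    for sample in samples:
        out.extend([sample] * factors.get(sample["category"], 1))
    return out


# Toy pool: 38 behavior samples as in the text. The 400 "other" samples are a
# placeholder, not the real size of the training set.
pool = [{"category": "behavior", "id": i} for i in range(38)]
pool += [{"category": "other", "id": i} for i in range(400)]

for factor in (4, 8, 12):
    mixed = upsample(pool, {"behavior": factor})
    counts = Counter(s["category"] for s in mixed)
    share = counts["behavior"] / len(mixed)
    print(f"behavior x{factor}: {counts['behavior']} instances, {share:.1%} of the mix")
```

With 38 behavior samples, factors of 4, 8, and 12 give 152, 304, and 456 behavior instances. The share of the training mix depends on the real size of the other categories, which the placeholder does not reproduce.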
Training Details
| | |
|---|---|
| Base model | Qwen/Qwen3.5-0.8B |
| Adapter type | QLoRA (NF4 4-bit base + LoRA r=8) |
| LoRA rank / alpha | 8 / 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
Reproducing this adapter
```shell
DRIVELM_TRAIN__SAMPLING=proportional \
DRIVELM_TRAIN__LR=1e-4 \
DRIVELM_TRAIN__OUTPUT_DIR=models/qwen-lora-prop-lr1e4 \
.venv/bin/python src/train/finetune.py
```
The proportional sampler is in src/data/pipeline.py::proportional_samples.
Usage
```python
from peft import PeftModel
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the base model and its processor, then attach the LoRA adapter on top.
base = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "pranavthombare/qwen3.5-0.8b-drivelm-lora-proportional").eval()
```
Limitations
- Train/eval overlap. Training set is a subset of the eval set.
- Behavior trade-off. This adapter scores 0.201 on behavior vs 0.877 for the lr=1e-4 natural sibling. Choose the right adapter for your use case.
- No referent-token grounding (`<c1,CAM_FRONT,x,y>` ignored).
- No CAN-bus signal access for behavior ego-velocity attributes.
- nuScenes-mini scope — 38 frames, 6 scenes, daylight bias.
License
Apache-2.0.
Framework versions
- PEFT 0.19.1
- transformers (HuggingFace `main` as of training date)
- bitsandbytes 0.49.2