License: apache-2.0

TL;DR

  • Behavior ROUGE-L 0.911 — best of all variants (was 0.036 in the natural-distribution baseline at lr=2e-4).
  • Perception ROUGE-L 0.615 — best of all variants.
  • Prediction ROUGE-L 0.368 — worst of all post-LoRA variants (vs 0.659 for natural-distribution at lr=2e-4). Cost of forcing uniform answer-pattern proportions.
  • Overall ROUGE-L 0.518 — a regression vs the simpler lr=1e-4 fix.

Eval results (3,770-sample DriveLM front-arc, vLLM)

| Metric | Baseline | This adapter (stratified) |
|---|---|---|
| ROUGE-1 | 0.166 | 0.524 |
| ROUGE-L | 0.157 | 0.518 |
| Token-F1 | 0.117 | 0.494 |
| Exact match | 0.4% | 36.6% |
| Mean per-request latency | 1,420 ms | 1,811 ms |

Per question category (ROUGE-L)

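| Category | N | Baseline | This adapter | Δ vs natural-lr2e4 |
|---|---|---|---|---|
| perception | 1,738 | 0.217 | 0.615 | +0.127 |
| prediction | 1,181 | 0.097 | 0.368 | −0.291 |
| planning | 813 | 0.107 | 0.507 | +0.005 |
| behavior | 38 | 0.305 | 0.911 | +0.875 ↑↑ |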
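
As a sanity check, the overall ROUGE-L should be roughly the sample-weighted mean of the per-category scores. The sketch below assumes that is how the overall figure was aggregated (the card does not state this) and uses only the rounded per-category values, so small rounding differences are expected.

```python
# Per-category (N, baseline ROUGE-L, adapter ROUGE-L), copied from the per-category table.
per_category = {
    "perception": (1738, 0.217, 0.615),
    "prediction": (1181, 0.097, 0.368),
    "planning": (813, 0.107, 0.507),
    "behavior": (38, 0.305, 0.911),
}


def weighted_mean(rows, column):
    """Sample-weighted mean of one score column (1 = baseline, 2 = adapter)."""
    n_total = sum(row[0] for row in rows.values())
    return sum(row[0] * row[column] for row in rows.values()) / n_total


total_n = sum(row[0] for row in per_category.values())
baseline_overall = weighted_mean(per_category, 1)
adapter_overall = weighted_mean(per_category, 2)

print(total_n, round(baseline_overall, 3), round(adapter_overall, 3))
```

With the rounded table values this gives about 0.157 for the baseline and 0.517 for the adapter, in line with the reported 0.157 and 0.518. It also shows why the prediction category (1,181 samples) moves the overall score far more than behavior (38 samples).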

Why the prediction regression

DriveLM's prediction-category eval set has a natural answer distribution of roughly 2/38/16/44% across Yes/No/None-pattern/other. The natural-1024 training data matched that distribution almost exactly (7/119/49/136 samples). Uniform stratification forced 50/50/50/100 instead, so the model never learned that "No" is far more likely than "Yes" for prediction (about 17× in the natural-1024 counts, 119 vs 7), and at inference it applies the wrong base rates. The 0.29 ROUGE-L drop on a 1,181-sample category is the cost.
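
The base-rate shift can be made concrete by converting the counts quoted above into shares. This is a minimal sketch using only the numbers in this section; the variable names are illustrative.

```python
# Prediction-category answer-pattern counts quoted in this section.
natural = {"Yes": 7, "No": 119, "None-pattern": 49, "other": 136}    # natural-1024
uniform = {"Yes": 50, "No": 50, "None-pattern": 50, "other": 100}    # this adapter


def shares(counts):
    """Convert raw counts into fractions of the category total."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


natural_shares = shares(natural)
uniform_shares = shares(uniform)

for name in natural:
    print(f"{name:13s} natural {natural_shares[name]:6.1%}   uniform {uniform_shares[name]:6.1%}")

# How much more frequent "No" is than "Yes" in each training mix.
no_to_yes_natural = natural["No"] / natural["Yes"]
no_to_yes_uniform = uniform["No"] / uniform["Yes"]
print(no_to_yes_natural, no_to_yes_uniform)
```

The natural counts reproduce the ~2/38/16/44% split, while the uniform mix presents "Yes" and "No" equally often, which is the mismatch the paragraph above describes.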

The proportional sibling addresses this by preserving natural within-category proportions with a min-floor on rare classes.
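
One way to picture "natural proportions with a min-floor" is the allocation sketch below. It is not the sibling's actual sampling code, which this card does not show: the function name, the 250-sample budget, and the floor of 25 are all assumptions chosen for illustration.

```python
def proportional_with_floor(counts, budget, floor):
    """Split `budget` samples across classes in proportion to `counts`,
    but give every class at least `floor` samples (hypothetical helper)."""
    floored = set()
    while True:
        free = {k: n for k, n in counts.items() if k not in floored}
        if not free:
            shares = {}
            break
        remaining = budget - floor * len(floored)
        free_total = sum(free.values())
        # Proportional share of whatever budget is left after the floored classes.
        shares = {k: remaining * n / free_total for k, n in free.items()}
        newly_floored = {k for k, s in shares.items() if s < floor}
        if not newly_floored:
            break
        floored |= newly_floored
    alloc = {k: floor for k in floored}
    # Rounding can leave the total off by a sample or two in general.
    alloc.update({k: round(s) for k, s in shares.items()})
    return alloc


# Prediction-category counts from natural-1024; budget and floor are assumed values.
natural = {"Yes": 7, "No": 119, "None-pattern": 49, "other": 136}
alloc = proportional_with_floor(natural, budget=250, floor=25)
print(alloc)
```

Under these assumed settings the rare "Yes" class is lifted to the floor while "No" remains several times more frequent, so the model still sees a skewed base rate rather than a flat one.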

Training Details

Identical to the natural-1024 sibling except for the sample composition:

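| Category | Yes | No | None-ptn | other | Total |
|---|---|---|---|---|---|
| perception | 50 | 50 | 150 (None-ptn + other combined) | | 250 |
| prediction | 50 | 50 | 50 | 100 | 250 |
| planning | 50 | 50 | 25 | 125 | 250 |
| behavior | 38 × 4 = 152 | | | | 152 |
| Total | | | | | 902 |

| Knob | Value |
|---|---|
| Learning rate | 2e-4 |
| Epochs | 1 |
| Final epoch-avg loss | 0.440 |
| Training wall clock | ~18 minutes |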

The lesson

Failure-mode analysis pointed at the data, but the actual fix was a hyperparameter. The behavior collapse we attributed to "only 10 behavior samples in training" was mostly caused by lr=2e-4 being slightly too aggressive: with lr=1e-4, those 10 samples are enough to keep behavior at 0.877. Data-side stratification pushed behavior higher still (0.911), but at the cost of prediction. Net: the LR fix is the better lever.

Limitations

Same as the series.

License

Apache-2.0.

Model provider

pranavthombare

Model tree

Base

Qwen/Qwen3.5-0.8B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text
