youngryankim/qwen3.5-0.8b-cost-aware-router API & Inference Endpoint

Architecture (multi-head)

```markdown
Qwen3.5-0.8B (LoRA) → last-token hidden h[1024]
   ├── routing_head Linear(1024→2) → sigmoid → (p_haiku, p_opus)   # BCE
   └── token_head   Linear(1024→2) → z-scored log1p(output tokens) # MSE
```

This repo holds the LoRA adapter (adapter_model.safetensors) plus the two head weights and the token-target normalization stats in heads.pt (routing_head, token_head, hidden, tok_mean, tok_std).
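
The token-target normalization can be illustrated with a small round trip. This is a minimal sketch, assuming tok_mean and tok_std are the mean and standard deviation of log1p(output tokens) over the training targets; the helper names (fit_stats, normalize, denormalize), the use of a population standard deviation, and the token counts are hypothetical and not taken from this repo.

```python
import math
import statistics

def fit_stats(token_counts):
    """Mean and std of log1p(token count) over a set of targets (assumed recipe)."""
    logs = [math.log1p(n) for n in token_counts]
    return statistics.fmean(logs), statistics.pstdev(logs)

def normalize(n_tokens, mean, std):
    """Token count -> z-scored log1p target, the quantity token_head regresses."""
    return (math.log1p(n_tokens) - mean) / std

def denormalize(z, mean, std):
    """Head output -> predicted token count (inverse of normalize)."""
    return math.expm1(z * std + mean)

# Hypothetical output lengths for one model; not data from this repo.
counts = [12, 40, 150, 600, 2400]
mean, std = fit_stats(counts)
z = normalize(150, mean, std)
recovered = denormalize(z, mean, std)
print(z, recovered)
```

The log1p transform compresses the long tail of output lengths before z-scoring, which is why inference has to undo both steps (rescale, then expm1) to get back to a token count.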

Metrics (600 held-out queries)

routing AUC (B detection): 0.75
opus cost-prediction corr: 0.50
cost-aware curve: +4–6 pp accuracy over random routing at matched budget

Usage

```python
import math
import torch, torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "youngryankim/qwen3.5-0.8b-cost-aware-router"
tok = AutoTokenizer.from_pretrained(REPO)
backbone = AutoModel.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.bfloat16)
backbone = PeftModel.from_pretrained(backbone, REPO).eval().cuda()

# heads.pt holds both head state dicts, the hidden size, and the token-target stats.
heads = torch.load(hf_hub_download(REPO, "heads.pt"), map_location="cuda")
rh = nn.Linear(heads["hidden"], 2).bfloat16().cuda(); rh.load_state_dict(heads["routing_head"]); rh.eval()
th = nn.Linear(heads["hidden"], 2).bfloat16().cuda(); th.load_state_dict(heads["token_head"]); th.eval()
mean, std = heads["tok_mean"], heads["tok_std"]

SYS = ("You are a routing model. Read the user query and assess which model can "
       "answer it and how long each answer will be.")

@torch.no_grad()
def route(query, input_tokens=200):
    enc = tok.apply_chat_template([{"role":"system","content":SYS},
                                   {"role":"user","content":query}],
                                  add_generation_prompt=True, tokenize=True,
                                  return_dict=True, return_tensors="pt").to("cuda")
    h = backbone(**enc).last_hidden_state[:, -1, :]   # last-token hidden state
    p_h, p_o = torch.sigmoid(rh(h).float())[0].tolist()
    t = th(h).float()[0].tolist()                     # z-scored log1p(output tokens)
    # Undo the z-scoring, then invert log1p to recover predicted output tokens.
    out_h = math.expm1(t[0]*float(std[0]) + float(mean[0]))
    out_o = math.expm1(t[1]*float(std[1]) + float(mean[1]))
    cost_h = 5e-6*out_h + 1e-6*input_tokens     # haiku $5/$1 per Mtok (output/input)
    cost_o = 25e-6*out_o + 5e-6*input_tokens    # opus  $25/$5 per Mtok (output/input)
    score = p_o - p_h                            # routing score
    cost_aware = score / max(cost_o - cost_h, 1e-6)   # score per extra dollar spent on opus
    return dict(p_haiku=p_h, p_opus=p_o, pred_out_h=out_h, pred_out_o=out_o,
                pred_cost_h=cost_h, pred_cost_o=cost_o, score=score, cost_aware=cost_aware)

print(route("What is 17 * 23?"))
```
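
The cost arithmetic inside route() can be checked by hand without loading the model. The sketch below repeats only that post-processing step with made-up head outputs; the function name cost_aware_score and the input values are hypothetical, and the per-token prices are the ones given in the comments of route().

```python
# Prices in dollars per token, taken from the comments in route().
HAIKU_IN, HAIKU_OUT = 1e-6, 5e-6
OPUS_IN, OPUS_OUT = 5e-6, 25e-6

def cost_aware_score(p_haiku, p_opus, out_h, out_o, input_tokens=200):
    """Return (cost_h, cost_o, cost_aware) for one query."""
    cost_h = HAIKU_OUT * out_h + HAIKU_IN * input_tokens
    cost_o = OPUS_OUT * out_o + OPUS_IN * input_tokens
    score = p_opus - p_haiku
    # Floor the denominator so a near-zero cost gap cannot blow up the ratio.
    return cost_h, cost_o, score / max(cost_o - cost_h, 1e-6)

# Hypothetical head outputs: opus looks likelier to succeed but writes more tokens.
cost_h, cost_o, ca = cost_aware_score(0.6, 0.9, out_h=300, out_o=400)
print(cost_h, cost_o, ca)
```

With these inputs the predicted costs are about $0.0017 for haiku and $0.011 for opus, so a score gap of 0.3 becomes roughly 32 per extra dollar. A query with the same score gap but a longer predicted opus answer would rank lower under cost_aware.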

Route to opus when score (or cost_aware, under a budget) exceeds a threshold swept on your validation set.
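
One possible way to run that sweep is sketched below. It assumes each validation row carries the route() outputs plus two ground-truth fields, haiku_ok and opus_ok (1 if that model answered correctly), which are hypothetical names and not part of this repo. It also charges predicted costs against the budget; measured costs would be a better choice where they are available. This is an illustration, not necessarily the procedure behind the reported metrics.

```python
def sweep_threshold(rows, budget, key="cost_aware"):
    """Return (threshold, accuracy, total_cost) for the most accurate
    threshold whose total predicted cost stays within budget, or None."""
    best = None
    # Candidates: -inf (everything to opus) plus every observed score.
    candidates = [float("-inf")] + sorted({r[key] for r in rows})
    for thr in candidates:
        correct, cost = 0, 0.0
        for r in rows:
            to_opus = r[key] > thr          # route to opus when the score exceeds thr
            correct += r["opus_ok"] if to_opus else r["haiku_ok"]
            cost += r["pred_cost_o"] if to_opus else r["pred_cost_h"]
        acc = correct / len(rows)
        if cost <= budget and (best is None or acc > best[1]):
            best = (thr, acc, cost)
    return best

# Hypothetical validation rows; not data from this repo.
rows = [
    {"cost_aware": 50, "haiku_ok": 0, "opus_ok": 1, "pred_cost_h": 0.001, "pred_cost_o": 0.010},
    {"cost_aware": 10, "haiku_ok": 1, "opus_ok": 1, "pred_cost_h": 0.001, "pred_cost_o": 0.010},
    {"cost_aware": 30, "haiku_ok": 0, "opus_ok": 1, "pred_cost_h": 0.002, "pred_cost_o": 0.020},
    {"cost_aware": -5, "haiku_ok": 1, "opus_ok": 0, "pred_cost_h": 0.001, "pred_cost_o": 0.010},
]
thr, acc, cost = sweep_threshold(rows, budget=0.015)
print(thr, acc, cost)
```

Passing key="score" sweeps the plain routing score instead, which is the natural choice when there is no budget constraint.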
