License: apache-2.0
Architecture (multi-head)
```markdown
Qwen3.5-0.8B (LoRA) → last-token hidden h[1024]
├── routing_head  Linear(1024→2) → sigmoid → (p_haiku, p_opus)    # BCE
└── token_head    Linear(1024→2) → z-scored log1p(output tokens)  # MSE
```
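The diagram names a BCE loss for the routing head and an MSE loss for the token head but does not say how the two are combined. The following is a minimal pure-Python sketch of one plausible per-example objective. It assumes the routing labels are 0/1 flags per model and that the token targets are already normalized. The weight `lam` and the function name `multi_head_loss` are hypothetical and do not come from the training code.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bce(logit: float, label: float) -> float:
    """Binary cross-entropy for one logit against a 0/1 label."""
    p = sigmoid(logit)
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))

def mse(pred: float, target: float) -> float:
    """Squared error for one prediction."""
    return (pred - target) ** 2

def multi_head_loss(routing_logits, labels, token_preds, token_targets, lam=1.0):
    """Mean BCE over the two routing outputs plus lam * mean MSE over the
    two token outputs. `lam` is an assumed weight, not a documented one."""
    l_route = sum(bce(z, y) for z, y in zip(routing_logits, labels)) / len(labels)
    l_tok = sum(mse(p, t) for p, t in zip(token_preds, token_targets)) / len(token_targets)
    return l_route + lam * l_tok

# Toy example: (haiku, opus) logits, 0/1 labels, z-scored token predictions.
loss = multi_head_loss(
    routing_logits=[0.0, 2.0],
    labels=[0.0, 1.0],
    token_preds=[0.5, -0.5],
    token_targets=[0.0, 0.0],
)
print(f"{loss:.4f}")
```

Because each routing output gets its own sigmoid, the two probabilities are independent and need not sum to 1.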
This repo holds the LoRA adapter (adapter_model.safetensors) plus the two
head weights and the token-target normalization stats in heads.pt
(routing_head, token_head, hidden, tok_mean, tok_std).
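To make the role of `tok_mean` and `tok_std` concrete, here is a small sketch of the target transform and its inverse, written in plain Python. The numeric values of `TOK_MEAN` and `TOK_STD` are made up for illustration, since the real ones live in heads.pt, and the helper names `encode_tokens` and `decode_tokens` are hypothetical.

```python
import math

# Hypothetical stats for illustration; the real values are tok_mean / tok_std in heads.pt.
TOK_MEAN = 5.0
TOK_STD = 1.2

def encode_tokens(n_tokens: float) -> float:
    """Training-side target: z-score of log1p(output tokens)."""
    return (math.log1p(n_tokens) - TOK_MEAN) / TOK_STD

def decode_tokens(z: float) -> float:
    """Inference-side inverse: undo the z-score, then undo log1p with expm1."""
    return math.expm1(z * TOK_STD + TOK_MEAN)

z = encode_tokens(300)
print(z, decode_tokens(z))
```

The log1p transform compresses the long tail of output lengths, so the MSE loss is not dominated by a few very long answers.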
Metrics (600 held-out queries)
- routing AUC (B detection): 0.75
- opus cost-prediction corr: 0.50
- cost-aware curve: +4–6 pp accuracy over random routing at matched budget
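One way to read "matched budget" is sketched below: every query starts on the cheaper model, and queries are upgraded in some order until the spend limit is reached. Ranking by `cost_aware` is then compared with a random order under the same limit. This is an assumption about the evaluation protocol, not the author's script. The function name, the field names, and the toy data are all hypothetical.

```python
import random

def accuracy_under_budget(order, queries, budget):
    """Start with every query on the cheap model, then upgrade queries in the
    given order while the total spend stays within `budget`. Returns accuracy."""
    spend = sum(q["cost_h"] for q in queries)
    use_opus = [False] * len(queries)
    for i in order:
        extra = queries[i]["cost_o"] - queries[i]["cost_h"]
        if spend + extra <= budget:
            spend += extra
            use_opus[i] = True
    hits = sum(q["opus_ok"] if u else q["haiku_ok"] for q, u in zip(queries, use_opus))
    return hits / len(queries)

# Toy validation set (costs in arbitrary units, correctness as 0/1 flags).
queries = [
    {"cost_aware": 2.0,  "haiku_ok": 0, "opus_ok": 1, "cost_h": 1, "cost_o": 3},
    {"cost_aware": -1.0, "haiku_ok": 1, "opus_ok": 1, "cost_h": 1, "cost_o": 3},
    {"cost_aware": 1.5,  "haiku_ok": 0, "opus_ok": 1, "cost_h": 1, "cost_o": 3},
    {"cost_aware": 0.1,  "haiku_ok": 1, "opus_ok": 0, "cost_h": 1, "cost_o": 3},
]
budget = 8  # enough for two upgrades on top of the all-haiku baseline

ranked = sorted(range(len(queries)), key=lambda i: queries[i]["cost_aware"], reverse=True)
shuffled = list(range(len(queries)))
random.Random(0).shuffle(shuffled)  # seeded random-routing baseline

ranked_acc = accuracy_under_budget(ranked, queries, budget)
random_acc = accuracy_under_budget(shuffled, queries, budget)
print(ranked_acc, random_acc)
```

Repeating the comparison over a range of budgets gives one curve per policy, and the vertical gap between them is the kind of accuracy difference the metric reports.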
Usage
```python
import torch, torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "youngryankim/qwen3.5-0.8b-cost-aware-router"
tok = AutoTokenizer.from_pretrained(REPO)
backbone = AutoModel.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.bfloat16)
backbone = PeftModel.from_pretrained(backbone, REPO).eval().cuda()

heads = torch.load(hf_hub_download(REPO, "heads.pt"), map_location="cuda")
rh = nn.Linear(heads["hidden"], 2).bfloat16().cuda(); rh.load_state_dict(heads["routing_head"]); rh.eval()
th = nn.Linear(heads["hidden"], 2).bfloat16().cuda(); th.load_state_dict(heads["token_head"]); th.eval()
mean, std = heads["tok_mean"], heads["tok_std"]

SYS = ("You are a routing model. Read the user query and assess which model can "
       "answer it and how long each answer will be.")

@torch.no_grad()
def route(query, input_tokens=200):
    enc = tok.apply_chat_template(
        [{"role": "system", "content": SYS}, {"role": "user", "content": query}],
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to("cuda")
    h = backbone(**enc).last_hidden_state[:, -1, :]
    p_h, p_o = torch.sigmoid(rh(h).float())[0].tolist()
    t = th(h).float()[0].tolist()
    out_h = torch.expm1(torch.tensor(t[0]*std[0]+mean[0])).item()
    out_o = torch.expm1(torch.tensor(t[1]*std[1]+mean[1])).item()
    cost_h = 5e-6*out_h + 1e-6*input_tokens    # haiku $5/$1 per Mtok
    cost_o = 25e-6*out_o + 5e-6*input_tokens   # opus $25/$5 per Mtok
    score = p_o - p_h                          # routing score
    cost_aware = score / max(cost_o - cost_h, 1e-6)
    return dict(p_haiku=p_h, p_opus=p_o, pred_out_h=out_h, pred_out_o=out_o,
                pred_cost_h=cost_h, pred_cost_o=cost_o, score=score, cost_aware=cost_aware)

print(route("What is 17 * 23?"))
```
Route to opus when score (or cost_aware, under a budget) exceeds a threshold
swept on your validation set.
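A threshold sweep of this kind could look like the sketch below. It assumes you have, for each validation query, the router's `score` plus 0/1 flags for whether each model answered correctly. The helper names `sweep_threshold` and `pick_threshold`, the cap on the share of traffic sent to opus, and the toy data are all hypothetical.

```python
def sweep_threshold(scores, haiku_ok, opus_ok):
    """For each candidate threshold, route to opus when score > threshold and
    return rows of (threshold, accuracy, fraction_routed_to_opus)."""
    rows = []
    for thr in sorted(set(scores)):
        to_opus = [s > thr for s in scores]
        hits = sum(o if r else h for r, h, o in zip(to_opus, haiku_ok, opus_ok))
        rows.append((thr, hits / len(scores), sum(to_opus) / len(scores)))
    return rows

def pick_threshold(rows, max_opus_fraction):
    """Most accurate threshold whose opus share stays within the cap.
    The largest score is always a candidate and routes nothing to opus,
    so at least one row is feasible for any non-negative cap."""
    feasible = [r for r in rows if r[2] <= max_opus_fraction]
    return max(feasible, key=lambda r: r[1])[0]

# Toy validation data: router scores and per-model correctness flags.
scores   = [0.6, -0.2, 0.3, -0.5]
haiku_ok = [0, 1, 0, 1]
opus_ok  = [1, 1, 1, 1]

rows = sweep_threshold(scores, haiku_ok, opus_ok)
best = pick_threshold(rows, max_opus_fraction=0.5)
print(best)
```

The same sweep works with `cost_aware` in place of `score`. Only the ranking of queries changes, not the procedure.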
Model provider: youngryankim
Model tree: base Qwen/Qwen3.5-0.8B; adapter: this model
Modalities: input Video, Text, Image; output Text