README

License: apache-2.0

Architecture (multi-head)

```text
Qwen3.5-0.8B (LoRA) → last-token hidden h[1024]
├── routing_head  Linear(1024→2) → sigmoid → (p_haiku, p_opus)     # BCE
└── token_head    Linear(1024→2) → z-scored log1p(output tokens)   # MSE
```

This repo holds the LoRA adapter (`adapter_model.safetensors`), plus the two head weights and the token-target normalization stats in `heads.pt` (keys: `routing_head`, `token_head`, `hidden`, `tok_mean`, `tok_std`).
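The token head regresses a z-scored log1p of the output-token count, so predictions have to be un-normalized before use. The following is a minimal sketch of that transform and its inverse, assuming `tok_mean` and `tok_std` are the mean and standard deviation of log1p(output tokens) over the training set. The numeric values and the helper names `encode_tokens` and `decode_tokens` are hypothetical, chosen for illustration; the real stats are stored in `heads.pt`.

```python
import math

# Hypothetical normalization stats; the real values live in heads.pt.
TOK_MEAN = 5.0  # assumed mean of log1p(output tokens)
TOK_STD = 1.2   # assumed std of log1p(output tokens)

def encode_tokens(n_tokens, mean=TOK_MEAN, std=TOK_STD):
    """Map a raw output-token count to the z-scored log1p regression target."""
    return (math.log1p(n_tokens) - mean) / std

def decode_tokens(z, mean=TOK_MEAN, std=TOK_STD):
    """Invert the transform: z-score -> log1p space -> raw token count."""
    return math.expm1(z * std + mean)

z = encode_tokens(300)
print(round(z, 3), round(decode_tokens(z), 1))
```

Working in log1p space keeps very long answers from dominating the MSE loss, and `expm1` maps a prediction back to a token count.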

Metrics (600 held-out queries)

  • routing AUC (B detection): 0.75
  • opus cost-prediction corr: 0.50
  • cost-aware curve: +4–6 pp accuracy over random routing at matched budget
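For readers who want to reproduce an AUC-style number on their own data, here is a minimal sketch of the pairwise definition of ROC AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. Which class counts as "positive" is an assumption you have to make for your own labels, and the toy labels and scores below are invented for illustration, not drawn from the held-out set.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data for illustration only.
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.2, 0.4, 0.6, 0.7]
print(roc_auc(toy_labels, toy_scores))
```

The quadratic loop is fine for a few hundred queries; a rank-based implementation would be preferable for larger sets.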

Usage

```python
import torch, torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "youngryankim/qwen3.5-0.8b-cost-aware-router"

# Base model plus the LoRA adapter from this repo.
tok = AutoTokenizer.from_pretrained(REPO)
backbone = AutoModel.from_pretrained("Qwen/Qwen3.5-0.8B", dtype=torch.bfloat16)
backbone = PeftModel.from_pretrained(backbone, REPO).eval().cuda()

# Head weights and token-target normalization stats.
heads = torch.load(hf_hub_download(REPO, "heads.pt"), map_location="cuda")
rh = nn.Linear(heads["hidden"], 2).bfloat16().cuda(); rh.load_state_dict(heads["routing_head"]); rh.eval()
th = nn.Linear(heads["hidden"], 2).bfloat16().cuda(); th.load_state_dict(heads["token_head"]); th.eval()
mean, std = heads["tok_mean"], heads["tok_std"]

SYS = ("You are a routing model. Read the user query and assess which model can "
       "answer it and how long each answer will be.")

@torch.no_grad()
def route(query, input_tokens=200):
    enc = tok.apply_chat_template(
        [{"role": "system", "content": SYS},
         {"role": "user", "content": query}],
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to("cuda")
    h = backbone(**enc).last_hidden_state[:, -1, :]      # last-token hidden state
    p_h, p_o = torch.sigmoid(rh(h).float())[0].tolist()  # (p_haiku, p_opus)
    t = th(h).float()[0].tolist()                        # z-scored log1p(output tokens)
    # Undo the z-scoring and the log1p to get predicted output-token counts.
    out_h = torch.expm1(torch.tensor(t[0]*std[0]+mean[0])).item()
    out_o = torch.expm1(torch.tensor(t[1]*std[1]+mean[1])).item()
    cost_h = 5e-6*out_h + 1e-6*input_tokens    # haiku: $5 output / $1 input per Mtok
    cost_o = 25e-6*out_o + 5e-6*input_tokens   # opus: $25 output / $5 input per Mtok
    score = p_o - p_h                          # routing score
    cost_aware = score / max(cost_o - cost_h, 1e-6)  # score per extra dollar spent on opus
    return dict(p_haiku=p_h, p_opus=p_o, pred_out_h=out_h, pred_out_o=out_o,
                pred_cost_h=cost_h, pred_cost_o=cost_o, score=score, cost_aware=cost_aware)

print(route("What is 17 * 23?"))
```

Route to opus when `score` (or `cost_aware`, under a budget) exceeds a threshold swept on your validation set.
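One possible way to sweep that threshold is sketched below: for each candidate threshold, route a query to opus when its routing value exceeds the threshold, then keep the threshold with the best validation accuracy whose mean cost stays within the budget. This is an illustration, not the procedure used to produce the metrics above. The function name `sweep_threshold`, the record fields (`correct_h`, `correct_o`, `cost_h`, `cost_o`), and the toy records are all hypothetical.

```python
def sweep_threshold(records, budget, key="score"):
    """Return (threshold, accuracy, mean_cost) with the best accuracy within budget.

    Each record is a dict holding the routing value under `key`, per-model
    correctness flags (correct_h, correct_o) and per-model costs in dollars
    (cost_h, cost_o). Returns None if no threshold fits the budget.
    """
    # -inf routes everything to opus; the largest observed value routes nothing.
    candidates = [float("-inf")] + sorted({r[key] for r in records})
    best = None
    for thr in candidates:
        acc = cost = 0.0
        for r in records:
            use_opus = r[key] > thr
            acc += r["correct_o"] if use_opus else r["correct_h"]
            cost += r["cost_o"] if use_opus else r["cost_h"]
        acc /= len(records)
        cost /= len(records)
        if cost <= budget and (best is None or acc > best[1]):
            best = (thr, acc, cost)
    return best

# Toy validation records for illustration only.
records = [
    {"score": 0.8, "correct_h": 0, "correct_o": 1, "cost_h": 0.001, "cost_o": 0.010},
    {"score": 0.5, "correct_h": 0, "correct_o": 1, "cost_h": 0.001, "cost_o": 0.010},
    {"score": -0.2, "correct_h": 1, "correct_o": 1, "cost_h": 0.001, "cost_o": 0.010},
    {"score": -0.6, "correct_h": 1, "correct_o": 1, "cost_h": 0.001, "cost_o": 0.010},
]
print(sweep_threshold(records, budget=0.006))
```

Passing `key="cost_aware"` sweeps the cost-normalized value instead, provided each record carries that field.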

Model provider: youngryankim

Model tree: base Qwen/Qwen3.5-0.8B; adapter: this model

Modalities: input Video, Text, Image; output Text
