Hanno-Labs
bosun-4b
License: apache-2.0
Changelog
v1.1 — broader general judgment (current)
Same architecture and inference contract as v1.0; retrained on an expanded blend (DialAM-2024 argument edges, NLI, PAWS, e-CARE/COPA causal, dedup hard-negatives, completeness, and synthetic directional data, on top of v1.0). Still one model, programmed by a sentence — no per-task fine-tuning.
New: directional & typed-edge judgment — supersession ("B replaces A"), depends-on, supports / contradicts. Bosun now reads the ordered pair for asymmetric relations, not just symmetric similarity.
Generality on held-out public benchmarks (one instruction each), vs a frontier LLM on the same items:
| benchmark | Bosun-4B v1.1 | gemini-3.1-flash-lite | similarity baseline | fine-tuned specialist |
|---|---|---|---|---|
| PAWS (adversarial paraphrase) | 0.91 | 0.81 | ~chance (0.53 AUROC) | ~0.95 (DeBERTa) |
| e-CARE (causal direction) | 0.85 | 0.86 | 0.60 | ~0.75 (paper) |
| ANLI (adversarial NLI) | 0.57 | 0.74 | 0.33 | ~0.69 |
Bosun-4B beats gemini-3.1-flash-lite on PAWS (0.91 vs 0.81), is within 0.01 of it on e-CARE (0.85 vs 0.86), and trails on ANLI (0.57 vs 0.74), while far outscoring it on steerable judgment (WarrantBench 0.945 vs 0.575). Edge curation (DialAM-2024): recall 0.71, beating Sonnet on both recall and precision.
No regression: FollowIR flat vs v1.0; WarrantBench steerability 0.885 → 0.945.
v1.0 — launch
Symmetric programmable judge. WarrantBench steerability 0.885; FollowIR state-of-the-art (+17.9 p-MRR).
Inference contract
Native Qwen3-Reranker template; read the last-token logits:
```markdown
<Instruct>: <your rule, e.g. "Connected only if the two findings share a specific named entity.">
<Query>: These two findings share the specified relationship.
<Document>: FINDING A:\n<text_a>\n\nFINDING B:\n<text_b>
```
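The template body can be assembled with a small helper. This is a sketch under one assumption: the three fields are separated by newlines, as in the upstream Qwen3-Reranker format. The helper name `format_pair` is illustrative, and the template prefix and suffix from serving.json still have to be added around the result.

```python
# Fixed query sentence from the template above.
QUERY = "These two findings share the specified relationship."


def format_pair(instruction: str, text_a: str, text_b: str) -> str:
    """Build the <Instruct>/<Query>/<Document> body for one ordered pair.

    Assumes newline separators between the three fields.
    """
    document = f"FINDING A:\n{text_a}\n\nFINDING B:\n{text_b}"
    return f"<Instruct>: {instruction}\n<Query>: {QUERY}\n<Document>: {document}"
```

Only the instruction changes between tasks; the query sentence stays fixed, which is what lets a single model be reprogrammed by a sentence.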
score = sigmoid(logits[yes_id] - logits[no_id]) at the final position (logits_to_keep=1). The exact
yes_id / no_id / template prefix+suffix and max_len are in serving.json.
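The score is a two-way softmax: the sigmoid of the logit margin equals the probability of "yes" when the distribution is restricted to the yes and no tokens, so the rest of the vocabulary does not affect it. A minimal sketch in plain Python, with illustrative function names:

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """sigmoid(yes_logit - no_logit), written to avoid overflow for large margins."""
    margin = yes_logit - no_logit
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def two_way_softmax_yes(yes_logit: float, no_logit: float) -> float:
    """Probability of 'yes' under a softmax over just the two logits."""
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

Equal logits give exactly 0.5, so a threshold of 0.5 corresponds to "yes is at least as likely as no".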
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

repo = "Hanno-Labs/bosun-4b"
cfg = ...  # serving.json from this repo
tok = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", padding_side="left")
base = AutoModelForCausalLM.from_pretrained(
    cfg["base_model"], torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", trust_remote_code=True)
model = PeftModel.from_pretrained(base, repo).merge_and_unload().eval().cuda()

# build ids = prefix + <Instruct/Query/Document> + suffix, then:
# lg = model(input_ids, attention_mask, logits_to_keep=1).logits[:, -1, :]
# p  = torch.sigmoid(lg[:, cfg["yes_id"]] - lg[:, cfg["no_id"]])
```
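The "build ids" step can be sketched on plain token-id lists. Two assumptions here are not confirmed by this README: only the body is truncated, so the prefix and suffix always survive the max_len cut, and the pad id is supplied by the caller. Left padding matters because it keeps every row's final real token in the last column, which is the position the logits are read from.

```python
from typing import List, Tuple


def build_ids(prefix: List[int], body: List[int], suffix: List[int],
              max_len: int) -> List[int]:
    """Concatenate prefix + body + suffix, truncating only the body to fit max_len."""
    room = max_len - len(prefix) - len(suffix)
    if room < 0:
        raise ValueError("max_len is smaller than prefix + suffix")
    return prefix + body[:room] + suffix


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad a batch to equal width and return (input_ids, attention_mask)."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask
```

The two returned lists can be wrapped with `torch.tensor(...)` and moved to the model's device before the forward pass.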
Run locally (GGUF / llama.cpp)
CPU / Apple-Silicon / edge builds (f16, Q8_0, Q4_K_M — all calibration-safe at 4B) live at Hanno-Labs/bosun-4b-GGUF.
⚠️ Do not use llama.cpp's --rerank mode — it silently discards the <Instruct> and returns
degenerate, instruction-blind scores. Use the completion + logits path documented in that repo
(validated per-pair against this model's transformers reference — Q8_0 within ~0.001).
Results
Bosun-4B is state-of-the-art on FollowIR (public instruction-following retrieval), averaging +17.9 p-MRR on the full pool — it changes its judgments correctly when the instruction changes, where most retrievers move the wrong way. On a capped pool it matches gemini-3.1-flash-lite head-to-head (12.0 = 12.0) at a fraction of the cost.
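For readers unfamiliar with p-MRR, the sketch below follows our understanding of the pairwise definition in the FollowIR paper; treat it as an illustration, not the official scorer. It compares a document's rank under the original instruction with its rank under the altered one, for documents that the altered instruction makes non-relevant. The flat mean in `mean_p_mrr` is a simplification, and both function names are illustrative.

```python
from typing import Iterable, Tuple


def p_mrr(rank_og: int, rank_new: int) -> float:
    """Pairwise MRR change for one document (ranks are 1-based).

    Positive when the document moves down under the new instruction,
    negative when it moves up, zero when the rank is unchanged.
    """
    mrr_og = 1.0 / rank_og
    mrr_new = 1.0 / rank_new
    if rank_og > rank_new:
        return mrr_og / mrr_new - 1.0
    return 1.0 - mrr_new / mrr_og


def mean_p_mrr(rank_pairs: Iterable[Tuple[int, int]]) -> float:
    """Flat mean over (rank_og, rank_new) pairs, scaled to the -100..100 range."""
    scores = [p_mrr(og, new) for og, new in rank_pairs]
    return 100.0 * sum(scores) / len(scores)
```

Under this definition a positive score means the newly non-relevant documents were pushed down the ranking, the intended direction, while a negative score means they moved up.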
WarrantBench (Hanno-Labs/warrantbench): follows arbitrary rules and their negations, and flips correctly on steerability triples. The 4B capacity closes the hardest-slice gap to the frontier LLM that the 0.6B leaves open.
Files
| file | what |
|---|---|
| adapter_model.safetensors, adapter_config.json | the LoRA adapter (load with PEFT over the base) |
| serving.json | inference contract: template + yes_id/no_id + max_len |
| tokenizer/ | Qwen tokenizer (left-padding) |
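Reading serving.json from a local copy can be sketched as follows. The keys checked are the ones the loading snippet indexes (`base_model`, `yes_id`, `no_id`). The key names for the template and max_len are not shown in this README, so the sketch does not assume them. The helper name `load_serving_config` is illustrative.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union

# Keys that the loading snippet in this README reads from the config.
REQUIRED_KEYS = ("base_model", "yes_id", "no_id")


def load_serving_config(path: Union[str, Path]) -> Dict[str, Any]:
    """Load a local serving.json and fail early if a required key is missing."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"serving.json is missing keys: {missing}")
    return cfg
```

Failing early on a missing key avoids silently scoring with the wrong token ids.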
Links
- Launch post — Introducing Bosun
- GGUF (run locally) — Hanno-Labs/bosun-4b-GGUF
- WarrantBench — github.com/Hanno-Labs/warrantbench (dataset)
From Hanno Labs.
Model provider: Hanno-Labs
Model tree: base Qwen/Qwen3-Reranker-4B, adapter: this model
Modalities: text input, text output