Hanno-Labs

bosun-4b


License: apache-2.0

Changelog

v1.1 — broader general judgment (current)

Same architecture and inference contract as v1.0; retrained on an expanded blend (DialAM-2024 argument edges, NLI, PAWS, e-CARE/COPA causal, dedup hard-negatives, completeness, and synthetic directional data, on top of v1.0). Still one model, programmed by a sentence — no per-task fine-tuning.

New: directional & typed-edge judgment — supersession ("B replaces A"), depends-on, supports / contradicts. Bosun now reads the ordered pair for asymmetric relations, not just symmetric similarity.
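As a hedged illustration of what reading the ordered pair can mean in practice, the sketch below scores both orderings of a pair under one directional instruction and compares them. `score_pair` is a hypothetical callable standing in for the model call described under Inference contract, and the 0.5 threshold is an illustrative assumption, not a value from this model card.

```python
from typing import Callable

# Hypothetical scorer signature: (instruction, text_a, text_b) -> probability in [0, 1].
Scorer = Callable[[str, str, str], float]


def judge_direction(score_pair: Scorer, instruction: str, a: str, b: str,
                    threshold: float = 0.5) -> str:
    """Score both orderings of a pair under one directional instruction."""
    forward = score_pair(instruction, a, b)   # a sits in the FINDING A slot
    backward = score_pair(instruction, b, a)  # slots swapped
    fwd_yes, bwd_yes = forward >= threshold, backward >= threshold
    if fwd_yes and not bwd_yes:
        return "a->b"
    if bwd_yes and not fwd_yes:
        return "b->a"
    return "both" if fwd_yes else "neither"
```

A symmetric similarity scorer would return the same value for both orderings, so it could only ever land in "both" or "neither"; an order-sensitive judge can separate the two directions.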

Generality on held-out public benchmarks (one instruction each), vs a frontier LLM on the same items:

| benchmark | Bosun-4B v1.1 | gemini-3.1-flash-lite | similarity baseline | fine-tuned specialist |
|---|---|---|---|---|
| PAWS (adversarial paraphrase) | 0.91 | 0.81 | ~chance (0.53 AUROC) | ~0.95 (DeBERTa) |
| e-CARE (causal direction) | 0.85 | 0.86 | 0.60 | ~0.75 (paper) |
| ANLI (adversarial NLI) | 0.57 | 0.74 | 0.33 | ~0.69 |

Bosun-4B beats gemini-3.1-flash-lite on PAWS (0.91 vs 0.81), is effectively tied with it on e-CARE (0.85 vs 0.86), and trails it on ANLI (0.57 vs 0.74), while far outscoring it on steerable judgment (WarrantBench 0.945 vs 0.575). Edge curation (DialAM-2024): recall 0.71, beating Sonnet on both recall and precision.

No regression: FollowIR flat vs v1.0; WarrantBench steerability 0.885 → 0.945.

v1.0 — launch

Symmetric programmable judge. WarrantBench steerability 0.885; FollowIR state-of-the-art (+17.9 p-MRR).

Inference contract

Native Qwen3-Reranker template; read the last-token logits:

```markdown
<Instruct>: <your rule, e.g. "Connected only if the two findings share a specific named entity.">
<Query>: These two findings share the specified relationship.
<Document>: FINDING A:\n<text_a>\n\nFINDING B:\n<text_b>
```

`score = sigmoid(logits[yes_id] - logits[no_id])` at the final position (`logits_to_keep=1`). The exact `yes_id` / `no_id`, the template prefix and suffix, and `max_len` are in `serving.json`.
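The score formula can be checked in isolation. The minimal sketch below, using only the standard library, computes the sigmoid of the yes/no logit difference and shows that it equals the "yes" probability of a softmax restricted to those two tokens. The function names are illustrative and not part of this repo.

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """sigmoid(yes - no), written to avoid overflow for large differences."""
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def two_way_softmax_yes(yes_logit: float, no_logit: float) -> float:
    """Probability of "yes" under a softmax over just the two tokens."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    ey, en = math.exp(yes_logit - m), math.exp(no_logit - m)
    return ey / (ey + en)
```

Because only the difference of the two logits matters, the rest of the vocabulary distribution does not affect the score.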

```python
import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Hanno-Labs/bosun-4b"

# serving.json (from this repo) holds the template, yes_id / no_id, and max_len.
with open(hf_hub_download(repo, "serving.json")) as f:
    cfg = json.load(f)

tok = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", padding_side="left")
base = AutoModelForCausalLM.from_pretrained(
    cfg["base_model"],
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=True,
)
# Merge the LoRA adapter into the base weights for inference.
model = PeftModel.from_pretrained(base, repo).merge_and_unload().eval().cuda()

# Build ids = prefix + <Instruct/Query/Document> + suffix, then:
# lg = model(input_ids=input_ids, attention_mask=attention_mask, logits_to_keep=1).logits[:, -1, :]
# p = torch.sigmoid(lg[:, cfg["yes_id"]] - lg[:, cfg["no_id"]])
```
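The middle part of the prompt can be assembled with plain string formatting. The sketch below builds only the `<Instruct>` / `<Query>` / `<Document>` body. It assumes the `\n` sequences in the template denote real newlines, and it leaves the prefix and suffix out because their exact form is defined in `serving.json`, not here. `build_body` is a hypothetical helper name.

```python
# Fixed query sentence taken from the template in this README.
QUERY = "These two findings share the specified relationship."


def build_body(instruction: str, text_a: str, text_b: str) -> str:
    """Format the Instruct/Query/Document body for one ordered pair."""
    document = f"FINDING A:\n{text_a}\n\nFINDING B:\n{text_b}"
    return f"<Instruct>: {instruction}\n<Query>: {QUERY}\n<Document>: {document}"


body = build_body(
    "Connected only if the two findings share a specific named entity.",
    "Acme Corp reported a breach in March.",
    "The March incident at Acme Corp exposed customer emails.",
)
```

Only the instruction and the two finding texts vary between calls; the query sentence stays constant, which is what lets one model be reprogrammed by changing a single sentence.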

Run locally (GGUF / llama.cpp)

CPU / Apple-Silicon / edge builds (f16, Q8_0, Q4_K_M — all calibration-safe at 4B) live at Hanno-Labs/bosun-4b-GGUF.

⚠️ Do not use llama.cpp's --rerank mode — it silently discards the <Instruct> and returns degenerate, instruction-blind scores. Use the completion + logits path documented in that repo (validated per-pair against this model's transformers reference — Q8_0 within ~0.001).

Results

Bosun-4B is state-of-the-art on FollowIR (public instruction-following retrieval), averaging +17.9 p-MRR on the full pool: it changes its judgments correctly when the instruction changes, where most retrievers move the wrong way. On a capped pool it matches gemini-3.1-flash-lite head-to-head (12.0 vs 12.0) at a fraction of the cost.

WarrantBench (Hanno-Labs/warrantbench): follows arbitrary rules and their negations, and flips correctly on steerability triples. The 4B capacity closes the hardest-slice gap to the frontier LLM that the 0.6B leaves open.

Files

| file | what |
|---|---|
| adapter_model.safetensors, adapter_config.json | the LoRA adapter (load with PEFT over the base) |
| serving.json | inference contract: template + yes_id/no_id + max_len |
| tokenizer/ | Qwen tokenizer (left-padding) |
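Left-padding matters because the score is read at the final position: with padding on the left, the last real token of every row in a batch sits at index -1. The sketch below illustrates this on plain lists of token ids. `left_pad` and the `pad_id` value are hypothetical; in practice the tokenizer loaded with `padding_side="left"` does this for you.

```python
from typing import List, Tuple


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the left to the longest length in the batch."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        # 0 marks padding, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


ids, mask = left_pad([[11, 12, 13], [21]], pad_id=0)
```

With right-padding, the final position of the shorter row would be a pad token, and the logits read there would not correspond to the end of the prompt.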

From Hanno Labs.

Model provider

Hanno-Labs

Model tree

Base

Qwen/Qwen3-Reranker-4B

Adapter

this model

Modalities

Input

Text

Output

Text
