License: apache-2.0
Evaluation (post-fix, 3-judge panel)
Mean score (0–100) on 15 held-out prompts, graded by Claude Opus 4.7, GPT-5.5, and a
local Qwen-3B (gpt-oss experts is a deliberately un-retrained stale control):
| model | Claude | GPT-5.5 | Qwen-3B | Avg |
|---|---|---|---|---|
| gpt-5.5 (frontier ceiling) | 94.6 | 95.6 | 90.8 | 93.7 |
| gpt-oss attn (retrained teacher) | 82.0 | 66.8 | 81.4 | 76.7 |
| qwen-0.5b distilled (served) | 79.0 | 68.6 | 82.2 | 76.6 |
| qwen-0.5b direct 7k (served) | 78.6 | 64.4 | 82.0 | 75.0 |
| gpt-oss experts (stale control) | 67.6 | 68.6 | 81.8 | 72.7 |
| qwen-3b base | 62.1 | 67.1 | 80.5 | 69.9 |
| gpt-oss base | 55.4 | 53.8 | 68.2 | 59.1 |
| qwen-0.5b base | 36.5 | 44.5 | 67.9 | 49.7 |
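On the three-judge average, both served retrained 0.5B models beat the stale control and every untuned base, and the distilled 0.5B roughly ties its own 20B teacher (76.6 vs 76.7). The GPT-5.5 judge is the exception: it scores the direct 7k model (64.4) below the stale control (68.6) and the qwen-3b base (67.1), and it scores the distilled model level with the stale control at 68.6.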
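As a minimal sketch of how the Avg column can be recomputed, the snippet below takes four rows from the table and averages the three judge scores. It assumes Avg is the unweighted mean of the judge columns, which the README does not state explicitly but which these rows are consistent with. The names `SCORES` and `judge_average` are illustrative only.

```python
# Judge scores (Claude, GPT-5.5, Qwen-3B) copied from four rows of the table.
SCORES = {
    "gpt-oss attn (retrained teacher)": (82.0, 66.8, 81.4),
    "qwen-0.5b distilled (served)": (79.0, 68.6, 82.2),
    "qwen-0.5b direct 7k (served)": (78.6, 64.4, 82.0),
    "gpt-oss experts (stale control)": (67.6, 68.6, 81.8),
}


def judge_average(scores):
    """Unweighted mean across judges, rounded to one decimal (assumed method)."""
    return round(sum(scores) / len(scores), 1)


# Rank models by their unrounded mean, highest first.
ranking = sorted(SCORES, key=lambda name: sum(SCORES[name]) / 3, reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: {judge_average(SCORES[name])}")
```

Averaging across judges smooths over disagreement between them, so it is worth reading the per-judge columns alongside the mean.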
Limitations
- 0.5B capacity; prompt-format-frozen (see below). A purpose-built ProofKit component.
About ProofKit
ProofKit is a work-sample generator for job seekers — it turns a target role, background, and skills-to-prove into a realistic, clearly-fictional practice work sample (a role-specific challenge, a guided builder, a readiness review, and a recruiter-ready portfolio packet). Built for the Hugging Face Build Small Hackathon (Backyard AI track). Integrity rules are load-bearing: outputs never claim real employment, metrics are labeled hypothetical, and exports carry an ethical disclosure.
The ProofKit model family
| Repo | What it is |
|---|---|
| visproj/proofkit-qwen0.5b-7k | Qwen2.5-0.5B fine-tuned directly on the 7k set (Transformers) |
| visproj/proofkit-gpt-oss-20b-lora | gpt-oss-20b LoRA, the distillation teacher |
| visproj/proofkit-distilled-qwen0.5b | Qwen2.5-0.5B distilled from the teacher (merged) |
| visproj/proofkit-distilled-qwen0.5b-gguf | GGUF of the distilled student (llama.cpp, served) |
| visproj/proofkit-sft | SFT dataset (synthetic, license-safe) |
| visproj/proofkit-distill-qwen0.5b | Distillation dataset (teacher completions) |
A note on training data (the "static responses" fix)
An earlier version of these models produced repetitive, input-ignoring drafts. The
root cause was synthetic-data leakage: the dataset rendered the example user
answers and the target from the same template slots, so the model learned
target = template instead of target = f(input). The fix — faithfulness anchors
(a distinctive token shared by the answer and the target) + seeded per-example
variation across every task, then a full-chain retrain — is what these current
weights reflect.
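The idea behind the fix can be sketched roughly as follows. This is not the ProofKit data generator: the word pools, the field names, and the functions `make_example` and `is_faithful` are all hypothetical, chosen only to show a distinctive anchor token shared by the input and the target, plus a per-example seeded random generator that varies the wording.

```python
import random

# Hypothetical pools; the real dataset's slots are not shown in this README.
ANCHOR_WORDS = ["heron", "quartz", "saffron", "lantern", "juniper", "cobalt"]
OPENERS = ["I led", "I rebuilt", "I documented", "I tested"]
CLOSERS = ["is summarized below.", "is the focus of this draft.", "anchors this sample."]


def make_example(index, base_seed=0):
    """Build one synthetic (input, target) pair with a shared anchor token."""
    # One RNG per example, so every example is reproducible yet varied.
    rng = random.Random(base_seed * 1_000_003 + index)
    anchor = f"{rng.choice(ANCHOR_WORDS)}-{rng.randrange(1000, 10000)}"
    user_answer = f"{rng.choice(OPENERS)} the {anchor} project."
    # The target must carry the anchor, so it depends on the input text.
    target = f"Draft: the candidate's {anchor} project {rng.choice(CLOSERS)}"
    return {"input": user_answer, "target": target, "anchor": anchor}


def is_faithful(example):
    """True when the anchor token appears in both the input and the target."""
    return example["anchor"] in example["input"] and example["anchor"] in example["target"]


if __name__ == "__main__":
    for i in range(3):
        print(make_example(i))
```

Because the anchor differs between examples, a model that ignores the input cannot reproduce it in the output, which makes input-ignoring behaviour visible in both the training loss and a simple post-hoc check like `is_faithful`.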
Prompt format is a frozen contract
These 0.5B models were trained on the exact prompt shapes from ProofKit's
prompt_formats.py. They only behave well when prompted in that format; reworded or
free-form prompts push them off-distribution. They are purpose-built components of the
ProofKit app, not general chat models.
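One way to respect a frozen prompt contract is to route every request through a single formatting function, so that callers never hand-write prompts. The sketch below uses a PLACEHOLDER template, not the real shapes from prompt_formats.py, which are not reproduced in this README. `CHALLENGE_TEMPLATE` and `build_prompt` are hypothetical names, and the three fields are taken from the app description (target role, background, skills to prove).

```python
from string import Template

# PLACEHOLDER shape for illustration only; the real format lives in
# ProofKit's prompt_formats.py and will differ from this.
CHALLENGE_TEMPLATE = Template(
    "ROLE: $role\n"
    "BACKGROUND: $background\n"
    "SKILLS: $skills\n"
    "TASK: write a practice challenge\n"
)


def build_prompt(role, background, skills):
    """Render the one allowed prompt shape, rejecting empty fields."""
    fields = {
        "role": role,
        "background": background,
        "skills": ", ".join(skills),
    }
    for name, value in fields.items():
        if not value.strip():
            raise ValueError(f"missing field: {name}")
    return CHALLENGE_TEMPLATE.substitute(fields)


if __name__ == "__main__":
    print(build_prompt("Data analyst", "Two years in retail ops", ["SQL", "dashboards"]))
```

Keeping the template in one place means any change to the prompt shape is a visible, deliberate edit that can be paired with a retrain, rather than drift introduced by individual callers.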
Model provider: build-small-hackathon
Base model: Qwen/Qwen2.5-0.5B-Instruct (this model is a fine-tune of it)
Modalities: text input, text output