Table with columns: Benchmark, Accuracy, StdErr, Completed, Return code
Benchmark
Accuracy
StdErr
Completed
Return code
humaneval
0.0
0.0
164/164
0
mbpp
0.0
0.0
1285/1285
0
Raw Inspect JSON logs and compact summaries are in evals/.
Known Eval Limits
Inspect evals used a custom Gemma chat template passed through -M chat_template=... because the stock Inspect HF provider handed ChatMessage objects to a dict-oriented tokenizer template.
Inspect evals used max_connections=8 on l4x1 to complete the full MBPP run within the HF Job timeout.
Adapter selection loss was recomputed by adapter-only recovery jobs because the original sweep jobs pushed adapters but crashed in the Trackio callback before writing eval_results.json.
The README table reports Inspect accuracy and stderr; it does not publish separate pass@k columns.
Inspect used the local Hugging Face Transformers backend on l4x1 with max_connections=8, not vLLM throughput.
The eval sandbox is inside the HF Job, not Docker; this is less isolated than leaderboard-grade Docker execution.
Dataset and Privacy Notes
The source dataset was exported with
pi-share-hf, including deterministic
redaction, deny-pattern filtering, TruffleHog scanning, and LLM review before
upload. This run consumes the published redacted dataset only.
The SFT converter:
splits by session file before extracting examples;
strips assistant thinking blocks and thinking signatures;
represents tool calls as text;
folds tool results into user context;
omits image, audio, and video payloads for text-only training.
Reproducibility
The generated child scripts used by the controller are stored under
run_scripts/. The article-style write-up is in ARTICLE.md.