Current Proof Gate
Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue
Evaluation set: 50 HumanEval/MBPP-style coding tasks used for fast iteration.
| Phase | Greedy pass@1 | Coverage@K | Selected@K | Repair | Final |
|---|---|---|---|---|---|
| Qwen2.5-Coder-7B reference harness | 37/50 | 40/50 | 40/50 | 2/50 | 42/50 |
| v5 7B adapter primary | 37/50 | 42/50 | 42/50 | 2/50 | 44/50 |
| 14B rescue on primary misses | 1/6 | 3/6 | 3/6 | 1/6 | 4/6 |
| v5 combined rescue system | 38/50 | 45/50 | 45/50 | 3/50 | 48/50 |
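
The rows above appear to compose in a simple way, and the sketch below checks that reading. It assumes that Final equals Selected@K plus Repair, that the 14B rescue runs only on the 6 tasks the primary system missed, and that the combined row is the column-wise sum of the primary and rescue rows. These are inferences from the numbers in the table, not statements taken from the proof kernel, and the dictionary layout and function name are illustrative.

```python
# Counts copied from the table: (greedy, coverage, selected, repair, final, n_tasks).
rows = {
    "reference": (37, 40, 40, 2, 42, 50),
    "primary": (37, 42, 42, 2, 44, 50),
    "rescue": (1, 3, 3, 1, 4, 6),
    "combined": (38, 45, 45, 3, 48, 50),
}


def final_from_parts(selected, repair):
    # Assumption: repair is only counted on tasks that selection did not solve.
    return selected + repair


# Check Final == Selected@K + Repair for every row.
final_checks = {
    name: final_from_parts(selected, repair) == final
    for name, (_, _, selected, repair, final, _) in rows.items()
}

# Check that the rescue pool is exactly the set of primary misses (50 - 44 = 6).
primary_misses = rows["primary"][5] - rows["primary"][4]
rescue_pool_matches = primary_misses == rows["rescue"][5]

# Check that combined == primary + rescue for the five count columns.
combined_from_parts = tuple(
    p + r for p, r in zip(rows["primary"][:5], rows["rescue"][:5])
)
combined_matches = combined_from_parts == rows["combined"][:5]

print(final_checks, rescue_pool_matches, combined_matches)
```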
Lift
Against the Qwen2.5-Coder-7B reference harness result of 42/50:
- LoRA-only primary system: 44/50, a +2/50 absolute improvement.
- LoRA-only percentage-point lift: +4 points.
- LoRA-only relative lift: +4.76%.
- Full v5 rescue system: 48/50, a +6/50 absolute improvement.
- Full system percentage-point lift: +12 points.
- Full system relative lift: +14.29%.
- Failure reduction: from 8 misses to 2 misses, a 75% reduction in failures on this gate.
The honest conclusion: the LoRA adapter alone yields a small gain (+2 tasks out of 50). Most of the improvement (+4 of the +6 tasks) comes from the deterministic verifier/rescue system.
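
The lift figures follow from the pass counts by straightforward arithmetic. A minimal sketch, assuming the 42/50 reference result as the baseline; the function name and dictionary keys are illustrative and do not come from the proof kernel:

```python
def lift_metrics(baseline_passed, candidate_passed, n_tasks):
    """Compute lift figures from raw pass counts on the same task set."""
    baseline_failed = n_tasks - baseline_passed
    candidate_failed = n_tasks - candidate_passed
    if baseline_passed == 0 or baseline_failed == 0:
        raise ValueError("relative lift or failure reduction is undefined")
    gained = candidate_passed - baseline_passed
    return {
        "absolute_tasks": gained,
        # Percentage points: difference between the two pass rates.
        "percentage_points": 100.0 * gained / n_tasks,
        # Relative lift: gain as a fraction of the baseline pass count.
        "relative_pct": round(100.0 * gained / baseline_passed, 2),
        # Failure reduction: share of baseline misses that were eliminated.
        "failure_reduction_pct": 100.0
        * (baseline_failed - candidate_failed)
        / baseline_failed,
    }


lora_only = lift_metrics(baseline_passed=42, candidate_passed=44, n_tasks=50)
full_system = lift_metrics(baseline_passed=42, candidate_passed=48, n_tasks=50)
print(lora_only)
print(full_system)
```

Relative lift divides by the baseline pass count (42), while the percentage-point lift divides by the number of tasks (50), which is why the two figures differ.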
What This Is Not
This is not a claimed Claude Sonnet 4.5 replacement.
This is not a broad SWE-bench win.
This is not a proof that the raw 7B weights beat frontier models.
The release is a reproducible intermediate artifact: a compact coding model plus a verifier-oriented harness that shows a measurable improvement on a fast gate.
Required Next Benchmarks
The current gate is intentionally small. It is useful for fast iteration only. Before making larger claims, the next evaluation batch must include:
- LiveCodeBench: fresh contest-style coding problems, preferably recent slices only.
- BigCodeBench: broader function-level and library-use coding tasks.
- SWE-bench Lite or Verified subset: repository patching with real tests.
- Agentic edit tasks: file editing, test execution, patch generation, and repair loops.
- Cost and latency: wall-clock time, tokens generated, GPU class, and estimated dollar cost.
- Abstention rate: how often the system refuses to answer or returns no valid patch.
- Invalid-output rate: markdown leakage, missing entrypoint, syntax errors, test leakage, and prose leakage.
- Selector diagnostics: coverage@K, selected@K, selector gap, repair@1, and false-positive verifier selections.
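
The selector diagnostics in the last item can be computed from per-task rollouts. The sketch below is one possible reading, under these assumptions: coverage@K counts tasks where at least one of the K candidates passes the hidden tests, selected@K counts tasks where the chosen candidate passes, the selector gap is the difference between the two, and a false-positive verifier selection is a chosen candidate that the verifier accepted but that fails the hidden tests. The `TaskRollout` record and its field names are hypothetical, not the harness's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRollout:
    # Hypothetical per-task record; field names are illustrative.
    task_id: str
    candidate_passes: List[bool]   # hidden-test result for each sampled candidate
    selected_index: Optional[int]  # candidate picked by the selector, None if abstained
    verifier_accepted: bool        # whether the verifier approved the picked candidate


def selector_diagnostics(rollouts):
    coverage = sum(any(r.candidate_passes) for r in rollouts)
    selected = sum(
        r.selected_index is not None and r.candidate_passes[r.selected_index]
        for r in rollouts
    )
    false_positives = sum(
        r.selected_index is not None
        and r.verifier_accepted
        and not r.candidate_passes[r.selected_index]
        for r in rollouts
    )
    abstained = sum(r.selected_index is None for r in rollouts)
    return {
        "coverage_at_k": coverage,
        "selected_at_k": selected,
        "selector_gap": coverage - selected,
        "false_positive_selections": false_positives,
        "abstained": abstained,
    }


# Toy data for illustration only; these are not results from the proof run.
toy_rollouts = [
    TaskRollout("t1", [True, False], 0, True),
    TaskRollout("t2", [False, True], 0, True),     # wrong pick, verifier fooled
    TaskRollout("t3", [False, False], None, False),  # abstained
    TaskRollout("t4", [False, False], 1, False),
]
diagnostics = selector_diagnostics(toy_rollouts)
print(diagnostics)
```

A nonzero selector gap means a passing candidate was sampled but not chosen, which points at the selector rather than the generator.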
Recommended Evaluation Policy
Do not run all training, evaluation, and release work inside a single notebook.
Use deterministic batches:
- Baseline batch: run the base model first, no training.
- Candidate batch: run the candidate model/harness on the exact same tasks.
- Failure batch: collect failed tasks, failed code, verifier output, and minimal repair.
- Repair batch: train or prompt only on verified repair data.
- Proof batch: rerun held-out tests immediately.
- Release batch: publish only if the proof gate beats the previous best.
Every batch should emit JSON summaries, task-level CSV, rollouts, error signatures, and environment metadata.
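
As a rough illustration of the per-batch outputs and the release rule, the sketch below writes a JSON summary and a task-level CSV using only the standard library, then applies a "strictly beats the previous best" gate. The file names, the result-dictionary keys, and both function names are hypothetical; the actual harness may organize its outputs differently, and rollouts are omitted here for brevity.

```python
import csv
import json
import platform
import sys
import tempfile
from pathlib import Path


def write_batch_outputs(out_dir, batch_name, task_results):
    """Write <batch>_summary.json and <batch>_tasks.csv (names are illustrative)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    summary = {
        "batch": batch_name,
        "n_tasks": len(task_results),
        "passed": sum(1 for r in task_results if r["passed"]),
        # Minimal environment metadata; a real run would also record GPU class, etc.
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
    summary_path = out / f"{batch_name}_summary.json"
    summary_path.write_text(json.dumps(summary, indent=2, sort_keys=True))

    csv_path = out / f"{batch_name}_tasks.csv"
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["task_id", "passed", "error_signature"]
        )
        writer.writeheader()
        for r in task_results:
            writer.writerow(
                {
                    "task_id": r["task_id"],
                    "passed": r["passed"],
                    "error_signature": r.get("error_signature", ""),
                }
            )
    return summary_path, csv_path


def should_release(proof_passed, previous_best_passed):
    # Release only on a strict improvement over the previous best proof gate.
    return proof_passed > previous_best_passed


# Toy usage with made-up task results.
with tempfile.TemporaryDirectory() as tmp:
    results = [
        {"task_id": "t1", "passed": True},
        {"task_id": "t2", "passed": False, "error_signature": "AssertionError"},
    ]
    summary_path, csv_path = write_batch_outputs(tmp, "proof", results)
    loaded_summary = json.loads(summary_path.read_text())
    csv_lines = csv_path.read_text().splitlines()
print(loaded_summary["passed"], len(csv_lines))
```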
Files
adapter_model.safetensors: LoRA adapter.
adapter_config.json: PEFT configuration.
v5_rescue_release_summary.json: exact proof-run summary.
v5_rescue_eval_before_after_full_code.csv: task-level proof-run table.
Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM-LoRA"

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(base, adapter_id)
```
For best results, use the model inside a strict code-only verifier harness. Do not evaluate it with casual chat prompts alone.
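
What "strict code-only" means in practice depends on the harness. As a minimal sketch, assuming the model output should be plain Python that defines a known entry-point function, a pre-execution check might look like the following. The category labels and the `classify_output` helper are illustrative, not the release's actual verifier, and a real harness would go on to run the tests in a sandbox.

```python
import ast

FENCE = "`" * 3  # markdown code-fence marker


def classify_output(text, entry_point):
    """Return a coarse validity label for one raw model output."""
    if FENCE in text:
        return "markdown_leakage"
    try:
        tree = ast.parse(text)
    except (SyntaxError, ValueError):
        # Leading or trailing prose usually lands here as well.
        return "syntax_error"
    defined = {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    if entry_point not in defined:
        return "missing_entrypoint"
    return "ok"


clean = "def add(a, b):\n    return a + b\n"
fenced = FENCE + "python\n" + clean + FENCE
chatty = "Here is the code:\n" + clean
wrong_name = "def plus(a, b):\n    return a + b\n"

labels = [
    classify_output(clean, "add"),
    classify_output(fenced, "add"),
    classify_output(chatty, "add"),
    classify_output(wrong_name, "add"),
]
print(labels)
```

Only outputs labeled `ok` would be passed on to test execution; the other labels can be tallied to track the invalid-output rate.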