Current Proof Gate
Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue
Evaluation set: 50 HumanEval/MBPP-style tasks used for fast iteration.
| Phase | Greedy pass@1 | Coverage@K | Selected@K | Repair | Final |
|---|---|---|---|---|---|
| Qwen2.5-Coder-7B reference harness | 37/50 | 40/50 | 40/50 | 2/50 | 42/50 |
| v5 7B adapter/merged primary | 37/50 | 42/50 | 42/50 | 2/50 | 44/50 |
| 14B rescue on primary misses | 1/6 | 3/6 | 3/6 | 1/6 | 4/6 |
| v5 combined rescue system | 38/50 | 45/50 | 45/50 | 3/50 | 48/50 |
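The rows of the table can be cross-checked against each other. The sketch below hard-codes the counts from the table and tests two relationships that the numbers appear to satisfy: Final equals Selected@K plus Repair in every row, and the combined row equals the primary row plus the rescue row. This additive reading is inferred from the numbers rather than stated as a definition, and all variable names are illustrative.

```python
# Counts copied from the table: (greedy, coverage, selected, repair, final).
rows = {
    "reference": (37, 40, 40, 2, 42),
    "primary": (37, 42, 42, 2, 44),
    "rescue": (1, 3, 3, 1, 4),
    "combined": (38, 45, 45, 3, 48),
}


def final_is_selected_plus_repair(row):
    """Check the assumed relationship Final = Selected@K + Repair."""
    _, _, selected, repair, final = row
    return selected + repair == final


def combined_is_sum(primary, rescue, combined):
    """Check that every combined count is primary + rescue, column by column."""
    return all(p + r == c for p, r, c in zip(primary, rescue, combined))


row_checks = {name: final_is_selected_plus_repair(row) for name, row in rows.items()}
sum_check = combined_is_sum(rows["primary"], rows["rescue"], rows["combined"])

# The rescue model only sees tasks the primary system failed to solve.
rescue_pool = 50 - rows["primary"][4]

print(row_checks, sum_check, rescue_pool)
```

The rescue pool of 6 tasks matches the denominator used in the 14B rescue row.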
Lift Summary
The 7B merged model alone improved the final harness score from 42/50 to 44/50.
That is:
- +2/50 absolute tasks.
- +4 percentage points.
- +4.76% relative improvement over the 42/50 reference.
The full v5 rescue system improved from 42/50 to 48/50.
That is:
- +6/50 absolute tasks.
- +12 percentage points.
- +14.29% relative improvement.
- 75% failure reduction, from 8 failures to 2 failures.
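The lift figures above follow from a few ratios, which the short sketch below recomputes from the pass counts. The function name `lift` and its return layout are illustrative choices, not part of the released harness.

```python
def lift(before, after, total):
    """Return (absolute tasks, percentage points, relative %, failure reduction %)."""
    absolute = after - before
    points = 100 * absolute / total
    relative = 100 * absolute / before
    failures_before = total - before
    failures_after = total - after
    failure_reduction = 100 * (failures_before - failures_after) / failures_before
    return absolute, points, relative, failure_reduction


primary = lift(42, 44, 50)   # 7B merged model alone
combined = lift(42, 48, 50)  # full v5 rescue system

print([round(x, 2) for x in primary])
print([round(x, 2) for x in combined])
```

Applied to the 7B merged model alone, the same formula gives a 25% failure reduction, from 8 failures to 6.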
Interpretation
This model should be viewed as a compact coding component, not a frontier-model replacement by itself.
The practical artifact is:
- 7B merged model for primary code generation.
- Deterministic verifier/test runner.
- Candidate selection by executable tests.
- Repair pass for failed candidates.
- Optional rescue model for missed tasks.
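The strongest result (48/50) requires the full harness, including the optional rescue model; greedy decoding from the 7B merged model alone scores 37/50.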
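The way these components might fit together can be sketched as a small control loop: generate K candidates, select one by running tests, attempt a repair on failed candidates, then fall back to a rescue model. Everything below is a hypothetical illustration, not the released harness: `run_stage`, `solve`, and the toy generator, repair, and verifier functions are stand-ins for real model calls and a real test runner.

```python
def run_stage(task, generate, repair, passes_tests, k):
    """Try selection by tests first, then one repair attempt per failed candidate."""
    candidates = generate(task, k)
    for candidate in candidates:
        if passes_tests(candidate):
            return candidate, "selected"
    for candidate in candidates:
        fixed = repair(task, candidate)
        if passes_tests(fixed):
            return fixed, "repair"
    return None, "failed"


def solve(task, primary, rescue, repair, passes_tests, k=4):
    """Run the primary stage, then the optional rescue stage on a miss."""
    code, how = run_stage(task, primary, repair, passes_tests, k)
    if code is None and rescue is not None:
        code, how = run_stage(task, rescue, repair, passes_tests, k)
        if code is not None:
            how = "rescue-" + how
    return code, how


# Toy stand-ins. A real harness would call models and an isolated test runner;
# exec() is used here only to keep the example self-contained.
def passes_tests(code):
    namespace = {}
    try:
        exec(code, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def toy_primary(task, k):
    return ["def add(a, b):\n    return a - b"] * k


def toy_repair(task, failed):
    return failed.replace("a - b", "a + b")


def toy_noop_repair(task, failed):
    return failed


def toy_rescue(task, k):
    return ["def add(a, b):\n    return a + b"]


code, how = solve("add two numbers", toy_primary, None, toy_repair, passes_tests)
rescued_code, rescued_how = solve(
    "add two numbers", toy_primary, toy_rescue, toy_noop_repair, passes_tests
)
print(how, rescued_how)
```

Returning the stage label alongside the code makes it possible to count how many tasks were solved by selection, by repair, or by rescue, which is the breakdown a results table of this kind needs.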
Limitations
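- The current proof gate is small: 50 tasks, so each task is worth 2 percentage points, and the 14B rescue was measured on only 6 tasks.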
- HumanEval/MBPP-style tasks are not enough to establish broad coding-agent quality.
- No broad SWE-bench claim is made.
- No Claude Sonnet 4.5 win is claimed.
- Contamination risk must be handled carefully on common public coding benchmarks.
Required Next Benchmarks
Future claims should be gated by a broader eval suite:
- LiveCodeBench, using recent and non-training-contaminated slices.
- BigCodeBench, including realistic library/function behavior.
- SWE-bench Lite, then SWE-bench Verified if the lite run is promising.
- Repo-edit tasks with hidden tests.
- Agentic tool-use tasks: edit, run tests, inspect failures, patch again.
- Cost and latency: total wall-clock, GPU type, tokens per task, repair count, and success per dollar.
- Abstention and invalid-output rates.
- Robustness under strict code-only output constraints.
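The cost, latency, abstention, and invalid-output items in the list above could be aggregated from per-task records along the following lines. This is a minimal sketch under stated assumptions: the `TaskRecord` fields are hypothetical, an empty output is treated as an abstention, and "invalid" means the raw output does not parse as Python, which is one possible reading of a strict code-only constraint. The demo records are made-up values, not measurements.

```python
import ast
from dataclasses import dataclass


@dataclass
class TaskRecord:
    passed: bool
    wall_seconds: float
    tokens: int
    repairs: int
    cost_usd: float
    output: str  # raw model output; an empty string is counted as an abstention


def is_valid_python(text):
    """Return True if the text parses as Python source."""
    try:
        ast.parse(text)
        return True
    except (SyntaxError, ValueError):
        return False


def summarize(records):
    n = len(records)
    solved = sum(1 for r in records if r.passed)
    total_cost = sum(r.cost_usd for r in records)
    abstained = sum(1 for r in records if not r.output.strip())
    invalid = sum(
        1 for r in records if r.output.strip() and not is_valid_python(r.output)
    )
    return {
        "pass_rate": solved / n,
        "abstention_rate": abstained / n,
        "invalid_output_rate": invalid / n,
        "mean_wall_seconds": sum(r.wall_seconds for r in records) / n,
        "mean_tokens": sum(r.tokens for r in records) / n,
        "mean_repairs": sum(r.repairs for r in records) / n,
        # None when no cost was recorded, to avoid dividing by zero.
        "success_per_dollar": solved / total_cost if total_cost > 0 else None,
    }


# Illustrative records only.
demo_records = [
    TaskRecord(True, 10.0, 400, 0, 0.25, "def f():\n    return 1\n"),
    TaskRecord(True, 20.0, 800, 1, 0.5, "def g(x):\n    return x\n"),
    TaskRecord(False, 5.0, 100, 0, 0.125, ""),
    TaskRecord(False, 15.0, 700, 2, 0.125, "```python\ndef h(): pass\n```"),
]
summary = summarize(demo_records)
print(summary)
```

The last demo record wraps its code in a markdown fence, so it fails to parse and is counted as invalid output rather than as a wrong answer.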
Batch-Based Release Discipline
The next iteration should avoid giant all-in-one notebooks.
Preferred release process:
- baseline: evaluate base model only.
- candidate: evaluate one candidate change only.
- failure_forge: collect failed attempts and verifier observations.
- repair_train: train only on verified minimal repairs.
- heldout_eval: rerun held-out benchmark tasks.
- release: push LoRA, merged model, and GGUF only after the gate passes.
Each batch should have a separate Kaggle notebook, capped runtime, deterministic output files, and explicit pass/fail criteria.
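Explicit pass/fail criteria and deterministic output files could look something like the sketch below. The `GateCriteria` fields, the threshold of 45 tasks, and the one-hour runtime cap are hypothetical values chosen for illustration; the document does not specify actual gate thresholds.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCriteria:
    min_passed: int             # hypothetical threshold
    max_runtime_seconds: float  # hypothetical runtime cap


def check_gate(passed, total, runtime_seconds, criteria):
    """Return a report dict with an explicit verdict and the reasons for failure."""
    reasons = []
    if passed < criteria.min_passed:
        reasons.append(
            f"passed {passed}/{total}, need at least {criteria.min_passed}"
        )
    if runtime_seconds > criteria.max_runtime_seconds:
        reasons.append(
            f"runtime {runtime_seconds}s exceeds cap {criteria.max_runtime_seconds}s"
        )
    return {
        "batch_passed": not reasons,
        "passed": passed,
        "total": total,
        "runtime_seconds": runtime_seconds,
        "reasons": reasons,
    }


def to_deterministic_json(report):
    """Serialize with sorted keys so identical reports produce identical files."""
    return json.dumps(report, sort_keys=True, indent=2)


criteria = GateCriteria(min_passed=45, max_runtime_seconds=3600.0)
primary_report = check_gate(44, 50, 1800.0, criteria)
combined_report = check_gate(48, 50, 1800.0, criteria)
print(to_deterministic_json(primary_report))
print(to_deterministic_json(combined_report))
```

Under these illustrative thresholds the 44/50 result would fail the gate and the 48/50 result would pass. Writing the reasons into the report keeps each batch's verdict auditable without rerunning the notebook.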
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
)
For meaningful results, run the model in a verifier harness rather than judging raw single responses.
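A deterministic test runner for such a harness can be as simple as writing the generated code plus its tests to a temporary file and checking the exit status of a separate Python process. The sketch below is one possible shape, not the runner used for the reported results: `run_candidate` is a hypothetical helper, and a subprocess with a timeout is not a security sandbox, so untrusted model output should still be run in an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_code, test_code, timeout_seconds=10.0):
    """Return True if the candidate plus its tests exits with status 0 in time."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            completed = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            # A hung candidate counts as a failure.
            return False
    return completed.returncode == 0


# Toy candidates standing in for decoded model output.
good_candidate = "def add(a, b):\n    return a + b"
bad_candidate = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

good_result = run_candidate(good_candidate, tests)
bad_result = run_candidate(bad_candidate, tests)
print(good_result, bad_result)
```

In practice the candidate strings would come from decoding the model's output, and the boolean result would drive candidate selection and the repair pass.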