License: apache-2.0
Current Proof Gate
Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue
Evaluation set: 50 HumanEval/MBPP-style tasks used for fast iteration.
| Phase | Greedy pass@1 | Coverage@K | Selected@K | Repair | Final |
|---|---|---|---|---|---|
| Qwen2.5-Coder-7B reference harness | 37/50 | 40/50 | 40/50 | 2/50 | 42/50 |
| v5 7B adapter/merged primary | 37/50 | 42/50 | 42/50 | 2/50 | 44/50 |
| 14B rescue on primary misses | 1/6 | 3/6 | 3/6 | 1/6 | 4/6 |
| v5 combined rescue system | 38/50 | 45/50 | 45/50 | 3/50 | 48/50 |
Lift Summary
The 7B merged model alone improved the final harness score from 42/50 to 44/50.
That is:
- +2/50 absolute tasks.
- +4 percentage points.
- +4.76% relative improvement over the 42/50 reference.
The full v5 rescue system improved from 42/50 to 48/50.
That is:
- +6/50 absolute tasks.
- +12 percentage points.
- +14.29% relative improvement.
- 75% failure reduction, from 8 failures to 2 failures.
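The lift figures above follow directly from the pass counts in the table. The sketch below reproduces the arithmetic; `lift_summary` and its dictionary keys are hypothetical names introduced here for illustration and are not part of the released harness.

```python
def lift_summary(baseline_passed: int, new_passed: int, total: int) -> dict:
    """Compare two pass counts measured on the same task set.

    Assumes 0 < baseline_passed < total, so neither ratio divides by zero.
    """
    absolute = new_passed - baseline_passed
    baseline_failures = total - baseline_passed
    new_failures = total - new_passed
    return {
        # Tasks gained over the baseline.
        "absolute_tasks": absolute,
        # Gain as a share of the whole task set.
        "percentage_points": round(100 * absolute / total, 2),
        # Gain relative to the baseline pass count.
        "relative_improvement_pct": round(100 * absolute / baseline_passed, 2),
        # Share of baseline failures that were eliminated.
        "failure_reduction_pct": round(
            100 * (baseline_failures - new_failures) / baseline_failures, 2
        ),
    }


if __name__ == "__main__":
    print("7B merged model:", lift_summary(42, 44, 50))
    print("v5 rescue system:", lift_summary(42, 48, 50))
```

The relative improvement divides by the 42 tasks the reference passed, while the failure reduction divides by the 8 tasks it failed. That is why the same +6 tasks reads as 14.29% in one figure and 75% in the other.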
Interpretation
This model should be viewed as a compact coding component, not a frontier-model replacement by itself.
The practical artifact is:
- 7B merged model for primary code generation.
- Deterministic verifier/test runner.
- Candidate selection by executable tests.
- Repair pass for failed candidates.
- Optional rescue model for missed tasks.
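The strongest result, 48/50, requires the full harness with the rescue model; greedy decoding from the 7B model alone scores 37/50, matching the reference.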
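The component list above can be read as a control loop. The following is a minimal sketch under stated assumptions, not the released harness: `attempt`, `solve`, and the callables `primary`, `rescue`, `verify`, and `repair` are hypothetical names, and the ordering (select by tests, then repair, then rescue) is inferred from the list rather than confirmed by the source.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces; the real harness may differ.
Generate = Callable[[str, int], List[str]]  # (prompt, k) -> k candidate programs
Verify = Callable[[str], bool]              # candidate -> True if all tests pass
Repair = Callable[[str, str], str]          # (prompt, failed candidate) -> patched candidate


def attempt(prompt: str, generate: Generate, verify: Verify,
            repair: Repair, k: int) -> Optional[str]:
    """Generate k candidates, select by executable tests, then try one repair each."""
    candidates = generate(prompt, k)
    # Candidate selection: the first candidate that passes the verifier wins.
    for candidate in candidates:
        if verify(candidate):
            return candidate
    # Repair pass: patch each failed candidate once and re-verify.
    for candidate in candidates:
        patched = repair(prompt, candidate)
        if verify(patched):
            return patched
    return None


def solve(prompt: str, primary: Generate, verify: Verify, repair: Repair,
          k: int = 4, rescue: Optional[Generate] = None) -> Optional[str]:
    """Run the primary model first; call the optional rescue model only on a miss."""
    solution = attempt(prompt, primary, verify, repair, k)
    if solution is None and rescue is not None:
        solution = attempt(prompt, rescue, verify, repair, k)
    return solution
```

Passing the models in as plain callables keeps the loop independent of any inference backend, so the same control flow can be exercised with stubs in unit tests.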
Limitations
- The current proof gate is small: with 50 tasks, each task is worth 2 percentage points.
- HumanEval/MBPP-style tasks are not enough to establish broad coding-agent quality.
- No broad SWE-bench claim is made.
- No Claude Sonnet 4.5 win is claimed.
- Contamination risk must be handled carefully on common public coding benchmarks.
Required Next Benchmarks
Future claims should be gated by a broader eval suite:
- LiveCodeBench, using recent and non-training-contaminated slices.
- BigCodeBench, including realistic library/function behavior.
- SWE-bench Lite, then SWE-bench Verified if the lite run is promising.
- Repo-edit tasks with hidden tests.
- Agentic tool-use tasks: edit, run tests, inspect failures, patch again.
- Cost and latency: total wall-clock, GPU type, tokens per task, repair count, and success per dollar.
- Abstention and invalid-output rates.
- Robustness under strict code-only output constraints.
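The cost, latency, abstention, and invalid-output items in the list above could be tracked per task and aggregated per run. This is a sketch only: the `TaskRun` fields and the `summarize` function are assumptions made for illustration, and the source does not specify how these quantities are recorded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRun:
    """One benchmark task as recorded by a hypothetical harness."""
    passed: bool
    valid_output: bool   # False when the response held no usable code
    abstained: bool      # True when the system declined to answer
    wall_clock_s: float
    tokens: int
    repair_count: int
    cost_usd: float


def summarize(runs: List[TaskRun]) -> dict:
    """Aggregate per-task records into run-level metrics."""
    if not runs:
        raise ValueError("need at least one task run")
    n = len(runs)
    passed = sum(r.passed for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "pass_rate": passed / n,
        "invalid_output_rate": sum(not r.valid_output for r in runs) / n,
        "abstention_rate": sum(r.abstained for r in runs) / n,
        "mean_tokens_per_task": sum(r.tokens for r in runs) / n,
        "mean_repairs_per_task": sum(r.repair_count for r in runs) / n,
        "total_wall_clock_s": sum(r.wall_clock_s for r in runs),
        # Undefined when nothing was spent, so report None instead of dividing by zero.
        "success_per_dollar": passed / total_cost if total_cost > 0 else None,
    }
```

Reporting success per dollar next to the pass rate makes the trade-off visible when a rescue model raises accuracy but also raises cost.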
Batch-Based Release Discipline
The next iteration should avoid giant all-in-one notebooks.
Preferred release process:
1. baseline: evaluate base model only.
2. candidate: evaluate one candidate change only.
3. failure_forge: collect failed attempts and verifier observations.
4. repair_train: train only on verified minimal repairs.
5. heldout_eval: rerun held-out benchmark tasks.
6. release: push LoRA, merged model, and GGUF only after the gate passes.
Each batch should have a separate Kaggle notebook, capped runtime, deterministic output files, and explicit pass/fail criteria.
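An explicit pass/fail criterion with deterministic output might look like the sketch below. The criterion itself (the candidate must beat the baseline by at least `min_gain` tasks) and the names `evaluate_gate` and `render_report` are assumptions for illustration; the source does not define the actual gate.

```python
import json


def evaluate_gate(batch: str, baseline_passed: int, candidate_passed: int,
                  total: int, min_gain: int = 1) -> dict:
    """Apply a hypothetical gate: the candidate must gain at least min_gain tasks."""
    gain = candidate_passed - baseline_passed
    return {
        "batch": batch,
        "baseline_passed": baseline_passed,
        "candidate_passed": candidate_passed,
        "total": total,
        "gain": gain,
        "min_gain": min_gain,
        "gate_passed": gain >= min_gain,
    }


def render_report(report: dict) -> str:
    """Serialize with sorted keys so identical results give byte-identical files."""
    return json.dumps(report, sort_keys=True, indent=2) + "\n"


if __name__ == "__main__":
    # Figures taken from the proof-gate table: reference 42/50, 7B merged 44/50.
    print(render_report(evaluate_gate("candidate", 42, 44, 50)), end="")
```

Writing the rendered string to a file at the end of each notebook would give every batch a small artifact that can be diffed between runs.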
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM"

# Load the tokenizer and the merged model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
```
For meaningful results, run the model in a verifier harness rather than judging raw single responses.
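One simple form of verifier runs each generated program together with its tests in a separate Python process and treats a zero exit code as a pass. The sketch below assumes that tests are plain `assert` statements appended to the candidate; `passes_tests` is a hypothetical helper, not the harness used for the results above. A subprocess is not a sandbox, so untrusted model output should still be executed in an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if candidate_code followed by test_code exits with status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Hanging or very slow candidates count as failures.
            return False
    return result.returncode == 0


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    tests = "assert add(2, 3) == 5"
    print(passes_tests(good, tests), passes_tests(bad, tests))
```

A failing assertion, a syntax error, and a timeout all map to the same `False`, which keeps the pass/fail signal simple to consume in a selection loop.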
Model provider: josephmayo
Base model: Qwen/Qwen2.5-Coder-7B-Instruct
Modalities: text input, text output