josephmayo

Qwen2.5-agentic-7B-SLM

README

License: apache-2.0

Current Proof Gate

Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue

Evaluation set: 50 HumanEval/MBPP-style tasks used for fast iteration.

| Phase | Greedy pass@1 | Coverage@K | Selected@K | Repair | Final |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B reference harness | 37/50 | 40/50 | 40/50 | 2/50 | 42/50 |
| v5 7B adapter/merged primary | 37/50 | 42/50 | 42/50 | 2/50 | 44/50 |
| 14B rescue on primary misses | 1/6 | 3/6 | 3/6 | 1/6 | 4/6 |
| v5 combined rescue system | 38/50 | 45/50 | 45/50 | 3/50 | 48/50 |
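
The rows above appear to follow two simple relationships: Final equals Selected@K plus Repair, and the combined row equals the primary row plus the rescue row. The source does not state these as definitions, so the sketch below only checks that the published counts are consistent with that reading. The names `ROWS` and the helper functions are illustrative.

```python
# Counts copied from the table: (greedy, coverage, selected, repair, final, n_tasks).
ROWS = {
    "reference": (37, 40, 40, 2, 42, 50),
    "primary": (37, 42, 42, 2, 44, 50),
    "rescue": (1, 3, 3, 1, 4, 6),
    "combined": (38, 45, 45, 3, 48, 50),
}


def final_is_selected_plus_repair(row):
    """Assumed relationship: Final = Selected@K + Repair."""
    _greedy, _coverage, selected, repair, final, _n = row
    return selected + repair == final


def combined_is_primary_plus_rescue(rows):
    """Assumed relationship: each combined count = primary count + rescue count."""
    primary, rescue, combined = rows["primary"], rows["rescue"], rows["combined"]
    # Compare the five count columns only; the task total stays at 50.
    return all(p + r == c for p, r, c in zip(primary[:5], rescue[:5], combined[:5]))


def rescue_pool_is_primary_misses(rows):
    """The rescue model runs only on tasks the primary system failed."""
    primary, rescue = rows["primary"], rows["rescue"]
    return primary[5] - primary[4] == rescue[5]
```

All three checks hold for the published numbers, which suggests the rescue stage is evaluated only on the 6 tasks the primary system missed.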

Lift Summary

The 7B merged model alone improved the final harness score from 42/50 to 44/50.

That is:

+2/50 absolute tasks.
+4 percentage points.
+4.76% relative improvement over the 42/50 reference.

The full v5 rescue system improved the final harness score from 42/50 to 48/50.

That is:

+6/50 absolute tasks.
+12 percentage points.
+14.29% relative improvement.
75% failure reduction, from 8 failures to 2 failures.
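
The lift figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name `lift_summary` and its return keys are illustrative and not part of the released harness.

```python
def lift_summary(baseline_passed, new_passed, total):
    """Compute absolute, percentage-point, relative, and failure-reduction lift."""
    absolute_tasks = new_passed - baseline_passed
    baseline_failures = total - baseline_passed
    new_failures = total - new_passed
    return {
        "absolute_tasks": absolute_tasks,
        # Difference between the two pass rates, in percentage points.
        "percentage_points": 100.0 * absolute_tasks / total,
        # Improvement relative to the baseline pass count.
        "relative_pct": 100.0 * absolute_tasks / baseline_passed,
        # Share of baseline failures that were eliminated (None if there were none).
        "failure_reduction_pct": (
            100.0 * (baseline_failures - new_failures) / baseline_failures
            if baseline_failures > 0
            else None
        ),
    }
```

Calling `lift_summary(42, 44, 50)` and `lift_summary(42, 48, 50)` and rounding to two decimals gives the +4.76% and +14.29% relative figures, and the second call gives the 75% failure reduction.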

Interpretation

This model should be viewed as a compact coding component, not as a standalone replacement for a frontier model.

The practical artifact is:

7B merged model for primary code generation.
Deterministic verifier/test runner.
Candidate selection by executable tests.
Repair pass for failed candidates.
Optional rescue model for missed tasks.

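The strongest result (48/50) requires the full harness, including the rescue model. With greedy decoding alone, the v5 7B model scores 37/50, the same as the reference.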
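
The components listed above can be wired together roughly as follows. This is a simplified sketch, not the released harness: `solve_task` and the callable signatures are hypothetical, and the rescue stage here only generates and verifies, whereas the reported rescue run also had its own repair step.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; any model or test runner can be plugged in.
Generator = Callable[[str], List[str]]           # prompt -> candidate programs
Verifier = Callable[[str], bool]                 # candidate -> passes the tests?
Repairer = Callable[[str, str], Optional[str]]   # (prompt, failed candidate) -> patch


def solve_task(
    prompt: str,
    generate: Generator,
    verify: Verifier,
    repair: Optional[Repairer] = None,
    rescue_generate: Optional[Generator] = None,
) -> Tuple[Optional[str], str]:
    """Return (solution, stage), where stage names the step that succeeded."""
    candidates = generate(prompt)

    # Candidate selection by executable tests: keep the first passing candidate.
    for candidate in candidates:
        if verify(candidate):
            return candidate, "selected"

    # Repair pass: ask for a patch of each failed candidate and re-verify it.
    if repair is not None:
        for candidate in candidates:
            patched = repair(prompt, candidate)
            if patched is not None and verify(patched):
                return patched, "repaired"

    # Optional rescue model: only consulted for tasks the primary model missed.
    if rescue_generate is not None:
        for candidate in rescue_generate(prompt):
            if verify(candidate):
                return candidate, "rescued"

    return None, "failed"
```

Keeping the verifier as a plain callable means the same loop works with any deterministic test runner, and the returned stage label makes it straightforward to tally per-stage counts.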

Limitations

The current proof gate is small: 50 tasks, with the rescue model evaluated on only 6 of them.
HumanEval/MBPP-style tasks are not enough to establish broad coding-agent quality.
No broad SWE-bench claim is made.
No win over Claude Sonnet 4.5 is claimed.
Contamination risk must be handled carefully on common public coding benchmarks.
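
To illustrate how much uncertainty a 50-task gate carries, the sketch below computes a Wilson score interval for a pass rate. This is a standard binomial interval added here for illustration, not part of the original evaluation. It treats the two runs as independent, so it is only a rough guide: both systems were scored on the same tasks, and a paired comparison would be more appropriate.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passed / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    )
    return center - half_width, center + half_width
```

At n = 50 the intervals for 42/50 and 48/50 are wide and overlap, which is one reason the broader benchmarks in the next section are needed before making stronger claims.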

Required Next Benchmarks

Future claims should be gated by a broader eval suite:

LiveCodeBench, using recent and non-training-contaminated slices.
BigCodeBench, including realistic library/function behavior.
SWE-bench Lite, then SWE-bench Verified if the lite run is promising.
Repo-edit tasks with hidden tests.
Agentic tool-use tasks: edit, run tests, inspect failures, patch again.
Cost and latency: total wall-clock, GPU type, tokens per task, repair count, and success per dollar.
Abstention and invalid-output rates.
Robustness under strict code-only output constraints.
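
The cost, latency, and invalid-output items above could be tracked with one record per task. This is a minimal sketch under assumed field names; `TaskRecord` and `summarize` are hypothetical and do not describe an existing logging format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    """One evaluated task. All field names are illustrative assumptions."""
    task_id: str
    passed: bool
    wall_clock_s: float
    tokens: int
    repair_count: int
    cost_usd: float
    invalid_output: bool = False  # e.g. output violated a code-only constraint


def summarize(records: List[TaskRecord]) -> dict:
    """Aggregate per-task records into the metrics listed above."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    passed = sum(r.passed for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "pass_rate": passed / n,
        "invalid_output_rate": sum(r.invalid_output for r in records) / n,
        "mean_tokens_per_task": sum(r.tokens for r in records) / n,
        "mean_repairs_per_task": sum(r.repair_count for r in records) / n,
        "total_wall_clock_s": sum(r.wall_clock_s for r in records),
        # Successes per dollar; undefined when nothing was spent.
        "success_per_dollar": passed / total_cost if total_cost > 0 else None,
    }
```

GPU type is a property of the run rather than of a task, so it would be recorded once alongside the summary.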

Batch-Based Release Discipline

The next iteration should avoid giant all-in-one notebooks.

Preferred release process:

baseline: evaluate base model only.
candidate: evaluate one candidate change only.
failure_forge: collect failed attempts and verifier observations.
repair_train: train only on verified minimal repairs.
heldout_eval: rerun held-out benchmark tasks.
release: push LoRA, merged model, and GGUF only after the gate passes.

Each batch should have a separate Kaggle notebook, capped runtime, deterministic output files, and explicit pass/fail criteria.
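
An explicit pass/fail criterion with a deterministic output file could look like the sketch below. The threshold `min_lift_tasks` and the function names are assumptions for illustration; the source does not specify the actual gate criteria.

```python
import json


def evaluate_gate(candidate_passed, baseline_passed, total, min_lift_tasks=1):
    """Pass the gate only if the candidate beats the baseline by a set margin."""
    lift = candidate_passed - baseline_passed
    return {
        "baseline_passed": baseline_passed,
        "candidate_passed": candidate_passed,
        "total": total,
        "lift_tasks": lift,
        "min_lift_tasks": min_lift_tasks,  # hypothetical threshold
        "gate_passed": lift >= min_lift_tasks,
    }


def dump_result(result):
    """Serialize with sorted keys so identical results give identical files."""
    return json.dumps(result, sort_keys=True, indent=2)
```

Writing the string from `dump_result` to a file at the end of each notebook gives the release step a single artifact to check before pushing any weights.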

Basic Usage

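```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model repository on the Hugging Face Hub.
model_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM"

# Load the tokenizer and the merged model weights.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place weights on the available GPU(s) or CPU
    trust_remote_code=True,
)
```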

For meaningful results, run the model inside a verifier harness rather than judging single raw responses.
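
A verifier can be as simple as running a candidate together with its tests in a fresh Python process. The sketch below is one possible approach, not the harness used for the reported numbers; `passes_tests` is a hypothetical helper, and it provides a timeout but no sandboxing, so untrusted model output should be run in an isolated environment.

```python
import os
import subprocess
import sys
import tempfile


def passes_tests(candidate_code, test_code, timeout_s=10.0):
    """Return True if the candidate plus its tests exit with status 0."""
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate_test.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures.
            return False
    return proc.returncode == 0
```

A failing assertion, an exception, or a timeout all count as a failed candidate, which keeps the pass/fail signal deterministic for a fixed test file.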

Model Details

Model provider: josephmayo
Base model: Qwen/Qwen2.5-Coder-7B-Instruct (this model is a fine-tune of it)
Input modalities: Text
Output modalities: Text
