
README

License: apache-2.0

Current Proof Gate

Kaggle proof kernel: holykeys/qwen25-coder-agentic-slm-v5-rescue

Evaluation set: 50 HumanEval/MBPP-style tasks used for fast iteration.

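| Phase | Greedy pass@1 | Coverage@K | Selected@K | Repair | Final |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B reference harness | 37/50 | 40/50 | 40/50 | 2/50 | 42/50 |
| v5 7B adapter/merged primary | 37/50 | 42/50 | 42/50 | 2/50 | 44/50 |
| 14B rescue on primary misses | 1/6 | 3/6 | 3/6 | 1/6 | 4/6 |
| v5 combined rescue system | 38/50 | 45/50 | 45/50 | 3/50 | 48/50 |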

Lift Summary

The 7B merged model alone improved the final harness score from 42/50 to 44/50.

That is:

  • +2/50 absolute tasks.
  • +4 percentage points.
  • +4.76% relative improvement over the 42/50 reference.

The full v5 rescue system improved the final harness score from 42/50 to 48/50.

That is:

  • +6/50 absolute tasks.
  • +12 percentage points.
  • +14.29% relative improvement.
  • 75% failure reduction, from 8 failures to 2 failures.
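The figures above can be reproduced from the raw pass counts. The following is a minimal sketch; the function name `lift` and its dictionary keys are illustrative and not part of the released harness.

```python
def lift(baseline_passed: int, new_passed: int, total: int) -> dict:
    """Summarize the change between two pass counts on the same task set."""
    absolute = new_passed - baseline_passed
    baseline_failures = total - baseline_passed
    new_failures = total - new_passed
    return {
        "absolute_tasks": absolute,
        # Change in pass rate, in percentage points of the full task set.
        "percentage_points": 100.0 * absolute / total,
        # Change relative to the number of tasks the baseline already passed.
        "relative_pct": 100.0 * absolute / baseline_passed,
        # Share of the baseline's failures that no longer fail.
        "failure_reduction_pct": 100.0
        * (baseline_failures - new_failures)
        / baseline_failures,
    }


merged_only = lift(baseline_passed=42, new_passed=44, total=50)
full_system = lift(baseline_passed=42, new_passed=48, total=50)

for name, summary in (("7B merged", merged_only), ("v5 rescue system", full_system)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Note that the relative improvement is measured against the 42 tasks the reference passed, while the failure reduction is measured against its 8 failures, which is why the two percentages differ so much.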

Interpretation

This model should be viewed as a compact coding component, not a frontier-model replacement by itself.

The practical artifact is:

  • 7B merged model for primary code generation.
  • Deterministic verifier/test runner.
  • Candidate selection by executable tests.
  • Repair pass for failed candidates.
  • Optional rescue model for missed tasks.

The strongest result (48/50) requires the full harness; the 7B merged model's greedy pass@1 on its own is 37/50.
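The components listed above can be wired together roughly as follows. This is a simplified sketch, not the harness used for the reported numbers: `passes_tests`, `solve`, and the `generate`, `repair`, and `rescue` callables are hypothetical names, and the stub generator stands in for real model calls. It also runs candidate code in a plain subprocess, whereas a real harness would need proper sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

Generator = Callable[[str, int], List[str]]  # (prompt, k) -> k candidate programs
Repairer = Callable[[str, str], str]         # (prompt, failed candidate) -> new candidate


def passes_tests(candidate: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run candidate plus tests in a fresh interpreter; exit code 0 is a pass."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def solve(
    prompt: str,
    test_code: str,
    generate: Generator,
    repair: Optional[Repairer] = None,
    rescue: Optional[Generator] = None,
    k: int = 4,
) -> Optional[str]:
    """Return the first candidate that passes the tests, or None."""
    # Try the primary model first, then the optional rescue model.
    for gen in (generate, rescue):
        if gen is None:
            continue
        candidates = gen(prompt, k)
        # Candidate selection by executable tests.
        for candidate in candidates:
            if passes_tests(candidate, test_code):
                return candidate
        # Repair pass for candidates that failed.
        if repair is not None:
            for candidate in candidates:
                repaired = repair(prompt, candidate)
                if passes_tests(repaired, test_code):
                    return repaired
    return None


def stub_generate(prompt: str, k: int) -> List[str]:
    """Stand-in for a model call: one wrong and one right candidate."""
    candidates = [
        "def add(a, b):\n    return a - b",
        "def add(a, b):\n    return a + b",
    ]
    return candidates[:k]


selected = solve("Write add(a, b).", "assert add(2, 3) == 5", stub_generate)
print(selected)
```

In this sketch the repair callable only sees the prompt and the failed candidate; passing the verifier's failure output to it as well would be a natural extension.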

Limitations

  • The current proof gate is small: with 50 tasks, a single task moves the score by 2 percentage points.
  • HumanEval/MBPP-style tasks are not enough to establish broad coding-agent quality.
  • No broad SWE-bench claim is made.
  • No Claude Sonnet 4.5 win is claimed.
  • Contamination risk must be handled carefully on common public coding benchmarks.

Required Next Benchmarks

Future claims should be gated by a broader eval suite:

  • LiveCodeBench, using recent and non-training-contaminated slices.
  • BigCodeBench, including realistic library/function behavior.
  • SWE-bench Lite, then SWE-bench Verified if the lite run is promising.
  • Repo-edit tasks with hidden tests.
  • Agentic tool-use tasks: edit, run tests, inspect failures, patch again.
  • Cost and latency: total wall-clock, GPU type, tokens per task, repair count, and success per dollar.
  • Abstention and invalid-output rates.
  • Robustness under strict code-only output constraints.
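The cost, latency, abstention, and invalid-output items could be tracked with one record per task and a small aggregation step. The sketch below is one possible shape, not an existing tool: `TaskRecord`, `summarize`, and the field names are assumptions, and the example values are placeholders, not measurements from this model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    task_id: str
    passed: bool
    abstained: bool        # model declined to produce a solution
    invalid_output: bool   # output could not be parsed or executed as code
    wall_clock_s: float
    tokens: int
    repair_count: int


def summarize(records: List[TaskRecord], gpu_usd_per_hour: float) -> Dict[str, float]:
    """Aggregate per-task records into suite-level quality and cost metrics."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    passed = sum(r.passed for r in records)
    total_seconds = sum(r.wall_clock_s for r in records)
    cost_usd = total_seconds / 3600.0 * gpu_usd_per_hour
    return {
        "pass_rate": passed / n,
        "abstention_rate": sum(r.abstained for r in records) / n,
        "invalid_output_rate": sum(r.invalid_output for r in records) / n,
        "mean_tokens_per_task": sum(r.tokens for r in records) / n,
        "mean_repairs_per_task": sum(r.repair_count for r in records) / n,
        "total_wall_clock_s": total_seconds,
        "cost_usd": cost_usd,
        "success_per_dollar": passed / cost_usd if cost_usd > 0 else float("inf"),
    }


# Placeholder records for illustration only; these are not real results.
example_records = [
    TaskRecord("task_a", True, False, False, 90.0, 1200, 0),
    TaskRecord("task_b", False, False, True, 90.0, 800, 1),
]
summary = summarize(example_records, gpu_usd_per_hour=2.0)
print(summary)
```

Recording the GPU type alongside the hourly rate keeps the success-per-dollar figure comparable across runs.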

Batch-Based Release Discipline

The next iteration should avoid giant all-in-one notebooks.

Preferred release process:

  1. baseline: evaluate base model only.
  2. candidate: evaluate one candidate change only.
  3. failure_forge: collect failed attempts and verifier observations.
  4. repair_train: train only on verified minimal repairs.
  5. heldout_eval: rerun held-out benchmark tasks.
  6. release: push LoRA, merged model, and GGUF only after the gate passes.

Each batch should have a separate Kaggle notebook, capped runtime, deterministic output files, and explicit pass/fail criteria.
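An explicit pass/fail criterion with a deterministic output file might look like the sketch below. The names `evaluate_gate` and `write_report`, the `min_gain` threshold, and the file name are hypothetical; the task results shown are placeholders rather than real evaluation output.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def evaluate_gate(results: Dict[str, bool], baseline_passed: int, min_gain: int = 1) -> dict:
    """Decide whether a batch clears the gate: beat the baseline by min_gain tasks."""
    passed = sum(1 for ok in results.values() if ok)
    return {
        "total": len(results),
        "passed": passed,
        "baseline_passed": baseline_passed,
        "min_gain": min_gain,
        "gate_passed": passed >= baseline_passed + min_gain,
    }


def write_report(report: dict, path: Path) -> None:
    """Write the report with sorted keys so reruns produce byte-identical files."""
    path.write_text(json.dumps(report, sort_keys=True, indent=2) + "\n", encoding="utf-8")


# Placeholder per-task results for illustration only.
results = {"task_001": True, "task_002": False, "task_003": True}
report = evaluate_gate(results, baseline_passed=1)

with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "heldout_eval.json"
    write_report(report, report_path)
    reloaded = json.loads(report_path.read_text(encoding="utf-8"))

print(report["gate_passed"], reloaded == report)
```

Fixing the criterion before the run, and storing it in the report next to the outcome, makes it harder to adjust the bar after seeing the results.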

Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "josephmayo/Qwen2.5-Coder-7B-agentic-SLM"

# Load the tokenizer and model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place layers on available devices (requires accelerate)
    trust_remote_code=True,
)
```

For meaningful results, run the model in a verifier harness rather than judging raw single responses.

Model provider: josephmayo

Base model: Qwen/Qwen2.5-Coder-7B-Instruct (this model is a fine-tune of it)

Modalities: text input, text output
