What Changed
- Base model:
Qwen/Qwen2.5-Coder-1.5B-Instruct
- Training method: QLoRA/LoRA adapter
- Hardware: Kaggle
2x Tesla T4
- Training budget:
140 steps, 1721 train rows after filtering
- Data description: manually curated coding data mixed with publicly available coding instruction data. Dataset names and training rows are intentionally not included in this repo.
Same-Size Proof
This comparison is against the same base model and same parameter class: Qwen/Qwen2.5-Coder-1.5B-Instruct before training versus this adapter on top of that base.
Evaluation: 50 HumanEval tasks + 50 MBPP tasks.
Table with columns: Metric, Base Greedy, Forge SLM Adapter + Sampling/Repair| Metric | Base Greedy | Forge SLM Adapter + Sampling/Repair |
|---|
| Total pass | 45 / 100 | 53 / 100 |
| HumanEval | 41 / 50 | 45 / 50 |
| MBPP | 4 / 50 | 8 / 50 |
| Absolute lift | - | +8.0 percentage points |
| Relative pass-count lift | - | +17.78% |
This is not yet a claim of beating frontier models. It is a same-size proof that the SLM adapter plus execution-selected sampling/repair moved the 1.5B coding base upward on two standard coding eval subsets.
Proof Files
See proofs/:
eval_before_after_full_code.csv: raw generations, extracted code, pass/fail, and errors.
before_greedy_full_code.csv: baseline greedy generations.
release_summary_sanitized.json: run metrics and config with dataset names redacted.
trainer_log_history.json: training logs.
nvidia_smi.txt: Kaggle GPU proof.
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_id = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
adapter_id = "josephmayo/Qwen2.5-Coder-1.5B-Forge-SLM"
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()
For benchmark-style tasks, use strict code-only prompting and run generated code against tests. The reported after score uses sampling/repair, not just single greedy decoding.
Limitations
- This is an adapter release, not a merged full-weight model.
- The eval is a 100-task subset: 50 HumanEval + 50 MBPP.
- The after score uses adapter + sampling/repair, so it should be compared to agentic coding usage rather than pure greedy decoding.
- Training data is described but not published in this repo.
Evidence files
Run evidence for this release is stored in the repository under evidence/:
These files are compact local/Kaggle run artifacts used to document training, evaluation, merge, or quantization evidence for this model family.