Why
The base model under-reports issues on real Drupal merge requests (high precision,
low recall). This adapter is trained on a hybrid distillation set so the model
both catches real Drupal anti-patterns (synthetic positives) and stays quiet on
clean code (real merged-MR negatives).
Training data
400 teacher-labeled pairs (distillation_v1): 251 positive / 149 negative.
- Positives — 243 synthetic across 26 Drupal anti-patterns (SQLi, XSS sinks,
CSRF-on-GET, broken DI / missing
create(), accessCheck() omissions, recursion
in presave, deprecated APIs, etc.) + 7 real merge-request bugs.
- Negatives — 149 clean, merged MRs from webform, paragraphs, drupal core,
pathauto, commerce, search_api (teacher-verified clean).
Teacher: Claude Opus 4.x. Each pair is
(diff → JSON verdict + findings).
Usage (with the base model)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
m = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="bfloat16")
m = PeftModel.from_pretrained(m, "bartek-flp/qwen3coder-30b-dcr-lora")
Prompt with the DCR system message (review a diff, output JSON findings only).
Results (A/B vs base, held-out val)
48 held-out pairs the adapter never saw (27 with a defect, 21 clean), temperature 0, served as the same base weights with the LoRA hot-swapped, so only the training differs.
Table with columns: Metric (n=48), Base, Tuned| Metric (n=48) | Base | Tuned |
|---|
| Verdict accuracy | 83.3% (40/48) | 95.8% (46/48) |
| Positive recall | 81.5% (22/27) | 92.6% (25/27) |
| Negative specificity | 85.7% (18/21) | 100% (21/21) |
| Category match | 40.7% (11/27) | 63.0% (17/27) |
| Invalid JSON | 4/48 | 0/48 |
Honest read: fine-tuning mostly bought reliability and calibration, not raw bug-finding. The base model already detects most issues, but on 4 positives it emitted unparseable JSON (often a stray \Drupal\ backslash) and on 3 clean diffs it raised false alarms. The adapter always returns valid JSON, holds 100% specificity, and names categories better. The cost: it missed two low-severity O(n²) array_merge-in-loop bugs the base model caught. A full report with verbatim side-by-side outputs, covering both the wins and the losses, ships in the project repo under docs/eval/.
Limitations
QLoRA on attention projections only; tuned for diff review, not general chat. The
synthetic positives teach patterns, not every real-world manifestation. Always keep
a human in the loop for security findings.