Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Results: base vs v1 vs v2 (real-defect eval, n=32)
16 real CVE-grade defects (advisory fix commits, inverted so the diff reintroduces the vuln; objective ground truth) + 16 matched clean fixes. Same base weights, LoRA hot-swapped, temperature 0.
| Metric | Base | v1 | v2 |
|---|---|---|---|
| Verdict accuracy | 71.9% | 59.4% | 78.1% |
| Positive recall (caught the real defect) | 87.5% (14/16) | 18.8% (3/16) | 56.2% (9/16) |
| Negative specificity (quiet on clean) | 56.2% | 100% | 100% |
| Category match | 56.2% | — | 43.8% |
| Invalid JSON | 0/32 | 0/32 | 0/32 |
Honest read: v2 roughly tripled v1's real-defect recall without giving back specificity, and has the
best overall verdict accuracy. It is not strictly better than base — base still out-recalls it
(14/16 vs 9/16) on subtle logic bypasses, and v2's category labelling regressed. But base false-alarms
on 7 of 16 clean fixes (specificity 56%), where v1 and v2 raise zero. Pick v2 for a low-false-positive
pipeline; pick base if you want maximum recall and will triage the noise. Full report with verbatim
side-by-side outputs (wins and losses) ships in the project repo under docs/eval/.
Training data
v1's 400 pairs + 38 real security positives (inverted SA-CORE fix commits, objective category/severity from the advisory) + matched clean negatives + 11 low-severity contrastive pairs (e.g. O(n²) array_merge-in-loop with a near-miss clean form). 498 train rows; the real-defect eval set was held out by advisory ID. Teacher for the synthetic half: Claude Opus 4.x.
Usage (with the base model)
python
from transformers import AutoModelForCausalLM, AutoTokenizerfrom peft import PeftModelbase = "Qwen/Qwen3-Coder-30B-A3B-Instruct"tok = AutoTokenizer.from_pretrained(base)m = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="bfloat16")m = PeftModel.from_pretrained(m, "bartek-flp/qwen3coder-30b-dcr-lora-v2")
Prompt with the DCR system message (review a diff, output JSON findings only).
Limitations
QLoRA on attention projections only (q/k/v/o, r=16). Real-defect recall is 56%, with the remaining gap mostly subtle logic-level access bypasses that the base model catches but v2 does not. Category labelling is weaker than base. The eval is small (n=32) and security-skewed. Always keep a human in the loop for security findings.
Model provider
bartek-flp
Model tree
Base
Qwen/Qwen3-Coder-30B-A3B-Instruct
Adapter
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information