bartek-flp

qwen3coder-30b-dcr-lora-v2

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Results: base vs v1 vs v2 (real-defect eval, n=32)

16 real CVE-grade defects (advisory fix commits, inverted so the diff reintroduces the vuln; objective ground truth) + 16 matched clean fixes. Same base weights, LoRA hot-swapped, temperature 0.

Table with columns: Metric, Base, v1, v2
Metric	Base	v1	v2
Verdict accuracy	71.9%	59.4%	78.1%
Positive recall (caught the real defect)	87.5% (14/16)	18.8% (3/16)	56.2% (9/16)
Negative specificity (quiet on clean)	56.2%	100%	100%
Category match	56.2%	—	43.8%
Invalid JSON	0/32	0/32	0/32

Honest read: v2 roughly tripled v1's real-defect recall without giving back specificity, and has the best overall verdict accuracy. It is not strictly better than base — base still out-recalls it (14/16 vs 9/16) on subtle logic bypasses, and v2's category labelling regressed. But base false-alarms on 7 of 16 clean fixes (specificity 56%), where v1 and v2 raise zero. Pick v2 for a low-false-positive pipeline; pick base if you want maximum recall and will triage the noise. Full report with verbatim side-by-side outputs (wins and losses) ships in the project repo under docs/eval/.

Training data

v1's 400 pairs + 38 real security positives (inverted SA-CORE fix commits, objective category/severity from the advisory) + matched clean negatives + 11 low-severity contrastive pairs (e.g. O(n²) array_merge-in-loop with a near-miss clean form). 498 train rows; the real-defect eval set was held out by advisory ID. Teacher for the synthetic half: Claude Opus 4.x.

Usage (with the base model)

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
m = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="bfloat16")
m = PeftModel.from_pretrained(m, "bartek-flp/qwen3coder-30b-dcr-lora-v2")

Prompt with the DCR system message (review a diff, output JSON findings only).

Limitations

QLoRA on attention projections only (q/k/v/o, r=16). Real-defect recall is 56%, with the remaining gap mostly subtle logic-level access bypasses that the base model catches but v2 does not. Category labelling is weaker than base. The eval is small (n=32) and security-skewed. Always keep a human in the loop for security findings.

Model provider

bartek-flp

Model tree

Base

Qwen/Qwen3-Coder-30B-A3B-Instruct

Adapter

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Results: base vs v1 vs v2 (real-defect eval, n=32)

16 real CVE-grade defects (advisory fix commits, inverted so the diff reintroduces the vuln; objective ground truth) + 16 matched clean fixes. Same base weights, LoRA hot-swapped, temperature 0.

Table with columns: Metric, Base, v1, v2
Metric	Base	v1	v2
Verdict accuracy	71.9%	59.4%	78.1%
Positive recall (caught the real defect)	87.5% (14/16)	18.8% (3/16)	56.2% (9/16)
Negative specificity (quiet on clean)	56.2%	100%	100%
Category match	56.2%	—	43.8%
Invalid JSON	0/32	0/32	0/32

Training data

Usage (with the base model)

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
m = AutoModelForCausalLM.from_pretrained(base, device_map="auto", torch_dtype="bfloat16")
m = PeftModel.from_pretrained(m, "bartek-flp/qwen3coder-30b-dcr-lora-v2")

Prompt with the DCR system message (review a diff, output JSON findings only).

qwen3coder-30b-dcr-lora-v2

Get help setting up a custom Dedicated Endpoints.

README

Results: base vs v1 vs v2 (real-defect eval, n=32)

Training data

Usage (with the base model)

Limitations

Explore FriendliAI today

README

Results: base vs v1 vs v2 (real-defect eval, n=32)

Training data

Usage (with the base model)

Limitations