bartek-flp/qwen3coder-30b-dcr-lora-v4 API & Inference Endpoint

Results (this session, base + v3 + v4 scored together, temperature 0)

Security set (n=32, 16 pos / 16 neg)

Metric	Base	v3	v4
Verdict accuracy	71.9% (23/32)	84.4% (27/32)	90.6% (29/32)
Positive recall	87.5% (14/16)	75.0% (12/16)	81.2% (13/16)
Negative specificity	56.2% (9/16)	93.8% (15/16)	100% (16/16)
Category match	56.2%	50.0%	56.2%
Invalid JSON	0/32	0/32	0/32
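
The Invalid JSON row presumably counts responses that could not be parsed as JSON. A minimal sketch of such a check follows, assuming each model response is available as a raw string. The function name, the requirement that a response be a JSON object, and the sample strings are all hypothetical; the adapter's actual output schema is not stated here.

```python
import json

def count_invalid_json(responses):
    """Return how many raw model responses fail to parse as a JSON object."""
    invalid = 0
    for raw in responses:
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            # Unparseable text, or a non-string value such as None.
            invalid += 1
            continue
        # Assumption: a usable review is a JSON object, not a bare list or string.
        if not isinstance(parsed, dict):
            invalid += 1
    return invalid

# Hypothetical responses for illustration only; the field name is made up.
samples = ['{"verdict": "issue"}', "not json", "[1, 2]"]
print(count_invalid_json(samples), "of", len(samples), "invalid")
```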

Non-security set (n=26, 13 pos / 13 neg)

Metric	Base	v3	v4
Verdict accuracy	65.4% (17/26)	61.5% (16/26)	76.9% (20/26)
Positive recall	69.2% (9/13)	30.8% (4/13)	61.5% (8/13)
Negative specificity	61.5% (8/13)	92.3% (12/13)	92.3% (12/13)
Category match	53.8%	23.1%	46.2%

The non-security recall jump (v3 4/13 → v4 8/13) and the non-security verdict gain (16/26 → 20/26) are each four-case moves, beyond run-to-run noise. On the security set, v4's leads over v3 in specificity (15/16 → 16/16) and verdict accuracy (27/32 → 29/32) are narrower, one and two cases respectively, but nothing regressed relative to v3 and they point the same way.
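
For readers who want to check the percentages, the sketch below recomputes the three verdict metrics from the per-class counts in the tables. It assumes "positive" means a sample containing a real defect and "negative" a clean sample; the function and variable names are illustrative and not part of any released evaluation script.

```python
def verdict_metrics(tp, fn, tn, fp):
    """Compute verdict accuracy, positive recall and negative specificity.

    tp/fn: defective samples flagged / missed.
    tn/fp: clean samples passed / wrongly flagged.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# v4 counts read off the tables (13/16 and 16/16; 8/13 and 12/13).
V4_COUNTS = {
    "security": dict(tp=13, fn=3, tn=16, fp=0),
    "non-security": dict(tp=8, fn=5, tn=12, fp=1),
}

for name, counts in V4_COUNTS.items():
    metrics = verdict_metrics(**counts)
    print(name, {key: f"{value:.1%}" for key, value in metrics.items()})
```

The recall and specificity counts add up to the reported verdict totals (13 + 16 = 29 of 32, and 8 + 12 = 20 of 26), so the three rows are mutually consistent.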

Training data

v3's 526 rows plus 41 real bug-fix positive/negative rows (mined from merged Drupal MRs, inverted, teacher-labeled). QLoRA with r=16 on the q/k/v/o projections, batch size 4 with gradient accumulation 4 and gradient checkpointing, MAX_LEN=2048, 3 epochs, lr 2e-4. Trained on one H100 for 114 steps in ~106 min.

Limitations

Real-defect recall is still ~60% on non-security and ~80% on security, so roughly two in five non-security bugs slip through. Category match is mediocre (46–56%): the model is better at "something is wrong" than at naming the kind. Raw recall is higher on the untuned base, but the base model flags about 40% of clean code (specificity 56–62%), which is why v4 is the better tool despite trading a little recall for usable specificity. Keep a human in the loop; this adapter is one component of a hybrid pipeline (static analyzers + RAG + the model).
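
The recall-for-specificity trade can be made concrete by turning the non-security counts into miss and false-alarm rates. This is only an illustrative sketch: the counts come from the results tables, while the function and dictionary names are made up for the example.

```python
def error_rates(caught, positives, cleared, negatives):
    """Return the share of defects missed and the share of clean code flagged."""
    return {
        "miss_rate": (positives - caught) / positives,
        "false_alarm_rate": (negatives - cleared) / negatives,
    }

# Non-security set counts for the untuned base and for v4.
NON_SECURITY = {
    "base": dict(caught=9, positives=13, cleared=8, negatives=13),
    "v4": dict(caught=8, positives=13, cleared=12, negatives=13),
}

RATES = {model: error_rates(**counts) for model, counts in NON_SECURITY.items()}

for model, rates in RATES.items():
    print(
        f"{model}: misses {rates['miss_rate']:.1%} of defects, "
        f"flags {rates['false_alarm_rate']:.1%} of clean samples"
    )
```

On these counts v4 misses one more defect than the base (5 versus 4 of 13) but raises four fewer false alarms (1 versus 5 of 13).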
