coordinator-qwen3-14b-qlora-grounded API & Inference Endpoint

Training

Data: 98 RAFT-style self-distilled examples built from the train half of a leakage-disjoint split of Legal RAG Bench (oracle passages + question -> quote-first answers). Candidates were filtered by the team's DeBERTa verifier, keeping the top 3 by composite score.
Recipe: QLoRA r=32 on 4-bit NF4 Qwen3-14B; 128.4M trainable params (0.86%); 2 epochs, effective batch 7, 14 optimizer steps; final train loss 0.362. Single seed, single run.
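
The trainable-parameter figure can be sanity-checked with a little arithmetic. The sketch below assumes that LoRA adapters were attached to all seven linear projections in every decoder layer and uses layer shapes taken from the public Qwen3-14B configuration; neither the target modules nor the shapes are stated in this card, so treat both as assumptions.

```python
# Assumed Qwen3-14B shapes (public config values, not stated in this card).
HIDDEN = 5120
INTERMEDIATE = 17408
LAYERS = 40
Q_OUT = 40 * 128   # attention heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim
RANK = 32          # LoRA rank from the recipe above

# (in_features, out_features) for each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in) and B (out x rank) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 128,450,560
```

Under these assumptions the count comes to about 128.45M, in line with the 128.4M quoted in the recipe.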

Evaluation (component-level, 50 held-out items, verifier-reranked best-of-8)

axis	base	hybrid	grounded (this)
strict verbatim grounding	0.824	0.819	0.962
content verbatim	0.987	0.980	0.995
citation validity	0.933	0.911	1.000
gold ROUGE-L >= 0.30	0.511	–	0.587
false refusals (oracle present)	0.10	0.10	0.08
strict verbatim under stress (K=4 distractors + 30% oracle drop)	0.689	0.661	0.984
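
The difference between the "strict verbatim" and "content verbatim" rows can be illustrated with a small check. This is a minimal sketch, assuming that strict means an exact substring match and that content means a match after folding cosmetic differences (case, punctuation, whitespace); the card does not define the metrics precisely, and the function names and normalization rules here are hypothetical.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold cosmetic differences: Unicode form, case, punctuation, whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace runs


def strict_verbatim(quote: str, passage: str) -> bool:
    """Assumed definition: the quote appears character-for-character."""
    return bool(quote) and quote in passage


def content_verbatim(quote: str, passage: str) -> bool:
    """Assumed definition: the quote appears once cosmetic changes are ignored."""
    norm = normalize(quote)
    return bool(norm) and norm in normalize(passage)


passage = "The tenant must give 30 days' written notice before vacating."
exact = "30 days' written notice"
near_miss = "30 days written  notice"   # dropped apostrophe, doubled space

print(strict_verbatim(exact, passage), content_verbatim(exact, passage))
print(strict_verbatim(near_miss, passage), content_verbatim(near_miss, passage))
```

A quote that passes the content check but fails the strict check is what this card calls a near miss.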

Mechanism: the fine-tune eliminates the near-miss failure class, in which a quote is cosmetically altered. The content-minus-strict gap falls from 0.163 to 0.033 on clean inputs and from 0.228 to 0.000 under stress. Verifier support is flat (0.838 -> 0.820), so the entailment channel was never the deficit. Training is worth roughly one doubling of inference compute (grounded N=1 composite 0.558 ~= base N=2 0.559).
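
The clean-condition gaps quoted above follow directly from the "strict verbatim grounding" and "content verbatim" rows of the evaluation table. A short check, using only values from that table (the dictionary layout is just for illustration):

```python
# Clean-condition scores copied from the evaluation table.
scores = {
    "base": {"strict": 0.824, "content": 0.987},
    "hybrid": {"strict": 0.819, "content": 0.980},
    "grounded": {"strict": 0.962, "content": 0.995},
}

# Near-miss rate: quotes that match in content but not character-for-character.
gaps = {
    name: round(row["content"] - row["strict"], 3)
    for name, row in scores.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.3f}")
# base: 0.163
# hybrid: 0.161
# grounded: 0.033
```

The stress-condition gaps cannot be recomputed the same way, because the table reports only the strict score under stress.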

Intended use & limitations

Format-locked: improvements hold under the EVIDENCE/ANSWER quote-first contract; under a free-form prompt, behavior matches base (support 0.724 vs 0.720). Use with the contract, with documents presented as [doc_id] text.
Component-level claims only: not yet integrated end-to-end; integration requires format-aware parsing and a language-matched prompt.
Small sample: n=50 evaluation items and a single training seed. The residual failure mode under stress is citation-index slips on correct quotes (verbatim 0.984 vs attribution 0.959).
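
As a rough illustration of what presenting documents as [doc_id] text and format-aware parsing might look like, the sketch below builds a prompt, parses an EVIDENCE/ANSWER response, and flags quotes that do not appear in the document they cite. The exact wording of the contract, the section markers, the quoted-evidence line layout, and all function names are assumptions made for this example; the card does not specify them.

```python
import re


def build_prompt(docs: dict, question: str) -> str:
    """Render documents as '[doc_id] text' lines, then the question (assumed layout)."""
    doc_block = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs.items())
    return (
        f"{doc_block}\n\n"
        f"Question: {question}\n\n"
        "Respond with an EVIDENCE section of verbatim quotes, each prefixed "
        "by its [doc_id], followed by an ANSWER section."
    )


# Hypothetical evidence line format: [doc_id] "quoted text"
EVIDENCE_LINE = re.compile(r'^\[(?P<doc_id>[^\]]+)\]\s*"(?P<quote>.+)"\s*$')


def parse_response(text: str):
    """Split a response into (doc_id, quote) pairs and the answer; None if malformed."""
    match = re.search(r"EVIDENCE:\s*(.*?)\s*ANSWER:\s*(.*)", text, re.DOTALL)
    if match is None:
        return None
    evidence = []
    for line in match.group(1).splitlines():
        m = EVIDENCE_LINE.match(line.strip())
        if m:
            evidence.append((m.group("doc_id"), m.group("quote")))
    return {"evidence": evidence, "answer": match.group(2).strip()}


def citation_errors(parsed: dict, docs: dict) -> list:
    """Return evidence pairs whose quote is not found verbatim in the cited doc."""
    return [(d, q) for d, q in parsed["evidence"] if q not in docs.get(d, "")]


docs = {"d1": "Notice must be given in writing.", "d2": "Rent is due monthly."}
response = (
    'EVIDENCE:\n[d2] "Notice must be given in writing."\n'
    "ANSWER: Notice has to be written."
)
parsed = parse_response(response)
print(citation_errors(parsed, docs))
```

The example response quotes d1 correctly but cites d2, which is the kind of citation-index slip described above: the quote is verbatim, yet the attribution check fails.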

Siblings

adaptor-marianmt-fr-en · safeguard-deberta-ragtruth-v1 · coordinator-qwen3-14b-qlora-hybrid

