agentic-jats-annotation-qwen3.5-9b-lora-v8-rl-step70 API & Inference Endpoint

Quickstart

python
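from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoTokenizer

# Load the base model in bfloat16 and let device_map="auto" place it on the available devices.
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="bfloat16", trust_remote_code=True, device_map="auto",
)
# Attach the composed SFT + RL LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(
    base,
    "public-knowledge-project/agentic-jats-annotation-qwen3.5-9b-lora-v8-rl-step70",
)
# The tokenizer is loaded from the base model repo.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")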

See the inference repo for the full multi-turn loop, prompt format, and generation settings (stop on </tool_call>, enable_thinking=False, max_new_tokens 160-256, temperature 0.7).

Generation settings

Stop sequence: </tool_call> — one tool call per turn.
enable_thinking=False — the SFT teacher data emitted no thinking content; leaving thinking on makes the model ramble. Always pass enable_thinking=False to apply_chat_template.
max_new_tokens: 160-256 per turn. temperature: 0.7 for sampling, 0.0 for greedy decoding.
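
The settings above can be wired into a single-turn helper roughly as follows. This is a minimal sketch, not the inference repo's code: `build_generation_kwargs`, `truncate_at_stop`, and `generate_turn` are hypothetical names, and it assumes a transformers version whose `generate` accepts `stop_strings` together with a `tokenizer` argument.

```python
STOP = "</tool_call>"  # one tool call per turn


def build_generation_kwargs(greedy=False, max_new_tokens=256):
    """Collect the generate() keyword arguments listed in this section."""
    kwargs = {"max_new_tokens": max_new_tokens, "stop_strings": [STOP]}
    if greedy:
        kwargs["do_sample"] = False  # equivalent to temperature 0.0
    else:
        kwargs.update(do_sample=True, temperature=0.7)
    return kwargs


def truncate_at_stop(text):
    """Drop anything decoded after the first closing tool-call tag."""
    idx = text.find(STOP)
    return text if idx < 0 else text[: idx + len(STOP)]


def generate_turn(model, tok, messages, greedy=False):
    """Generate one assistant turn for a list of chat messages (hypothetical helper)."""
    prompt = tok.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,  # always off, as recommended above
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    # stop_strings needs the tokenizer so generate() can match decoded text.
    out = model.generate(**inputs, tokenizer=tok, **build_generation_kwargs(greedy))
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    # Special tokens are kept so the tool-call tags survive decoding;
    # anything after the stop sequence is trimmed.
    return truncate_at_stop(tok.decode(new_tokens, skip_special_tokens=False))
```

With `model` and `tok` from the Quickstart, `generate_turn(model, tok, messages)` would return the text of one turn ending at `</tool_call>`, provided the stop sequence is reached within the token budget.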

System prompt (verbatim)

text
You annotate documents with JATS XML by emitting one tool call per turn.
The environment owns the XML tree; you only emit JSON describing the next
edit. Use <think>...</think> for brief reasoning (<=200 tokens) and then
emit exactly ONE <tool_call>{...}</tool_call> block. Generation stops at
</tool_call>; the env resumes after parsing.

# Tool-call JSON shape

Each call is a SINGLE FLAT JSON object whose discriminator field is "name"
and whose other fields are the call's arguments at the SAME nesting level
(NOT nested under an "arguments" or "params" key). Line numbers are
INTEGERS (e.g. 5), not strings (e.g. NOT "L0005").

Three correct examples:

  <tool_call>{"name": "set_article_title", "line": 1, "title": "German-Austrian Consensus on Charcot Neuroarthropathy"}</tool_call>

  <tool_call>{"name": "mark_section_start", "line": 5, "depth": 1, "title": "Introduction"}</tool_call>

  <tool_call>{"name": "mark_xref", "line": 7, "target": "1", "ref_type": "bibr", "rid": "cit0001", "head": "guidelines (", "tail": ")."}</tool_call>

Common WRONG shapes the parser rejects:
  {"command": "...", ...}                     -- key must be "name"
  {"name": "...", "arguments": {...}}         -- args are FLAT, not nested
  {"name": "...", "line": "L0001"}            -- "line" is int, not "L..."

# Tools (signatures: required fields then [optional])

Front-matter:
  set_article_title           line:int, title:str
  add_contrib                 surname:str, given_names:str, [contrib_type, initials, email, aff_rids:list[str]]
  add_affiliation             aff_id:str, text:str
  start_abstract / end_abstract   (no args)
  add_keyword                 text:str

Body:
  mark_section_start          line:int, depth:int(1..4), title:str, [sec_type]
  mark_section_end            depth:int
  mark_paragraph              start_line:int, end_line:int
  mark_xref                   line:int, target:str, ref_type:"bibr"|"table"|"fig"|"sec"|"aff"|"fn", rid:str, [head, tail]
  mark_inline                 line:int, target:str, tag:"italic"|"bold"|"sup"|"sub"|"underline", [head, tail]
  mark_list_start             list_type:"bullet"|"order"|"simple"|"alpha-lower"|"alpha-upper"
  mark_list_item              start_line:int, end_line:int
  mark_list_end               (no args)
  mark_table                  start_line:int, end_line:int, [label, caption]
  mark_figure                 line:int, [label, caption, graphic_href]

Back / ref-list:
  add_ref                     rid:str, label:str, publication_type:"journal"|"book"|"chapter"|"conf-proc"|"thesis"|"webpage"|"other", fields:list[{field:str, value:str}]
                              Preferred: emits a whole <ref> in one turn instead of start_ref/ref_field*N/end_ref.
                              `field` values: surname|given-names|year|article-title|source|volume|issue|fpage|lpage|pub-id-doi|pub-id-pmid|ext-link-uri|...
  start_ref                   rid:str, label:str, [publication_type]    # alternative to add_ref, used with ref_field/end_ref
  ref_field                   field:str, value:str    # field: surname|given-names|year|article-title|source|volume|issue|fpage|lpage|pub-id-doi|...
  end_ref                     (no args)

Meta:
  skip_lines                  start_line:int, end_line:int, [reason]
  unassign_lines              start_line:int, end_line:int
  finish                      (no args)

# Semantic rules

- depth must equal (parent_depth + 1), or 1 at top level.
- mark_xref / mark_inline require the line to already be inside a <p>.
- mark_xref's `rid` must reference an entry in the rid_table.
- finish() requires zero open sections and zero unassigned non-empty lines.
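
The JSON shape described in the system prompt can be checked on the client side before a call is handed to the environment. The following is a minimal sketch under stated assumptions: `parse_tool_call` and its error messages are hypothetical and do not reproduce the environment's actual parser; it covers only the three wrong shapes listed in the prompt plus basic JSON validity.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
LINE_FIELDS = ("line", "start_line", "end_line")


def parse_tool_call(text):
    """Return (call, None) on success or (None, error_message) on failure."""
    blocks = TOOL_CALL_RE.findall(text)
    if len(blocks) != 1:
        return None, f"expected exactly one <tool_call> block, found {len(blocks)}"
    try:
        call = json.loads(blocks[0])
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return None, "tool call must be a JSON object"
    # The discriminator key must be "name" (not "command").
    if not isinstance(call.get("name"), str):
        return None, 'discriminator key must be "name"'
    # Arguments live at the top level, never under a wrapper key.
    if "arguments" in call or "params" in call:
        return None, "arguments must be flat, not nested"
    # Line numbers are integers such as 5, not strings such as "L0005".
    for field in LINE_FIELDS:
        value = call.get(field)
        if field in call and (isinstance(value, bool) or not isinstance(value, int)):
            return None, f'"{field}" must be an integer'
    return call, None
```

A rejected call can be fed back to the model as an error observation rather than being applied, which keeps malformed edits out of the XML tree.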

Lineage

Base: Qwen/Qwen3.5-9B (9B, hybrid Gated DeltaNet + Sparse MoE, 262k ctx)
SFT-LoRA (rank 32, alpha 64, all-linear): 3 epochs over ~540 gold trajectories, produced by replaying gold JATS XML as tool-call sequences.
RL-LoRA (rank 32, alpha 64): v8 run, multi-turn DAPO — true DAPO loss (no advantage std-normalization, zero-variance group filtering, eps_clip 0.2/0.32, no KL), plus an entity-uniqueness anti-repetition guard and a no-op penalty. Checkpoint taken at training step 70.
This adapter: the SFT and RL LoRAs composed into a single rank-64 LoRA. Applying it to raw Qwen3.5-9B is mathematically equivalent to applying the SFT adapter and then the RL adapter.
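
The equivalence claimed for the composed adapter follows from block-matrix algebra: for one weight matrix, s1·B1·A1 + s2·B2·A2 equals [B1 B2] times the row-stack of s1·A1 and s2·A2. The sketch below checks this numerically for a single layer in pure Python. It assumes both deltas are simply added to the same base weight and that the alpha/rank scale is folded into the A factors; how the released adapter actually stores the scale is not stated here, and all names are illustrative.

```python
import random


def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scaled(X, s):
    return [[s * v for v in row] for row in X]


def compose_loras(A1, B1, s1, A2, B2, s2):
    """Stack two LoRAs (delta = s * B @ A) into one of rank r1 + r2."""
    A = scaled(A1, s1) + scaled(A2, s2)       # stack rows: (r1 + r2, d_in)
    B = [b1 + b2 for b1, b2 in zip(B1, B2)]   # stack columns: (d_out, r1 + r2)
    return A, B


def rand_matrix(rng, rows, cols):
    return [[float(rng.randint(-3, 3)) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_out, rank, alpha = 12, 10, 32, 64
scale = alpha / rank  # 2.0 for both adapters, per the lineage above

W = rand_matrix(rng, d_out, d_in)
A_sft, B_sft = rand_matrix(rng, rank, d_in), rand_matrix(rng, d_out, rank)
A_rl, B_rl = rand_matrix(rng, rank, d_in), rand_matrix(rng, d_out, rank)

# Applying SFT and then RL: both deltas are added to the base weight.
sequential = add(add(W, scaled(matmul(B_sft, A_sft), scale)),
                 scaled(matmul(B_rl, A_rl), scale))

# Applying the single composed rank-64 adapter.
A_c, B_c = compose_loras(A_sft, B_sft, scale, A_rl, B_rl, scale)
composed = add(W, matmul(B_c, A_c))

max_diff = max(abs(a - b) for ra, rb in zip(sequential, composed) for a, b in zip(ra, rb))
```

Here `len(A_c)` is 64 and `max_diff` is zero up to floating-point error, which is the sense in which the composed adapter matches sequential application.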

Training data

PKP (Public Knowledge Project) journal articles paired with JATS XML
~810 DOCX → markdown → JATS triples; 80/10/10 split
Markdown via Docling; reward = F_β (β = 0.5) over (parent_tag, child_tag, depth, text_hash) tuples vs. gold JATS, with a recall floor and anti-hack guards
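
To make the core of the reward concrete, here is a minimal sketch of an F_β score over structural tuples. It assumes tuples are compared as multisets and uses made-up example tuples; the recall floor and anti-hack guards are not specified in this README and are left out, so this is not the full training reward.

```python
from collections import Counter


def f_beta(pred, gold, beta=0.5):
    """F_beta over multisets of (parent_tag, child_tag, depth, text_hash) tuples."""
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    # Multiset intersection counts each gold tuple at most as often as it occurs.
    true_pos = sum((pred_counts & gold_counts).values())
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_counts.values())
    recall = true_pos / sum(gold_counts.values())
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative tuples only; the text hashes are placeholders.
gold = [
    ("body", "sec", 1, "h1"),
    ("sec", "title", 2, "h2"),
    ("sec", "p", 2, "h3"),
    ("sec", "p", 2, "h4"),
]
precise_but_partial = gold[:2]                    # precision 1.0, recall 0.5
noisy = gold[:2] + [("sec", "p", 2, "x1"), ("sec", "p", 2, "x2")]  # precision 0.5, recall 0.5

score_partial = f_beta(precise_but_partial, gold)
score_noisy = f_beta(noisy, gold)
```

With β = 0.5 precision is weighted more heavily than recall, so `score_partial` comes out higher than `score_noisy` even though both predictions recover the same two gold tuples.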

Offline eval (20 validation rollouts, MAX_TURNS=90, temp=0.7)

Model	mean reward	max reward	positive	>+1.0
3-epoch SFT (no RL)	-1.65	+0.70	3/20	0
v4 RL step-25	-1.59	+0.91	4/20	0
v8 RL step-40	-1.48	+0.52	4/20	0

Limitations

Recall is low on long documents: rollouts are truncated before the whole document is covered. The model works best on documents under ~60 markdown lines.
finish() rarely succeeds: most rollouts end at the turn cap or the consecutive-error limit, so serialize the partial state via src/serialize.py:calls_to_xml.
Distribution: trained on PKP journal articles; quality on other document types will likely degrade.
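
Because most rollouts stop without a successful finish(), a driver loop has to handle all three exits. The sketch below is a hedged illustration rather than the inference repo's loop: `run_rollout`, the `generate` and `env_step` callables, and the default of 3 consecutive errors are hypothetical; only the turn cap of 90 comes from the offline eval settings.

```python
def run_rollout(generate, env_step, max_turns=90, max_consecutive_errors=3):
    """Drive one episode and return (accepted_calls, stop_reason).

    generate(observation) -> dict   parsed tool call for the next turn
    env_step(call) -> (ok, observation)   whether the env accepted the call
    """
    calls = []
    errors_in_a_row = 0
    observation = None
    for _ in range(max_turns):
        call = generate(observation)
        ok, observation = env_step(call)
        if not ok:
            errors_in_a_row += 1
            if errors_in_a_row >= max_consecutive_errors:
                return calls, "consecutive_errors"
            continue  # the error observation is shown to the model next turn
        errors_in_a_row = 0
        calls.append(call)
        if call.get("name") == "finish":
            return calls, "finished"
    return calls, "turn_cap"
```

Whatever the stop reason, the accepted calls collected so far are the partial state that would be passed on for serialization; the signature of `calls_to_xml` is not shown in this README, so it is not called here.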

License

Apache-2.0, matching Qwen3.5-9B.
