README

License: MIT

Highlights

  • Faithful by design — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
  • Calibrated abstention — outputs Not enough information when the context does not support an answer.
  • Structured, citable reasoning — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
  • Compact — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.

Model overview

OCC-RAG-1.7B is mid-trained from Qwen/Qwen3-1.7B-Base via supervised fine-tuning on a synthetic corpus of ~3.25M QA pairs (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
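As a quick sanity check on the corpus figures above, the following sketch sums the rounded subset sizes and prints each subset's share of the raw corpus. The counts are the approximate values quoted in the text. The oversampling factors are not stated here, so the effective training mix will differ from these raw shares.

```python
# Approximate subset sizes as quoted in the model overview (rounded values).
subset_sizes = {
    "single-hop": 2_780_000,
    "multi-hop single-context": 262_000,
    "multi-hop multi-context": 165_000,
    "abstain": 43_000,
}

total = sum(subset_sizes.values())
# Raw (pre-oversampling) share of each subset in the corpus.
shares = {name: size / total for name, size in subset_sizes.items()}

for name, share in shares.items():
    print(f"{name:26s} {share:6.1%}")
print(f"total = {total / 1e6:.2f}M")
```

Single-hop pairs make up roughly 85% of the raw corpus, which is consistent with the stated need to oversample the multi-hop and multi-context subsets.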

Evaluation

Evaluated across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
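The two answer-matching metrics defined above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the normalization step (lowercasing and stripping punctuation) and the function names are assumptions, and the official scripts may normalize differently.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace.

    This normalization is an assumption made for the sketch.
    """
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def in_acc(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized gold answer is a substring of the prediction."""
    pred = " ".join(normalize(prediction))
    target = " ".join(normalize(gold))
    return float(target in pred)


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer (bag-of-tokens overlap)."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(in_acc("He was buried in Canada.", "Canada"))  # 1.0
print(token_f1("Nova Scotia, Canada", "Canada"))     # 0.5
```

In-Acc rewards any prediction that contains the gold string, while F1 also penalizes extra tokens, which is why the second example scores 0.5 rather than 1.0.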

Model                 HotpotQA   MuSiQue   TAT-QA   ConFiQA   ConFiQA   MuSiQue-Un
                      In-Acc     In-Acc    F1       In-Acc    M_R ↓     R-Acc
gemma-3-4b-it         55.8       30.1      65.3     69.8      8.9       55.8
Qwen3-1.7B (think)    60.9       30.7      74.8     70.4      8.3       82.8
Qwen3-4B (think)      67.1       41.5      79.1     74.1      7.5       84.0
Pleias-RAG-1.2B       48.5       15.0      8.4      37.3      25.3      21.9
OCC-RAG-0.6B          57.6       36.6      75.0     79.9      5.2       86.9
OCC-RAG-1.7B          60.9       38.2      81.0     81.4      5.0       87.2

OCC-RAG-1.7B narrows the gap to Qwen3-4B (thinking) on HotpotQA and MuSiQue and surpasses it on TAT-QA, while attaining the best faithfulness (highest ConFiQA In-Acc, lowest M_R) across all evaluated scales and refusal accuracy on par with 8B+ models. Mid-training reduces the memorization ratio of Qwen3-1.7B from 12.7 (8.3 in thinking mode) to 5.0.

Input / output format

OCC-RAG uses a structured prompt format with special tokens. The question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>.

The response is split into five sections, each delimited by special tokens:

Section           Tokens                                                 Content
Query analysis    <|query_analysis_start|> … <|query_analysis_end|>      Decomposes the question into what must be found.
Source analysis   <|source_analysis_start|> … <|source_analysis_end|>    Assesses each source's relevance, citing by <|source_id|>N.
Reasoning         <|reasoning_start|> … <|reasoning_end|>                Composes evidence across sources into a multi-hop chain.
Status            <|status_start|> … <|status_end|>                      ANSWERABLE / UNANSWERABLE verdict.
Answer            <|answer_start|> … <|answer_end|>                      The final answer span, or the refusal phrase.
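The five-section layout above lends itself to a small parser. The sketch below is one possible way to split a raw generation into its sections, detect an abstention, and collect the cited source ids. The helper names (parse_response, is_abstention, cited_source_ids) are hypothetical, and the sample string is hand-written for illustration rather than real model output.

```python
import re

SECTIONS = ("query_analysis", "source_analysis", "reasoning", "status", "answer")


def parse_response(raw: str) -> dict[str, str]:
    """Split a raw generation into its delimited sections.

    If an end token is missing (e.g. truncated output), the section
    runs to the end of the string.
    """
    parsed = {}
    for name in SECTIONS:
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        match = re.search(pattern, raw, re.DOTALL)
        if match:
            parsed[name] = match.group(1).strip()
    return parsed


def is_abstention(parsed: dict[str, str]) -> bool:
    """True when the status verdict is UNANSWERABLE.

    A plain substring test for "ANSWERABLE" would also match "UNANSWERABLE",
    so the check is anchored at the start of the verdict.
    """
    return parsed.get("status", "").strip().upper().startswith("UNANSWERABLE")


def cited_source_ids(parsed: dict[str, str]) -> list[int]:
    """Distinct source ids cited in the trace, in order of first appearance."""
    trace = parsed.get("source_analysis", "") + parsed.get("reasoning", "")
    ids = [int(n) for n in re.findall(r"<\|source_id\|>\s*(\d+)", trace)]
    return list(dict.fromkeys(ids))


# Hand-written sample in the documented format (not actual model output).
raw = (
    "<|query_analysis_start|>Find where Bell is buried, then the country.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 locates it.<|source_analysis_end|>"
    "<|reasoning_start|>Bell is buried in Nova Scotia (<|source_id|>2), "
    "which is in Canada (<|source_id|>3).<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)
parsed = parse_response(raw)
print(parsed["status"], parsed["answer"], cited_source_ids(parsed))
# ANSWERABLE Canada [2, 3]
```

Checking the status section before trusting the answer section keeps refusal handling independent of the exact wording of the refusal phrase.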

Quickstart (Transformers)

The chat template accepts a documents= kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.

python

import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-1.7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

# The chat template wraps the query and sources in the structural tokens.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens and keep the special tokens for parsing.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

# Extract the last answer span; tolerate a missing end token.
m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")  # -> Canada

Note: We recommend greedy decoding (do_sample=False), which is the training and evaluation default and is set in generation_config.json. The sampling parameters recommended in Qwen3's best practices also work well.

Deployment

OCC-RAG-1.7B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 1.7B parameters, it can be deployed on constrained infrastructure, including desktop systems that run inference on CPU. When serving, keep skip_special_tokens=False if you need to parse the structural tokens out of the raw output.

When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the documents= kwarg is reachable from the client via chat_template_kwargs:

python

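from openai import OpenAI

# Assumption: an OpenAI-compatible server is already running at this URL; adjust as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# `question` and `documents` are defined as in the Quickstart above.
response = client.chat.completions.create(
    model="occ-ai/OCC-RAG-1.7B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)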

Limitations

  • Context-grounded only. The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
  • Reasoning depth. Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.

Citation

If you find our work helpful, please cite our paper.

bibtex

@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}

Model provider

occ-ai

Model tree

Base

Qwen/Qwen3-1.7B-Base

Fine-tuned

this model

Modalities

Input

Text

Output

Text
