Highlights
- Faithful by design — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
- Calibrated abstention — outputs "Not enough information" when the context does not support an answer.
- Structured, citable reasoning — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
- Compact — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.
Model overview
OCC-RAG-0.6B is mid-trained from Qwen/Qwen3-0.6B-Base via supervised fine-tuning on a synthetic corpus of ~3.25M QA pairs (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
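As a quick sanity check on the corpus composition, the rounded subset sizes quoted above can be turned into shares of the total. This is only arithmetic on the figures in the text. It describes the mix before oversampling, because the oversampling factors are not stated here.

```python
# Approximate subset sizes quoted in the text, in thousands of QA pairs.
subset_sizes_k = {
    "single-hop": 2780,
    "multi-hop single-context": 262,
    "multi-hop multi-context": 165,
    "abstain": 43,
}

total_k = sum(subset_sizes_k.values())

# Share of each subset before any oversampling is applied.
shares = {name: size / total_k for name, size in subset_sizes_k.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: ~{total_k / 1000:.2f}M pairs")
```

The rounded subset sizes add up to the quoted ~3.25M total, and single-hop examples make up the large majority of the raw corpus, which is consistent with the stated need to oversample the multi-hop subsets.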
Evaluation
The model is evaluated on multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
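The two string-matching metrics defined above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported numbers: the normalization step (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA evaluation practice, and the function names `normalize`, `in_acc`, and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def in_acc(prediction: str, gold: str) -> bool:
    """In-Acc for one example: the gold answer appears as a substring of the prediction."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(prediction)


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the prediction and the gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores would then be the mean of these per-example values. In-Acc is lenient toward verbose predictions, while F1 penalizes extra tokens through precision.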
| Model | HotpotQA In-Acc | MuSiQue In-Acc | TAT-QA F1 | ConFiQA In-Acc | ConFiQA M_R ↓ | MuSiQue-Un R-Acc |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | |
OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B on every dimension and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.
OCC-RAG uses a structured prompt format with special tokens. The question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>.
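To make the token layout concrete, here is a minimal sketch that wraps a question and a list of source texts in the structural tokens named above. It is illustrative only: the helper name `render_prompt_body` is hypothetical, and the numbering from 1, the space after the id, and the newline separators are assumptions. For real inference, rely on the tokenizer's chat template rather than hand-built strings.

```python
def render_prompt_body(question: str, sources: list[str]) -> str:
    """Wrap the question and each source in the structural tokens.

    Assumptions (not confirmed by the model card): source ids start at 1,
    a single space separates the id from the source text, and the parts
    are joined with newlines.
    """
    parts = [f"<|query_start|>{question}<|query_end|>"]
    for source_id, text in enumerate(sources, start=1):
        parts.append(
            f"<|source_start|><|source_id|>{source_id} {text}<|source_end|>"
        )
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        render_prompt_body(
            "Which country is Nova Scotia in?",
            ["Nova Scotia is a province on the east coast of Canada."],
        )
    )
```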
The response is split into five sections, each delimited by special tokens:
| Section | Tokens | Content |
|---|---|---|
| Query analysis | <|query_analysis_start|> … <|query_analysis_end|> | Decomposes the question into what must be found. |
| Source analysis | <|source_analysis_start|> … <|source_analysis_end|> | Assesses each source's relevance, citing by <|source_id|>N. |
| Reasoning | <|reasoning_start|> … <|reasoning_end|> | Composes evidence across sources into a multi-hop chain. |
| Status | | |
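Because every section is delimited by a matching pair of `_start` and `_end` tokens, a raw response can be split generically without hard-coding each section name. The sketch below assumes that all sections follow this naming pattern and that an abstention shows up as the phrase "Not enough information" inside the answer section. The helper names and the sample response are hypothetical.

```python
import re

# Matches <|NAME_start|> ... <|NAME_end|> for any section name.
SECTION_RE = re.compile(r"<\|(\w+)_start\|>(.*?)<\|\1_end\|>", re.DOTALL)
# Matches source citations such as <|source_id|>2.
SOURCE_ID_RE = re.compile(r"<\|source_id\|>\s*(\d+)")


def parse_sections(response: str) -> dict[str, str]:
    """Map each section name to its stripped body text."""
    return {name: body.strip() for name, body in SECTION_RE.findall(response)}


def cited_source_ids(text: str) -> list[int]:
    """Return the cited source ids in order of first appearance."""
    seen: list[int] = []
    for raw in SOURCE_ID_RE.findall(text):
        source_id = int(raw)
        if source_id not in seen:
            seen.append(source_id)
    return seen


def is_abstention(sections: dict[str, str]) -> bool:
    """Assumed check: the answer section carries the abstention phrase."""
    return "not enough information" in sections.get("answer", "").lower()


# Hand-written sample for illustration, not real model output.
sample = (
    "<|query_analysis_start|>Find the country of burial.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 maps it to a country.<|source_analysis_end|>"
    "<|reasoning_start|>Baddeck is in Nova Scotia, which is in Canada."
    "<|reasoning_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)
sections = parse_sections(sample)
print(sections["answer"], cited_source_ids(sections["source_analysis"]))
```

A response truncated before a closing token would leave that section out of the result, so callers that need robustness to cut-off generations may want a more lenient pattern for the final section.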
The chat template accepts a documents= kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-0.6B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

# The chat template wraps the question and sources in the structural tokens.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens and keep the structural tokens for parsing.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

# Take the last answer section; tolerate a missing closing token if generation was cut off.
m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")
Note: We recommend greedy decoding (do_sample=False), which is the training and evaluation default and is already set in generation_config.json. Qwen3's default sampling parameters (best practices) also work.
Deployment
OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be deployed on constrained infrastructure, including CPU-only desktop systems. When serving, keep skip_special_tokens=False if you need to parse the structural tokens out of the raw output.
When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the documents= kwarg is reachable from the client via chat_template_kwargs:
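from openai import OpenAI

# Assumed local endpoint; adjust base_url and api_key to match your server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="occ-ai/OCC-RAG-0.6B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)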
Limitations
- Context-grounded only. The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
- Reasoning depth. Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.
Citation
If you find our work helpful, please consider citing it.
@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}