License: MIT
Highlights
- Faithful by design: answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
- Calibrated abstention: outputs `Not enough information` when the context does not support an answer.
- Structured, citable reasoning: every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
- Compact: a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.
Model overview
OCC-RAG-1.7B is mid-trained from Qwen/Qwen3-1.7B-Base via supervised fine-tuning on a synthetic corpus of ~3.25M QA pairs (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
Evaluation
Evaluated across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
| Model | HotpotQA In-Acc | MuSiQue In-Acc | TAT-QA F1 | ConFiQA In-Acc | ConFiQA M_R ↓ | MuSiQue-Un R-Acc |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
| Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
| Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
| OCC-RAG-0.6B | 57.6 | 36.6 | 75.0 | 79.9 | 5.2 | 86.9 |
| OCC-RAG-1.7B | 60.9 | 38.2 | 81.0 | 81.4 | 5.0 | 87.2 |
OCC-RAG-1.7B narrows the gap with Qwen3-4B (thinking) on multi-hop reasoning: it trails on HotpotQA (60.9 vs. 67.1) and MuSiQue (38.2 vs. 41.5) and leads on TAT-QA (81.0 vs. 79.1). It also attains the best faithfulness (highest ConFiQA In-Acc, lowest M_R) across all evaluated scales, with refusal accuracy on par with 8B+ models. Mid-training reduces the memorization ratio of Qwen3-1.7B from 12.7 (8.3 in thinking mode) to 5.0.
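The In-Acc and F1 definitions given in this section can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code behind the table: the `normalize` helper (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common SQuAD-style practice, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def in_acc(prediction: str, gold: str) -> float:
    """In-Acc: 1.0 if the (normalized) gold answer is a substring of the prediction."""
    gold_norm = normalize(gold)
    return float(bool(gold_norm) and gold_norm in normalize(prediction))


def token_f1(prediction: str, gold: str) -> float:
    """F1: harmonic mean of token-level precision and recall against the gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `in_acc("He is buried in Canada.", "Canada")` counts as a hit because the gold string appears inside the prediction, while `token_f1("Nova Scotia, Canada", "Canada")` is penalized for the two extra predicted tokens.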
Input / output format
OCC-RAG uses a structured prompt format with special tokens. The question is wrapped in <|query_start|> … <|query_end|> and each source in <|source_start|><|source_id|>N … <|source_end|>.
The response is split into five sections, each delimited by special tokens:
| Section | Tokens | Content |
|---|---|---|
| Query analysis | `<\|query_analysis_start\|>` … `<\|query_analysis_end\|>` | Decomposes the question into what must be found. |
| Source analysis | `<\|source_analysis_start\|>` … `<\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
| Reasoning | `<\|reasoning_start\|>` … `<\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
| Status | `<\|status_start\|>` … `<\|status_end\|>` | ANSWERABLE / UNANSWERABLE verdict. |
| Answer | `<\|answer_start\|>` … `<\|answer_end\|>` | The final answer span, or the refusal phrase. |
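Because every section is delimited by a start/end token pair, the raw output can be split with regular expressions. The sketch below is one possible parser, not part of the model's tooling: the helper names (`parse_response`, `cited_source_ids`, `is_abstention`) and the example string are hypothetical, and it assumes the output was decoded with special tokens preserved. A missing end token is tolerated, in case generation is cut off by the token limit.

```python
import re

SECTIONS = ("query_analysis", "source_analysis", "reasoning", "status", "answer")
REFUSAL_PHRASE = "Not enough information"


def parse_response(response: str) -> dict:
    """Return the text of each section; a section that never starts maps to None."""
    parsed = {}
    for name in SECTIONS:
        # Accept end-of-string in place of the end token to survive truncated output.
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        matches = re.findall(pattern, response, re.DOTALL)
        parsed[name] = matches[-1].strip() if matches else None
    return parsed


def cited_source_ids(parsed: dict) -> list:
    """Collect distinct source ids cited in the source analysis and reasoning sections."""
    text = " ".join(parsed.get(key) or "" for key in ("source_analysis", "reasoning"))
    return sorted({int(n) for n in re.findall(r"<\|source_id\|>\s*(\d+)", text)})


def is_abstention(parsed: dict) -> bool:
    """True if the status is UNANSWERABLE or the answer is the refusal phrase."""
    status = (parsed.get("status") or "").upper()
    answer = (parsed.get("answer") or "").lower()
    return "UNANSWERABLE" in status or REFUSAL_PHRASE.lower() in answer


# Hypothetical response, written by hand to illustrate the format.
example = (
    "<|query_analysis_start|>Find where Bell is buried.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 locates Nova Scotia.<|source_analysis_end|>"
    "<|reasoning_start|>Bell was buried in Nova Scotia (<|source_id|>2), "
    "which is in Canada (<|source_id|>3).<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)
parsed = parse_response(example)
print(parsed["answer"], cited_source_ids(parsed), is_abstention(parsed))
# -> Canada [2, 3] False
```

Checking both the status and the answer text in `is_abstention` is a defensive choice: either signal alone should indicate a refusal, and a downstream pipeline can log cases where they disagree.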
Quickstart (Transformers)
The chat template accepts a documents= kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.
```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-1.7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")  # -> Canada
```
> [!NOTE]
> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters (best practices) also work fine.
Deployment
OCC-RAG-1.7B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 1.7B parameters, it can also be deployed on constrained hardware, including desktop systems that run inference on CPU. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.
When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the documents= kwarg is reachable from the client via chat_template_kwargs:
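```python
from openai import OpenAI

# Assumed local server address; adjust base_url and api_key to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# `question` and `documents` are defined as in the Quickstart.
client.chat.completions.create(
    model="occ-ai/OCC-RAG-1.7B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)
```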
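To combine the serving advice in this section, a request can carry both the documents and the decoding settings. The helper below is a sketch under stated assumptions: `build_request` is a hypothetical name, it assumes the server accepts `skip_special_tokens` as an extra request field (vLLM's OpenAI-compatible server does; check your serving stack), and it uses `temperature=0.0` as the usual way to request greedy decoding over an OpenAI-style API.

```python
def build_request(question, documents, model="occ-ai/OCC-RAG-1.7B"):
    """Assemble keyword arguments for an OpenAI-style chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        # Assumption: temperature 0 is treated as greedy decoding by the server.
        "temperature": 0.0,
        "extra_body": {
            # Forwarded to the chat template, which emits the source tokens.
            "chat_template_kwargs": {"documents": documents},
            # Assumption: the server honors this field and keeps structural tokens.
            "skip_special_tokens": False,
        },
    }
```

With an OpenAI-compatible `client`, the call would then be `client.chat.completions.create(**build_request(question, documents))`, and the returned message content can be parsed for the `<|answer_start|>` … `<|answer_end|>` span as in the Quickstart.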
Limitations
- Context-grounded only. The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
- Reasoning depth. Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.
Citation
If you find our work helpful, please cite it.
```bibtex
@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}
```
Model provider: occ-ai
Model tree: Qwen/Qwen3-1.7B-Base (base) → this model (fine-tuned)
Modalities: text input, text output