occ-ai/OCC-RAG-1.7B API & Inference Endpoint

Highlights

Faithful by design — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
Calibrated abstention — outputs Not enough information when the context does not support an answer.
Structured, citable reasoning — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
Compact — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.

Model overview

OCC-RAG-1.7B is mid-trained from Qwen/Qwen3-1.7B-Base via supervised fine-tuning on a synthetic corpus of ~3.25M QA pairs (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.

Evaluation

Evaluated across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
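The two answer-quality metrics defined above can be sketched in a few lines. This is only an illustration: the `normalize` helper and its SQuAD-style rules (lowercasing, stripping punctuation and English articles) are assumptions, and the actual evaluation may normalize text differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the reported numbers may use other rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def in_acc(prediction: str, gold: str) -> bool:
    """In-Acc for one example: the gold answer appears as a substring of the prediction."""
    return normalize(gold) in normalize(prediction)


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the prediction and the gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(in_acc("He is buried in Canada.", "Canada"))  # True
print(token_f1("Nova Scotia, Canada", "Canada"))
```

Dataset-level scores would then be the mean of these per-example values, reported as percentages.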

Model	HotpotQA In-Acc	MuSiQue In-Acc	TAT-QA F1	ConFiQA In-Acc	ConFiQA M_R ↓	MuSiQue-Un R-Acc
gemma-3-4b-it	55.8	30.1	65.3	69.8	8.9	55.8
Qwen3-1.7B (think)	60.9	30.7	74.8	70.4	8.3	82.8
Qwen3-4B (think)	67.1	41.5	79.1	74.1	7.5	84.0
Pleias-RAG-1.2B	48.5	15.0	8.4	37.3	25.3	21.9
OCC-RAG-0.6B	57.6	36.6	75.0	79.9	5.2	86.9
OCC-RAG-1.7B	60.9	38.2	81.0	81.4	5.0	87.2

OCC-RAG-1.7B narrows the gap with Qwen3-4B (thinking) on HotpotQA and MuSiQue and surpasses it on TAT-QA (81.0 vs. 79.1 F1), while attaining the best faithfulness (highest ConFiQA In-Acc, lowest M_R) across all evaluated scales and refusal accuracy on par with 8B+ models. Mid-training reduces the memorization ratio of Qwen3-1.7B from 12.7 (8.3 in thinking mode) to 5.0.

Input / output format

The response is split into five sections, each delimited by special tokens:

Section	Tokens	Content
Query analysis	`<|query_analysis_start|> … <|query_analysis_end|>`	Decomposes the question into what must be found.
Source analysis	`<|source_analysis_start|> … <|source_analysis_end|>`	Assesses each source's relevance, citing by `<|source_id|>N`.
Reasoning	`<|reasoning_start|> … <|reasoning_end|>`	Composes evidence across sources into a multi-hop chain.
Status	`<|status_start|> … <|status_end|>`	`ANSWERABLE` / `UNANSWERABLE` verdict.
Answer	`<|answer_start|> … <|answer_end|>`	The final answer span, or the refusal phrase.
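As a rough sketch of how the five sections might be consumed downstream, the parser below splits a raw response into a dictionary and collects the cited source ids. The function name `parse_response`, the `cited_sources` key, and the sample string are illustrative assumptions; only the delimiter tokens come from the table above.

```python
import re

# Section names from the table above; each is wrapped as <|NAME_start|> ... <|NAME_end|>.
SECTIONS = ["query_analysis", "source_analysis", "reasoning", "status", "answer"]


def parse_response(response: str) -> dict:
    """Split a raw response (decoded with special tokens kept) into its sections."""
    parsed = {}
    for name in SECTIONS:
        # Tolerate a missing end token, e.g. when generation was truncated.
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        matches = re.findall(pattern, response, re.DOTALL)
        # Use the last occurrence; None marks a section that never appeared.
        parsed[name] = matches[-1].strip() if matches else None
    # Source ids cited anywhere in the trace, de-duplicated in order of first appearance.
    ids = [int(n) for n in re.findall(r"<\|source_id\|>(\d+)", response)]
    parsed["cited_sources"] = list(dict.fromkeys(ids))
    return parsed


# Hypothetical, hand-written response used only to exercise the parser.
sample = (
    "<|query_analysis_start|>Find where Bell is buried, then the country.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 gives the country.<|source_analysis_end|>"
    "<|reasoning_start|>Bell was buried in Nova Scotia <|source_id|>2, "
    "which is in Canada <|source_id|>3.<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)

result = parse_response(sample)
print(result["status"], "|", result["answer"], "|", result["cited_sources"])
```

Checking `result["status"]` before using `result["answer"]` gives callers a simple way to route abstentions separately from answers.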

Quickstart (Transformers)

The chat template accepts a documents= kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.

python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-1.7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")   # -> Canada

Note: We recommend greedy decoding (do_sample=False), which is the training and evaluation default and is already set in generation_config.json. Qwen3's default sampling parameters (see its best practices) also work fine.

Deployment

OCC-RAG-1.7B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 1.7B parameters, it can be deployed on constrained infrastructure, including desktop systems that run inference on CPU using system RAM. When serving, keep skip_special_tokens=False if you need to parse the structural tokens out of the raw output.

When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the documents= kwarg is reachable from the client via chat_template_kwargs:

python
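from openai import OpenAI

# Assumption: an OpenAI-compatible server is running locally; adjust base_url to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# question and documents are defined as in the Quickstart.
client.chat.completions.create(
    model="occ-ai/OCC-RAG-1.7B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)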
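
To combine the serving advice above (greedy decoding, documents passed through chat_template_kwargs, special tokens kept in the output), the request arguments could be assembled by a small helper such as the one below. The helper name `build_chat_request` is hypothetical, and passing `skip_special_tokens` in the request body is an assumption about the serving stack; check your server's documentation before relying on it.

```python
def build_chat_request(question: str, documents: list[dict], max_tokens: int = 2048) -> dict:
    """Build keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": "occ-ai/OCC-RAG-1.7B",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,  # greedy decoding, the recommended default
        "max_tokens": max_tokens,
        "extra_body": {
            # Forwarded to the chat template so it can emit the source tokens.
            "chat_template_kwargs": {"documents": documents},
            # Assumption: the server accepts this sampling option in the request body.
            "skip_special_tokens": False,
        },
    }


request = build_chat_request(
    "Which country is Alexander Graham Bell buried in?",
    [{"text": "Nova Scotia is a province on the east coast of Canada."}],
)
print(sorted(request))
```

With an OpenAI client in hand, the call would then be `client.chat.completions.create(**request)`, and the returned message content could be parsed for the structural tokens.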

Limitations

Context-grounded only. The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
Reasoning depth. Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.

Citation

If you find our work helpful, please cite it.

bibtex
@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}
