useitone

OCC-RAG-0.6B

Highlights

- Faithful by design: answers only from the supplied context and achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
- Calibrated abstention: outputs `Not enough information` when the context does not support an answer.
- Structured, citable reasoning: every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
- Compact: a small model that delivers chain-of-thought-level transparency at a fraction of the cost of full thinking-mode inference.

Model overview

OCC-RAG-0.6B is mid-trained from Qwen/Qwen3-0.6B-Base via supervised fine-tuning on a synthetic corpus of ~3.25M QA pairs (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
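As a quick sanity check on the corpus figures above, the sketch below adds up the quoted subset sizes and prints each subset's raw share. The sizes are the approximate values from the paragraph; the oversampling factors are not stated in this card, so the shares shown are before any oversampling.

```python
# Approximate subset sizes as quoted in the model overview.
subset_sizes = {
    "single-hop": 2_780_000,
    "multi-hop single-context": 262_000,
    "multi-hop multi-context": 165_000,
    "abstain": 43_000,
}

total = sum(subset_sizes.values())
# Raw share of each subset, before the oversampling mentioned in the text.
shares = {name: size / total for name, size in subset_sizes.items()}

for name, share in shares.items():
    print(f"{name:<26} {share:6.1%}")
print(f"{'total':<26} {total:,}")
```

The multi-hop subsets together make up roughly 13% of the raw pairs, which is consistent with the stated need to oversample them.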

Evaluation

The model is evaluated on multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc counts a prediction as correct when the gold answer appears in it as a substring; F1 is the token-level overlap between prediction and gold answer; M_R is the memorization ratio (lower is more faithful); R-Acc is refusal accuracy.
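The two answer-matching metrics can be sketched per example as follows. This is a minimal illustration, not the evaluation code behind the reported numbers: the SQuAD-style normalization (lowercasing, stripping punctuation and English articles) is an assumption, and the official scripts for each benchmark may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def in_acc(prediction: str, gold: str) -> bool:
    """In-Acc for one example: the gold answer is a substring of the prediction."""
    gold_norm = normalize(gold)
    # An empty gold answer would trivially match everything, so reject it.
    return bool(gold_norm) and gold_norm in normalize(prediction)


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer for one example."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(in_acc("He is buried in Canada.", "Canada"))
print(token_f1("Nova Scotia, Canada", "Canada"))
```

Dataset-level scores are the mean of these per-example values.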

| Model | HotpotQA In-Acc | MuSiQue In-Acc | TAT-QA F1 | ConFiQA In-Acc | ConFiQA M_R ↓ | MuSiQue-Un R-Acc |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | |

OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B on every dimension and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.

Input / output format

The response is split into five sections, each delimited by special tokens:

| Section | Tokens | Content |
|---|---|---|
| Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
| Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
| Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
| Status | | |
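Because each section is wrapped in its own start and end token, a raw response can be split with a few regular expressions. The sketch below is one way to do that under stated assumptions: it covers only the delimiters that appear in this card (the three sections listed above plus the `<|answer_start|>` / `<|answer_end|>` pair used in the quickstart), it leaves the status section out because its tokens are not shown here, and the `sample` string is hand-written for illustration rather than real model output.

```python
import re

# Section names whose delimiters appear in this card. The status delimiters are
# not shown, so that section is deliberately left out.
SECTIONS = ("query_analysis", "source_analysis", "reasoning", "answer")


def parse_trace(response: str) -> dict[str, str]:
    """Return the text of each delimited section found in the raw response."""
    parsed = {}
    for name in SECTIONS:
        start = re.escape(f"<|{name}_start|>")
        end = re.escape(f"<|{name}_end|>")
        # Tolerate a missing end token, e.g. when generation hits max_new_tokens.
        match = re.search(rf"{start}(.*?)(?:{end}|\Z)", response, re.DOTALL)
        if match:
            parsed[name] = match.group(1).strip()
    return parsed


def cited_source_ids(text: str) -> list[int]:
    """Collect distinct source ids cited as <|source_id|>N, in order of first use."""
    ids = [int(n) for n in re.findall(r"<\|source_id\|>(\d+)", text)]
    return list(dict.fromkeys(ids))


# Hand-written example trace (illustrative only).
sample = (
    "<|query_analysis_start|>Find the burial country.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 maps it to a country.<|source_analysis_end|>"
    "<|reasoning_start|>Baddeck is in Nova Scotia (<|source_id|>2), "
    "which is in Canada (<|source_id|>3).<|reasoning_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)

trace = parse_trace(sample)
print(trace["answer"])
print(cited_source_ids(trace["reasoning"]))
```

Parsing only works on output decoded with the special tokens preserved, so the decoding step must not strip them.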

Quickstart (Transformers)

The chat template accepts a `documents=` keyword argument and emits the structural tokens for the query and sources automatically. Pass the user message as plain text and the sources as a list of dicts, each with a `text` key.

```python
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

# The chat template wraps the question and each document in the structural tokens.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens and keep the special tokens so the sections can be parsed.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

# Take the last answer span; the end token is optional in case generation was cut off.
m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")   # -> Canada
```

> [!NOTE]
> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is set in `generation_config.json`. Qwen3's default sampling parameters (see its best-practices guidance) also work.
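Once the answer has been extracted as in the quickstart, callers will usually want to branch on abstention. The helper below is a small sketch of that check. It assumes the abstention text is exactly the `Not enough information` string quoted in Highlights, and it ignores case and trailing punctuation as a precaution; `is_abstention`, `answer_or_fallback`, and the fallback message are hypothetical names for illustration, not part of the model's tooling.

```python
ABSTAIN_TEXT = "Not enough information"  # abstention string quoted in Highlights


def is_abstention(answer: str) -> bool:
    """True when the extracted answer is the abstention string (case and trailing punctuation ignored)."""
    return answer.strip().rstrip(".!").casefold() == ABSTAIN_TEXT.casefold()


def answer_or_fallback(answer: str, fallback: str = "No supported answer in the provided sources.") -> str:
    """Return the answer, or a caller-chosen fallback when the model abstained or produced nothing."""
    if not answer.strip() or is_abstention(answer):
        return fallback
    return answer.strip()


print(answer_or_fallback("Canada"))
print(answer_or_fallback("Not enough information."))
```

An application might use the abstention branch to trigger another retrieval round instead of showing the fallback text.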

Deployment

OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be deployed on constrained infrastructure, including desktop machines that run inference on CPU from system RAM. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.

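When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:

```python
from openai import OpenAI

# Assumed local server address; `question` and `documents` come from the quickstart.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="occ-ai/OCC-RAG-0.6B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)
```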
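For clients that do not use the OpenAI Python package, it can help to see the JSON body such a request carries. The sketch below builds it with the standard library only and sends nothing over the network. The OpenAI client merges `extra_body` keys into the top level of the request, which is why `chat_template_kwargs` sits next to `messages`. The `temperature` and `skip_special_tokens` fields are assumptions about what the serving stack accepts at request level, and `build_request_body` is a hypothetical helper name; check your server's documentation before relying on them.

```python
import json


def build_request_body(question: str, documents: list[dict]) -> str:
    """Serialize a chat-completions request that forwards `documents` to the chat template."""
    payload = {
        "model": "occ-ai/OCC-RAG-0.6B",
        "messages": [{"role": "user", "content": question}],
        # Top-level placement mirrors what the OpenAI client does with extra_body.
        "chat_template_kwargs": {"documents": documents},
        # Assumed server-specific fields; verify them for your serving stack.
        "temperature": 0.0,            # greedy decoding, as recommended for this model
        "skip_special_tokens": False,  # keep structural tokens in the returned text
    }
    return json.dumps(payload, ensure_ascii=False)


body = build_request_body(
    "Which country is Alexander Graham Bell buried in?",
    [
        {"text": "Bell was buried near Baddeck, Nova Scotia."},
        {"text": "Nova Scotia is a province on the east coast of Canada."},
    ],
)
print(body)
```

The resulting string would be sent as the POST body to the server's `/v1/chat/completions` route.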

Limitations

Context-grounded only. The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
Reasoning depth. Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.

Citation

If you find our work helpful, please cite it:

```bibtex
@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}
```

Model provider: useitone

Model tree: base model Qwen/Qwen3-0.6B-Base; fine-tuned: this model

Modalities: text input, text output
