bhxdianzhang/ParaReviewer-SFT API & Inference Endpoint

Overview

ParaReviewer-SFT is a private ParadoxGPT reviewer specialist model fine-tuned from a Qwen 4B base model for reviewer cognition modeling: explaining why reviewers raise specific concerns, identifying transferable review rules, and helping authors proactively detect similar risks before submission.

The model reads paper context plus a real review, then produces a Chinese mentor-style analysis with an explicit thinking block followed by the final answer. It is intended for internal ParadoxGPT research workflows around paper review analysis, rebuttal preparation, and research-agent training.

Intended Use

Use ParaReviewer-SFT for:

analyzing why real reviewers raised particular concerns;
extracting concern severity, repair direction, and reviewer judgment rules;
profiling recurring review risks across papers;
supporting ParadoxGPT reviewer/rebuttal workflows;
generating training or evaluation traces for reviewer-cognition agents.

This model is not intended to replace human scientific judgment. Its outputs should be checked against the original paper and review.

Training Data

The model was supervised-fine-tuned on the private dataset bhxdianzhang/ParaSFT-reviewer.

Split	Samples
Train	21,561
Dev	1,166
Test	1,356
Total	24,083

Task type: stage_a_concern_cognition.

Quality filter: each target answer contains exactly one balanced <think>...</think> reasoning block followed by the final answer.
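The quality filter above can be sketched as a small check. This is a minimal illustration, not the pipeline's actual filter: the function name `passes_think_filter` is hypothetical, and two details are assumptions (only whitespace may precede the reasoning block, and the final answer must be non-empty).

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def passes_think_filter(target: str) -> bool:
    """Return True if `target` has exactly one balanced <think>...</think>
    block followed by a non-empty final answer (hypothetical helper)."""
    # Exactly one opening tag and exactly one closing tag.
    if target.count(THINK_OPEN) != 1 or target.count(THINK_CLOSE) != 1:
        return False
    start = target.find(THINK_OPEN)
    end = target.find(THINK_CLOSE)
    # Balanced: the opening tag must come before the closing tag.
    if end < start:
        return False
    # Assumption: nothing but whitespace may precede the reasoning block.
    if target[:start].strip():
        return False
    # Assumption: the final answer after the block must not be empty.
    answer = target[end + len(THINK_CLOSE):]
    return bool(answer.strip())
```

A sample such as `"<think>...</think>final answer"` passes, while a sample with two reasoning blocks or with no text after the closing tag is rejected.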

Training Summary

Base model: Qwen 4B base model, loaded from a local checkpoint.

Field	Value
Epochs	3.0
Learning rate	3e-6
Scheduler	cosine
Distributed devices	2
Gradient accumulation	8
Total train batch size	16
Final train loss	0.8205597753
Final eval loss	0.8410579562
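The batch-size figures in the table are mutually consistent, which a few lines of arithmetic can confirm. The sketch below assumes the usual relation total batch = devices × gradient accumulation × per-device batch, and that the last partial batch of each epoch is kept. The per-device batch size is not stated in the table; it is derived here.

```python
import math

# Values taken from the training tables above.
train_samples = 21_561
num_devices = 2
grad_accum = 8
total_batch = 16
epochs = 3

# Derived, not stated in the card: per-device micro-batch size.
per_device_batch = total_batch // (num_devices * grad_accum)

# Assumption: the final partial batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_samples / total_batch)
total_steps = steps_per_epoch * epochs

print(per_device_batch, steps_per_epoch, total_steps)
```

Under these assumptions the per-device batch size comes out as 1 and the run has 1,348 optimizer steps per epoch, 4,044 in total.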

Selected validation trace:

Step	Epoch	Validation loss
800	0.5936	0.9018
1600	1.1870	0.8633
2400	1.7806	0.8434
3200	2.3740	0.8428
4000	2.9676	0.8411
4044	3.0000	0.8411
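The epoch column in the trace can be reproduced from the step number. The sketch below is an assumption about how the trainer counts fractional epochs (micro-batches consumed in the current epoch divided by micro-batches per epoch on each device), with a per-device batch size of 1 inferred from 16 / (2 × 8). The helper name `fractional_epoch` is hypothetical.

```python
import math

TRAIN_SAMPLES = 21_561
NUM_DEVICES = 2
GRAD_ACCUM = 8
PER_DEVICE_BATCH = 1  # assumption: inferred from 16 / (2 * 8)

# Micro-batches each device sees per epoch, and optimizer steps per epoch.
micro_batches_per_epoch = math.ceil(TRAIN_SAMPLES / (NUM_DEVICES * PER_DEVICE_BATCH))
steps_per_epoch = math.ceil(micro_batches_per_epoch / GRAD_ACCUM)


def fractional_epoch(step: int) -> float:
    """Map a global optimizer step to a fractional epoch (hypothetical helper)."""
    done_epochs, steps_in_epoch = divmod(step, steps_per_epoch)
    # The last step of an epoch covers a partial accumulation window, so cap at 1.
    fraction = min(steps_in_epoch * GRAD_ACCUM / micro_batches_per_epoch, 1.0)
    return done_epochs + fraction


for step in (800, 1600, 2400, 3200, 4000, 4044):
    print(step, round(fractional_epoch(step), 4))
```

With these assumptions the computed values agree with the epoch column of the trace to four decimal places, which supports reading the run as 1,348 steps per epoch.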

Qualitative Behavior

ParaReviewer-SFT is designed as a small specialist model, not a general-purpose assistant. On reviewer-cognition tasks, fine-tuning mainly improves four things: how directly the model enters the task, how closely its analysis is grounded in the actual concern, the consistency of its Chinese explanatory style, and its focus on repair-oriented analysis.

Observed qualitative pattern:

Model type	Typical behavior on the same paper-review input
Same 4B backbone before SFT	Often spends many tokens on English task planning, then gives a broad review-analysis template. It may identify the topic, but concern severity, repair direction, and proactive detection are less stable.
Larger general chat model	Usually produces fluent and helpful prose, but may summarize the review rather than model the reviewer's decision logic. It often needs stronger prompting to separate "what the reviewer said" from "why the reviewer thought this mattered."
ParaReviewer-SFT	Enters the reviewer-cognition frame directly: concern root cause, validity, severity, how to fix, transferable review rule, and how to detect the same risk without seeing the review.

This qualitative pattern suggests that, inside this narrow workflow, a 4B specialist can be more useful than a much larger general model when the goal is to turn real reviews into reusable reviewer-risk signals. It does not mean the model matches larger models on general reasoning, broad knowledge, or open-ended scientific judgment.

Example: How to Use It Well

Best prompt shape:

```text
你是顶会审稿人认知建模专家。给定一篇论文和一条真实审稿意见，请用中文解释审稿人为什么会提出这些 concern，
总结可迁移的审稿判断规则，并说明不看 review 时如何主动发现类似问题。回答要像资深导师分析审稿意见。

<TITLE>
Adaptive Evidence Routing for Retrieval-Augmented Scientific QA
</TITLE>

<ABSTRACT>
We propose Adaptive Evidence Routing, a retrieval-augmented QA framework that routes each question to a dense retriever, a symbolic citation graph, or a hybrid path. The paper claims that the routing module is the main reason for improved factuality and citation precision. Experiments report higher answer accuracy on three scientific QA benchmarks.
</ABSTRACT>

<INTRODUCTION>
Prior RAG systems often retrieve one flat evidence set for every query. We argue that scientific questions require different evidence types: definitions need canonical sources, comparison questions need related-work neighborhoods, and mechanistic questions need method sections. Our contribution is a learned router that selects the evidence path before generation.
</INTRODUCTION>

<PAPER_BODY>
The full system combines three changes: an evidence router, a citation-graph retriever, and a reranker trained on paper-review pairs. The ablation removes all three changes at once and reports a 6.2 point factuality drop. No experiment removes only the router while keeping the citation graph and reranker fixed.
</PAPER_BODY>

<FIGURE_TABLE_CAPTIONS>
Table 3: Full system vs. Base RAG. The row "w/o adaptive evidence stack" removes router, citation graph, and reranker together.
</FIGURE_TABLE_CAPTIONS>

<REVIEW>
Weakness: The paper claims the new routing module is essential, but the ablation only removes multiple components together.
Question: Can the authors isolate the contribution of the proposed module?
</REVIEW>
```

For best results, use the same context shape as training: title, abstract, introduction, selected body sections, figure/table captions, and the real review.
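One way to keep prompts in that shape is to assemble them from named sections. The sketch below is illustrative only: `build_prompt` and `SECTION_ORDER` are hypothetical names, the tag names are copied from the example prompt above, and treating every section as required is an assumption rather than a documented constraint.

```python
from typing import Mapping

# Tag order copied from the example prompt in this card.
SECTION_ORDER = [
    "TITLE",
    "ABSTRACT",
    "INTRODUCTION",
    "PAPER_BODY",
    "FIGURE_TABLE_CAPTIONS",
    "REVIEW",
]


def build_prompt(instruction: str, sections: Mapping[str, str]) -> str:
    """Join the instruction and tagged sections in the training-style order."""
    # Assumption: all six sections are required; fail loudly if one is missing.
    missing = [tag for tag in SECTION_ORDER if tag not in sections]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    parts = [instruction.strip()]
    for tag in SECTION_ORDER:
        parts.append(f"<{tag}>\n{sections[tag].strip()}\n</{tag}>")
    # Blank line between blocks, as in the example prompt.
    return "\n\n".join(parts)
```

The returned string can be passed as the `content` of a single user message.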

Expected ParaReviewer-SFT style:

```text
这条 concern 的核心不是“缺少一个表格”，而是审稿人认为论文的因果归因链条没有闭合。
作者声称新模块是关键贡献，但实验只做了组合式 ablation，无法回答“提升到底来自新模块，还是来自同时改变的训练策略/数据处理/其他组件”。

严重程度通常是 major，因为它直接影响核心贡献是否被证明。

怎么修：补一个控制变量实验，只替换或移除该模块，其他训练数据、优化设置、模型容量保持一致。
如果无法补实验，至少要在正文中降低 claim 强度，把“模块带来提升”改成“完整系统带来提升”，并说明当前实验不能隔离单个模块贡献。

可迁移规则：当论文声称 X 是关键创新时，审稿人会寻找只改变 X 的实验。
主动发现方式：检查每个核心 claim 是否都有一条对应的控制变量证据；如果证据同时改变多个因素，这个 claim 就有被质疑的风险。
```

Less useful generic-model style:

```text
The reviewer is asking for more ablation studies. The authors should add experiments and clarify the contribution.
This is a common concern in machine learning papers and can be addressed by improving the experimental section.
```

The difference is not only length or fluency. The specialist output makes the reviewer's hidden standard explicit: claim isolation under controlled evidence.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository is private, so an authenticated Hugging Face session is needed.
repo_id = "bhxdianzhang/ParaReviewer-SFT"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "你是顶会审稿人认知建模专家。给定论文上下文和真实审稿意见，请解释审稿人为什么会提出这些 concern，并总结可迁移的审稿判断规则。\n\n<TITLE>...</TITLE>\n\n<ABSTRACT>...</ABSTRACT>\n\n<INTRODUCTION>...</INTRODUCTION>\n\n<PAPER_BODY>...</PAPER_BODY>\n\n<FIGURE_TABLE_CAPTIONS>...</FIGURE_TABLE_CAPTIONS>\n\n<REVIEW>...</REVIEW>",
    }
]

# return_dict=True returns input_ids together with attention_mask,
# so both are passed on to generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# temperature only takes effect when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.2)

# Decode only the newly generated tokens; special tokens are kept in the printed text.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=False))
```
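Because the decoded text contains the thinking block followed by the final answer, downstream code will often want the two parts separately. The helper below is a minimal sketch: `split_thinking` is a hypothetical name, and the end-of-turn marker `<|im_end|>` is an assumption based on common Qwen chat templates, not something this card documents.

```python
import re
from typing import Sequence, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(
    generated: str,
    eos_markers: Sequence[str] = ("<|im_end|>",),  # assumption: Qwen-style end token
) -> Tuple[str, str]:
    """Split decoded text into (thinking, final_answer)."""
    match = THINK_RE.search(generated)
    if match is None:
        # No complete thinking block: treat everything as the answer.
        thinking, answer = "", generated
    else:
        thinking = match.group(1).strip()
        answer = generated[match.end():]
    # Drop end-of-turn markers left in because special tokens were not skipped.
    for marker in eos_markers:
        answer = answer.replace(marker, "")
    return thinking, answer.strip()
```

If the chat template already places the opening `<think>` tag in the prompt, the generated text would contain only the closing tag and the pattern would need adjusting.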

Limitations

The model is optimized for reviewer-cognition analysis, not general chat.
It may overuse mentor-style framing or fixed section patterns.
It may mis-prioritize shallow review comments such as venue-fit or typo concerns.
It should not be used as an authority on paper correctness without checking the source material.
The training data contains generated teacher reasoning, so inherited annotation errors are possible.

License and Use Restrictions

This repository is marked with license: other.

The model was trained on private ParadoxGPT SFT data derived from parsed academic papers, reviews, and teacher annotations. Rights to original papers and reviews remain with their respective authors, reviewers, venues, and publishers. This private model is provided for internal research and engineering use only. Redistribution or public release should be reviewed separately against source venue policies and applicable copyright rules.

Citation

```bibtex
@misc{zhang2026parareviewersft,
  title        = {ParaReviewer-SFT: A ParadoxGPT Reviewer Cognition Model},
  author       = {Heng Zhang},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/bhxdianzhang/ParaReviewer-SFT}},
  note         = {Private ParadoxGPT supervised fine-tuned reviewer model}
}
```

中文

概述

ParaReviewer-SFT 是 ParadoxGPT 的 Reviewer 专家模型，由 Qwen 4B 基座监督微调而来，目标是建模审稿人认知：解释审稿人为什么提出某些 concern，抽取可迁移的审稿判断规则，并帮助作者在投稿前主动发现类似风险。

模型输入论文上下文和真实 review，输出带 thinking 的中文导师式分析。它主要服务于 ParadoxGPT 内部的论文 review 分析、rebuttal 准备和研究智能体训练流程。

适用场景

ParaReviewer-SFT 适合用于：

分析真实审稿意见背后的判断逻辑；
抽取 concern 严重程度、修复方向和审稿规则；
归纳多条 review 中反复出现的风险信号；
支持 ParadoxGPT reviewer/rebuttal 工作流；
生成 reviewer-cognition agent 的训练或评测 trace。

该模型不能替代人类科研判断。输出内容应结合原论文和原 review 复核。

训练数据

模型使用私有数据集 bhxdianzhang/ParaSFT-reviewer 进行监督微调。

Split	样本数
Train	21,561
Dev	1,166
Test	1,356
Total	24,083

任务类型：stage_a_concern_cognition。

质量过滤：每条目标答案都包含且只包含一个配平的 <think>...</think> 思考块，后接最终答案。

训练摘要

基座模型：本地 Qwen 4B base model。

字段	数值
Epochs	3.0
Learning rate	3e-6
Scheduler	cosine
分布式设备数	2
Gradient accumulation	8
Total train batch size	16
Final train loss	0.8205597753
Final eval loss	0.8410579562

定性行为

ParaReviewer-SFT 是一个小规模专家模型，不是通用助手。它在 reviewer-cognition 任务上的提升主要体现在：更快进入任务、更贴近真实 concern、更稳定的中文解释风格，以及更强的修复导向分析。

定性观察：

模型类型	同一论文-review 输入上的常见表现
同一 4B backbone 的 SFT 前模型	经常把大量 token 花在英文任务规划上，然后给出泛化的 review 分析模板。它可能能识别主题，但 concern 严重程度、修复方向和主动发现方式不够稳定。
更大的通用聊天模型	通常表达流畅，也能给出有帮助的总结，但容易停留在“复述 review”，不一定会稳定建模“审稿人为什么认为这件事重要”。通常需要更强 prompt 才能区分“review 说了什么”和“审稿判断标准是什么”。
ParaReviewer-SFT	更直接进入 reviewer-cognition 框架：concern 根因、是否成立、严重程度、怎么修、可迁移审稿规则，以及不看 review 时如何主动发现同类风险。

这说明在这个窄域工作流里，一个 4B 专家模型可以比更大的通用模型更好用，尤其适合把真实 review 转成可复用的 reviewer-risk signal。但这不表示它在通用推理、广泛知识或开放式科研判断上等价于大模型。

推荐用法示例

推荐 prompt 形态：

```text
你是顶会审稿人认知建模专家。给定一篇论文和一条真实审稿意见，请用中文解释审稿人为什么会提出这些 concern，
总结可迁移的审稿判断规则，并说明不看 review 时如何主动发现类似问题。回答要像资深导师分析审稿意见。

<TITLE>
Adaptive Evidence Routing for Retrieval-Augmented Scientific QA
</TITLE>

<ABSTRACT>
We propose Adaptive Evidence Routing, a retrieval-augmented QA framework that routes each question to a dense retriever, a symbolic citation graph, or a hybrid path. The paper claims that the routing module is the main reason for improved factuality and citation precision. Experiments report higher answer accuracy on three scientific QA benchmarks.
</ABSTRACT>

<INTRODUCTION>
Prior RAG systems often retrieve one flat evidence set for every query. We argue that scientific questions require different evidence types: definitions need canonical sources, comparison questions need related-work neighborhoods, and mechanistic questions need method sections. Our contribution is a learned router that selects the evidence path before generation.
</INTRODUCTION>

<PAPER_BODY>
The full system combines three changes: an evidence router, a citation-graph retriever, and a reranker trained on paper-review pairs. The ablation removes all three changes at once and reports a 6.2 point factuality drop. No experiment removes only the router while keeping the citation graph and reranker fixed.
</PAPER_BODY>

<FIGURE_TABLE_CAPTIONS>
Table 3: Full system vs. Base RAG. The row "w/o adaptive evidence stack" removes router, citation graph, and reranker together.
</FIGURE_TABLE_CAPTIONS>

<REVIEW>
Weakness: The paper claims the new routing module is essential, but the ablation only removes multiple components together.
Question: Can the authors isolate the contribution of the proposed module?
</REVIEW>
```

为了获得最佳效果，请尽量使用与训练数据一致的上下文结构：title、abstract、introduction、正文关键片段、figure/table captions 和真实 review。

ParaReviewer-SFT 期望输出风格：

```text
这条 concern 的核心不是“缺少一个表格”，而是审稿人认为论文的因果归因链条没有闭合。
作者声称新模块是关键贡献，但实验只做了组合式 ablation，无法回答“提升到底来自新模块，还是来自同时改变的训练策略/数据处理/其他组件”。

严重程度通常是 major，因为它直接影响核心贡献是否被证明。

怎么修：补一个控制变量实验，只替换或移除该模块，其他训练数据、优化设置、模型容量保持一致。
如果无法补实验，至少要在正文中降低 claim 强度，把“模块带来提升”改成“完整系统带来提升”，并说明当前实验不能隔离单个模块贡献。

可迁移规则：当论文声称 X 是关键创新时，审稿人会寻找只改变 X 的实验。
主动发现方式：检查每个核心 claim 是否都有一条对应的控制变量证据；如果证据同时改变多个因素，这个 claim 就有被质疑的风险。
```

不够理想的通用模型输出形态：

```text
The reviewer is asking for more ablation studies. The authors should add experiments and clarify the contribution.
This is a common concern in machine learning papers and can be addressed by improving the experimental section.
```

差异不只是长度或流畅度，而是专家模型会把审稿人的隐含标准说出来：核心 claim 需要有控制变量证据来隔离贡献来源。

局限

模型针对 reviewer-cognition 分析优化，不是通用聊天模型。
输出可能有固定导师式结构或口吻。
对 venue-fit、typo 等浅层 review comment 的优先级判断仍可能不稳。
不应在未检查原论文和原 review 的情况下，把输出当成科研事实。
训练数据包含 teacher-model reasoning，可能继承标注误差。

引用

```bibtex
@misc{zhang2026parareviewersft,
  title        = {ParaReviewer-SFT: A ParadoxGPT Reviewer Cognition Model},
  author       = {Heng Zhang},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/bhxdianzhang/ParaReviewer-SFT}},
  note         = {Private ParadoxGPT supervised fine-tuned reviewer model}
}
```
