bhxdianzhang
ParaChecker-SFT
License: other
Overview
ParaChecker-SFT is a private ParadoxGPT Checker specialist model fine-tuned from a Qwen 4B base model. It is designed for research-paper workflows rather than general chat.
Intended Use
Use ParaChecker-SFT for:
- paper-level scientific writing and review-support workflows;
- converting paper context into structured reasoning traces;
- internal ParadoxGPT research-agent training and evaluation;
- assisting human researchers with evidence, argumentation, and risk analysis.
Outputs should be checked against the original paper context. This model is not a substitute for human scientific judgment.
Training Data
The model was trained with supervised fine-tuning (SFT) on the private dataset bhxdianzhang/ParaSFT-checker.
| Split | Samples |
|---|---|
| Train | 35,272 |
| Dev | 2,055 |
| Test | 2,288 |
| Total | 39,615 |
Task distribution:
| Task type | Samples |
|---|---|
| K1_commitment_extraction | 7,942 |
| K2_claim_support_verification | 7,939 |
| K3_cross_section_consistency | 7,911 |
| K4_inconsistency_detection | 7,904 |
| K5_minimal_fix | 7,919 |
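The split table and the task table describe the same samples, so their totals should agree. A small sanity check, with the counts copied from the two tables (the variable names are illustrative only):

```python
# Sample counts copied from the split table and the task-distribution table.
splits = {"train": 35_272, "dev": 2_055, "test": 2_288}
tasks = {
    "K1_commitment_extraction": 7_942,
    "K2_claim_support_verification": 7_939,
    "K3_cross_section_consistency": 7_911,
    "K4_inconsistency_detection": 7_904,
    "K5_minimal_fix": 7_919,
}

total = sum(splits.values())
# Both views of the dataset should add up to the reported total.
assert total == sum(tasks.values()) == 39_615

# Percentage of samples in each split, rounded to one decimal place.
split_share = {name: round(100 * n / total, 1) for name, n in splits.items()}
print(split_share)
```

The five task types are close to evenly balanced, each contributing roughly a fifth of the samples.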
Quality filter: each target answer contains exactly one balanced <think>...</think> reasoning block followed by the final answer.
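The filter described above could be sketched as follows. This is not the pipeline's actual code: the function name is hypothetical, and treating a whitespace-only final answer as a failure is an assumption.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def passes_quality_filter(target: str) -> bool:
    """Return True if `target` holds exactly one balanced <think>...</think>
    block followed by a non-empty final answer (assumed requirement)."""
    # Exactly one opening tag and exactly one closing tag.
    if target.count(THINK_OPEN) != 1 or target.count(THINK_CLOSE) != 1:
        return False
    # The opening tag must come before the closing tag.
    if target.find(THINK_CLOSE) < target.find(THINK_OPEN):
        return False
    # Everything after the closing tag is treated as the final answer.
    final_answer = target.split(THINK_CLOSE, 1)[1]
    return bool(final_answer.strip())
```

For example, `passes_quality_filter("<think>reasoning</think>final answer")` is accepted, while a target with two think blocks or an unclosed block is rejected.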
Training Summary
Base model: a local copy of a Qwen 4B base model.
| Field | Value |
|---|---|
| Epochs | 3.0 |
| Learning rate | 3e-6 |
| Scheduler | cosine |
| Distributed devices | 2 |
| Gradient accumulation | 8 |
| Total train batch size | 16 |
| Final train loss | 0.8032357919 |
| Final eval loss | 0.8563711047 |
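The batch-size rows are related by the usual data-parallel identity: total batch size = per-device batch size × devices × gradient-accumulation steps. The sketch below derives the implied per-device batch size and a rough optimizer-step count. The step estimate assumes each training sample is seen once per epoch and that a final partial batch still counts as a step; the actual trainer may round differently.

```python
import math

# Values copied from the training summary and the split table.
devices = 2
grad_accum = 8
total_batch = 16
train_samples = 35_272
epochs = 3

# total_batch = per_device_batch * devices * grad_accum
per_device_batch = total_batch // (devices * grad_accum)

# Rough estimate of optimizer steps (assumption: last partial batch is kept).
steps_per_epoch = math.ceil(train_samples / total_batch)
total_steps = steps_per_epoch * epochs

print(per_device_batch, steps_per_epoch, total_steps)
```

Under these assumptions the per-device batch size is 1, with about 2,205 optimizer steps per epoch.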
Qualitative Behavior
Local side-by-side checks show clear gains on commitment extraction (K1) and claim-support verification (K2), and useful behavior on cross-section consistency (K3) and minimal-fix (K5) tasks. The main residual issue is K4 inconsistency detection: the model finds useful signals but may over-expand and hit the token limit. For production use, stricter decoding or later preference/RL tuning is recommended.
ParaChecker-SFT is best used with rich paper context matching the training shape, such as title, abstract, introduction, selected body sections, figure/table captions, claims, reviews, or task-specific paper context depending on the skill.
Example Prompt Shape
text
你是顶会论文分析专家。给定论文上下文,请按指定任务做中文分析。
论文标题: ...
摘要: ...
引言: ...
正文关键片段: ...
图表/实验/claim 信息: ...
任务: ...
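One way to assemble a prompt in this shape is sketched below. The `PaperContext` class and `build_prompt` function are hypothetical helpers, not part of the model's tooling, and joining fields with newlines is an assumption about the original layout. The instruction line and field labels are copied from the prompt shape above.

```python
from dataclasses import dataclass

# Instruction line copied verbatim from the prompt shape.
SYSTEM_LINE = "你是顶会论文分析专家。给定论文上下文,请按指定任务做中文分析。"


@dataclass
class PaperContext:
    """Hypothetical container for the fields named in the prompt shape."""
    title: str
    abstract: str
    introduction: str
    body_excerpts: str
    evidence: str  # figures, tables, experiments, claim information
    task: str


def build_prompt(ctx: PaperContext) -> str:
    """Render the context as one labeled field per line (assumed layout)."""
    fields = [
        ("论文标题", ctx.title),
        ("摘要", ctx.abstract),
        ("引言", ctx.introduction),
        ("正文关键片段", ctx.body_excerpts),
        ("图表/实验/claim 信息", ctx.evidence),
        ("任务", ctx.task),
    ]
    lines = [SYSTEM_LINE] + [f"{label}: {value}" for label, value in fields]
    return "\n".join(lines)
```

Keeping the labels in one list makes it easy to drop or reorder fields per skill, since the required context depends on the task.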
Concrete Example
Input:
text
你是顶会论文论证闭环检查专家。给定论文完整上下文,请检验每个核心 claim 是否被方法和实验充分支撑。
论文标题: Adaptive Evidence Routing for Retrieval-Augmented Scientific QA
摘要: The paper claims that an adaptive evidence router improves factuality and citation precision in scientific QA.
引言: The motivation is that different question types need different evidence sources.
方法: The system adds three components over Base RAG: an evidence router, a citation-graph retriever, and a reranker.
实验: Table 3 reports Full System vs. Base RAG. The only ablation, "w/o adaptive evidence stack", removes all three components together.
Conclusion: The routing module is the key driver of factuality gains.
Expected ParaChecker-SFT-style output:
text
这里最大的论证缺口是 conclusion 把系统级提升归因给 router,但实验没有隔离 router。

Commitment 1: router improves factuality.
支撑强度:weak。
Table 3 只能证明 full evidence stack 比 Base RAG 好,不能证明 router 单独有效。因为 ablation 同时移除了 router、citation graph 和 reranker。

Commitment 2: citation graph improves citation precision.
支撑强度:unclear。
论文报告了 citation precision 提升,但没有单独加入 citation graph 的对照,因此无法排除 reranker 或 router 带来的影响。

最小修复:增加 controlled ablation:Base RAG、+citation graph、+reranker、+router,并保证训练数据、模型大小、解码设置一致。
如果来不及补实验,至少把 conclusion 改成“the full evidence stack improves factuality”,不要说 router 是 key driver。
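If downstream code needs the per-commitment verdicts, output in the style above can be parsed with a regular expression. This is a sketch based only on the single example shown: the `Commitment N:` and `支撑强度` markers are taken from it, while the helper names and the assumption that every output follows this layout are not confirmed by the model card.

```python
import re
from typing import List, NamedTuple


class Commitment(NamedTuple):
    index: int
    claim: str
    support: str  # e.g. "weak" or "unclear", as seen in the example


# Accepts either a full-width or an ASCII colon after the support label.
_PATTERN = re.compile(
    r"Commitment\s+(\d+):\s*(.+?)\s*支撑强度[::]\s*([A-Za-z]+)",
    re.S,
)


def extract_commitments(output: str) -> List[Commitment]:
    """Pull (index, claim, support strength) triples out of model output."""
    return [
        Commitment(int(idx), claim.strip(), support.lower())
        for idx, claim, support in _PATTERN.findall(output)
    ]
```

Outputs that deviate from this layout yield an empty or partial list, so callers should treat a missing commitment as a parse failure rather than as a verdict.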
Limitations
- The model is optimized for a narrow ParadoxGPT specialist workflow, not general chat.
- It may inherit teacher-model annotation errors from the SFT data.
- It should not be used as an authority on paper correctness without source verification.
- K4 inconsistency detection may over-generate; monitor finish reason and consider lower max tokens or stricter answer format for deployment.
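The finish-reason monitoring suggested in the last item can be sketched as a small classifier. The `Generation` container is hypothetical, and the value `"length"` for a token-limit stop follows the convention of OpenAI-compatible APIs; check what your serving stack actually reports. The sketch also assumes the raw text still contains the `<think>` block.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    """Hypothetical container for one model response."""
    text: str
    finish_reason: str  # assumed: "length" means the token limit was hit


def diagnose(gen: Generation) -> str:
    """Classify a response as 'ok', 'truncated', or 'malformed'."""
    if gen.finish_reason == "length":
        return "truncated"
    # Expect exactly one think block, as in the training targets.
    if gen.text.count("<think>") != 1 or gen.text.count("</think>") != 1:
        return "malformed"
    answer = gen.text.split("</think>", 1)[1]
    return "ok" if answer.strip() else "malformed"


def truncation_rate(generations: list) -> float:
    """Fraction of responses that hit the token limit (0.0 for no data)."""
    if not generations:
        return 0.0
    truncated = sum(1 for g in generations if diagnose(g) == "truncated")
    return truncated / len(generations)
```

Tracking `truncation_rate` separately for K4 prompts would show whether a lower max-token setting or a stricter answer format is reducing over-generation.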
License and Use Restrictions
This repository is marked with license: other.
The model was trained on private ParadoxGPT SFT data derived from parsed academic papers, reviews, and teacher annotations. Rights to original papers and reviews remain with their respective authors, reviewers, venues, and publishers. This private model is provided for internal research and engineering use only. Redistribution or public release should be reviewed separately against source venue policies and applicable copyright rules.
Citation
bibtex
@model{zhang2026paracheckersft,
  title = {ParaChecker-SFT: A ParadoxGPT Checker Specialist Model},
  author = {Heng Zhang},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {https://huggingface.co/bhxdianzhang/ParaChecker-SFT},
  note = {Private ParadoxGPT supervised fine-tuned Checker model}
}
中文
概述
ParaChecker-SFT 是 ParadoxGPT 的 Checker 专家模型,由 Qwen 4B 基座监督微调而来。它面向科研论文工作流,不是通用聊天模型。
适用场景
ParaChecker-SFT 适合用于:
- 论文级科研写作与 review-support 工作流;
- 将论文上下文转成结构化 reasoning trace;
- ParadoxGPT 内部 research-agent 训练与评测;
- 辅助研究者做证据、论证和风险分析。
输出应结合原论文上下文复核,不能替代人类科研判断。
训练数据
模型使用私有数据集 bhxdianzhang/ParaSFT-checker 进行监督微调。
| Split | 样本数 |
|---|---|
| Train | 35,272 |
| Dev | 2,055 |
| Test | 2,288 |
| Total | 39,615 |
任务分布:
| Task type | Samples |
|---|---|
| K1_commitment_extraction | 7,942 |
| K2_claim_support_verification | 7,939 |
| K3_cross_section_consistency | 7,911 |
| K4_inconsistency_detection | 7,904 |
| K5_minimal_fix | 7,919 |
质量过滤:每条目标答案都包含且只包含一个配平的 <think>...</think> 思考块,后接最终答案。
训练摘要
基座模型:本地 Qwen 4B base model。
| 字段 | 数值 |
|---|---|
| Epochs | 3.0 |
| Learning rate | 3e-6 |
| Scheduler | cosine |
| 分布式设备数 | 2 |
| Gradient accumulation | 8 |
| Total train batch size | 16 |
| Final train loss | 0.8032357919 |
| Final eval loss | 0.8563711047 |
定性行为
本地 side-by-side 检查显示,ParaChecker-SFT 在 commitment extraction 和 claim-support verification 上提升明显,在 cross-section consistency 与 minimal-fix 任务上也有可用表现。主要残留问题集中在 K4 inconsistency detection:模型能找到有用信号,但可能过度展开并触发 token 上限;生产使用建议配合更严格的 decoding 或后续 preference/RL 训练。
为了获得最佳效果,请使用与训练数据一致的丰富论文上下文,例如 title、abstract、introduction、正文关键片段、figure/table captions、claims、reviews 或具体任务所需的 paper context。
具体示例
输入:
text
你是顶会论文论证闭环检查专家。给定论文完整上下文,请检验每个核心 claim 是否被方法和实验充分支撑。
论文标题: Adaptive Evidence Routing for Retrieval-Augmented Scientific QA
摘要: 论文声称 adaptive evidence router 能提升 scientific QA 的 factuality 和 citation precision。
引言: 论文动机是不同问题类型需要不同证据来源。
方法: 系统相比 Base RAG 增加三个组件:evidence router、citation-graph retriever 和 reranker。
实验: Table 3 只报告 Full System vs. Base RAG。唯一 ablation 是 "w/o adaptive evidence stack",同时移除三个组件。
Conclusion: The routing module is the key driver of factuality gains.
ParaChecker-SFT 期望输出片段:
text
这里最大的论证缺口是 conclusion 把系统级提升归因给 router,但实验没有隔离 router。

Commitment 1: router improves factuality.
支撑强度:weak。
Table 3 只能证明 full evidence stack 比 Base RAG 好,不能证明 router 单独有效。因为 ablation 同时移除了 router、citation graph 和 reranker。

Commitment 2: citation graph improves citation precision.
支撑强度:unclear。
论文报告了 citation precision 提升,但没有单独加入 citation graph 的对照,因此无法排除 reranker 或 router 带来的影响。

最小修复:增加 controlled ablation:Base RAG、+citation graph、+reranker、+router,并保证训练数据、模型大小、解码设置一致。
如果来不及补实验,至少把 conclusion 改成“the full evidence stack improves factuality”,不要说 router 是 key driver。
局限
- 模型针对 ParadoxGPT 窄域专家工作流优化,不是通用聊天模型。
- 模型可能继承 SFT 数据中的 teacher-model 标注误差。
- 不应在未检查原论文上下文的情况下,把输出当成科研事实。
- K4 inconsistency detection 可能过度生成;部署时应监控 finish reason,并考虑降低 max tokens 或使用更严格的回答格式。
引用
bibtex
@model{zhang2026paracheckersft,
  title = {ParaChecker-SFT: A ParadoxGPT Checker Specialist Model},
  author = {Heng Zhang},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {https://huggingface.co/bhxdianzhang/ParaChecker-SFT},
  note = {Private ParadoxGPT supervised fine-tuned Checker model}
}
Model provider: bhxdianzhang
Model tree: base Qwen/Qwen3.5-4B; fine-tuned: this model
Modalities: input Video, Text, Image; output Text