Eculid

HealthJudge

Model Details

Model type: Causal language model used as a binary helpfulness judge
Base model: lingshu-medical-mllm/Lingshu-7B
Task: Given a social-media post and a candidate note, output whether the note is Helpful or Not Helpful
Output format: `Final decision: yes` or `Final decision: no`
Primary domain: English health-related misinformation governance
Intended setting: Human-in-the-loop moderation, evaluation, and research

HealthJudge evaluates the helpfulness of a note. It is not intended to independently verify whether the post, note, or cited evidence is factually correct. In CrowdNotes+, helpfulness is used after separate evidence relevance and correctness checks.

Input Format

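The model was trained with a chat-style prompt. In the recommended prompt below, the first line is the system message and the remainder is the user message, with {post} and {note} replaced by the texts to evaluate: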

text
You are a precise text classifier.

You are given a Tweet and its corresponding Note:

Tweet: {post}
Note: {note}

The purpose of note is to add helpful context to tweet and keep people better informed.
Your task is to evaluate whether the Note is Helpful or Not Helpful based on the following criteria:

Helpful Criteria:
- Clear and/or well-written
- Cites high-quality sources
- Directly addresses the Tweet's claim
- Provides important context
- Neutral or unbiased language
- Other positive reason

Not Helpful Criteria:
- Incorrect information
- Sources missing or unreliable
- Misses key points or is irrelevant
- Hard to understand
- Argumentative or biased language
- Spam, harassment, or abuse
- Sources do not support note
- Opinion or speculation
- Note not needed on this Tweet
- Other negative reason

Instructions:
1. Carefully read the Tweet and the Note.
2. Analyze the Note using the Helpful and Not Helpful criteria above.
3. Respond with "Final decision: yes" if Helpful or "Final decision: no" if Not Helpful.

Quickstart

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Eculid/HealthJudge"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

post = "..."  # social-media post to evaluate
note = "..."  # candidate Community Note text, without evidence URLs if following the paper setup

messages = [
    {"role": "system", "content": "You are a precise text classifier."},
    {
        "role": "user",
        "content": f"""You are given a Tweet and its corresponding Note:

Tweet: {post}
Note: {note}

The purpose of note is to add helpful context to tweet and keep people better informed.
Your task is to evaluate whether the Note is Helpful or Not Helpful based on the following criteria:

Helpful Criteria:
- Clear and/or well-written
- Cites high-quality sources
- Directly addresses the Tweet's claim
- Provides important context
- Neutral or unbiased language
- Other positive reason

Not Helpful Criteria:
- Incorrect information
- Sources missing or unreliable
- Misses key points or is irrelevant
- Hard to understand
- Argumentative or biased language
- Spam, harassment, or abuse
- Sources do not support note
- Opinion or speculation
- Note not needed on this Tweet
- Other negative reason

Instructions:
1. Carefully read the Tweet and the Note.
2. Analyze the Note using the Helpful and Not Helpful criteria above.
3. Respond with "Final decision: yes" if Helpful or "Final decision: no" if Not Helpful."""
    },
]

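# Tokenize with the chat template; return_dict=True yields input_ids and attention_mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; temperature is omitted because it is ignored when do_sample=False.
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(response)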

Expected output (one of the following, depending on the note):

text
Final decision: yes

text
Final decision: no
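For downstream use, the generated string usually needs to be mapped to a label. The following is a minimal sketch of such a parser, not part of the released code: the function name `parse_decision` and the choice to return `None` for malformed output are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches the documented output format, tolerating case and extra whitespace.
_DECISION_RE = re.compile(r"final decision:\s*(yes|no)\b", re.IGNORECASE)


def parse_decision(response: str) -> Optional[bool]:
    """Return True for Helpful, False for Not Helpful, None if no decision is found."""
    match = _DECISION_RE.search(response)
    if match is None:
        # Malformed output: leave the case undecided instead of guessing.
        return None
    return match.group(1).lower() == "yes"
```

Returning `None` instead of defaulting to one label keeps unparseable outputs visible, so they can be routed to a human reviewer.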

Training Data

HealthJudge was trained on human-labeled health-related post–note pairs. The training setup uses the note text without appended evidence URLs so that helpfulness judgments focus on explanatory quality rather than directly judging evidence relevance or evidence correctness.
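To match this setup at inference time, evidence URLs can be removed from a note before it is placed in the prompt. The sketch below is one possible way to do this and is not the authors' preprocessing: it assumes URLs appear as plain `http(s)` links, and the helper name `strip_evidence_urls` is hypothetical.

```python
import re

# Assumption: evidence URLs are plain http(s) links inside the note text.
_URL_RE = re.compile(r"https?://\S+")


def strip_evidence_urls(note: str) -> str:
    """Remove http(s) URLs and collapse the whitespace they leave behind."""
    without_urls = _URL_RE.sub("", note)
    # split()/join also flattens newlines into single spaces.
    return " ".join(without_urls.split())
```

Note that this also collapses line breaks; if the note's layout matters, a narrower whitespace cleanup would be needed.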

The dataset used for HealthJudge contains:

| Split / Role | Helpful | Not Helpful | Total |
| --- | --- | --- | --- |
| All labeled pairs | 2,971 | 742 | 3,713 |
| Held-out evaluation | 800 | 200 | 1,000 |

Each instance was formatted as a chat prompt, and the training loss was applied only to the final decision tokens: `Final decision: yes` or `Final decision: no`.
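Restricting the loss to the decision tokens is commonly done by masking the prompt positions in the label sequence. The sketch below illustrates that general technique under stated assumptions; it is not the authors' training code. The helper name `build_labels` and the token IDs in the usage example are made up, and `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids: List[int], target_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and target tokens; supervise only the target positions."""
    input_ids = prompt_ids + target_ids
    # Prompt positions are masked so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Usage with made-up token IDs: three prompt tokens, two decision tokens.
input_ids, labels = build_labels([11, 12, 13], [7, 8])
print(labels)  # [-100, -100, -100, 7, 8]
```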

Training Procedure

HealthJudge was trained using full fine-tuning.

| Hyperparameter | Value |
| --- | --- |
| Base model | `lingshu-medical-mllm/Lingshu-7B` |
| Epochs | 2 |
| Optimizer | AdamW |
| Learning rate | `1e-5` |
| Gradient accumulation | 16 |
| Precision | bfloat16 |
| Objective | Final-decision-token prediction |

Evaluation

HealthJudge was evaluated on 1,000 unseen human-labeled post–note pairs.

| Model | Macro-F1 (%) | Macro-Accuracy (%) |
| --- | --- | --- |
| GPT-4.1 | 74.28 | 74.19 |
| Gemini-2.5-Flash | 68.36 | 65.13 |
| Claude-Sonnet-4 | 78.14 | 76.44 |
| Lingshu-32B | 64.71 | 62.25 |
| Lingshu-7B | 51.66 | 51.63 |
| HealthJudge | | |

These results indicate that HealthJudge better aligns with human helpfulness labels than the compared general-purpose and medical LLM baselines in the reported setup.
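Because the held-out set is imbalanced (800 Helpful, 200 Not Helpful), macro-averaged metrics weight both classes equally. The sketch below shows how such scores can be computed from gold and predicted labels. It assumes that "Macro-Accuracy" means the mean of per-class recall, which this card does not define; the function name `macro_scores` is hypothetical, and this is not the authors' evaluation script.

```python
from typing import List, Sequence, Tuple


def macro_scores(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    classes: Tuple[str, ...] = ("yes", "no"),
) -> Tuple[float, float]:
    """Return (macro-F1, mean per-class recall), both as fractions in [0, 1]."""
    f1s: List[float] = []
    recalls: List[float] = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
        recalls.append(recall)
    # Unweighted mean over classes, so the minority class counts as much as the majority.
    return sum(f1s) / len(classes), sum(recalls) / len(classes)
```

Multiply the returned fractions by 100 to compare them with the percentages in the table.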

Relationship to CrowdNotes+

CrowdNotes+ evaluates generated or human-written notes through a hierarchical pipeline:

1. Evidence relevance: whether the cited or retrieved evidence is relevant to the flagged post.
2. Evidence correctness: whether the note accurately represents the evidence.
3. Note helpfulness: whether the note provides useful context for readers.

HealthJudge is used for the third stage: note helpfulness.
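The staged structure can be sketched as a simple gate, shown below. This is an illustration only, not the CrowdNotes+ implementation: the function `evaluate_note`, its callable arguments, and the decision to stop at the first failed stage are all assumptions. In practice `is_helpful` would wrap a HealthJudge call, while the first two checks come from separate components.

```python
from typing import Callable, Dict, Optional

# Hypothetical stage signatures: (post, note, evidence) -> pass/fail.
EvidenceCheck = Callable[[str, str, str], bool]
HelpfulnessCheck = Callable[[str, str], bool]


def evaluate_note(
    post: str,
    note: str,
    evidence: str,
    is_relevant: EvidenceCheck,
    is_correct: EvidenceCheck,
    is_helpful: HelpfulnessCheck,
) -> Dict[str, Optional[object]]:
    """Run the three stages in order and stop at the first failure."""
    if not is_relevant(post, note, evidence):
        return {"passed": False, "failed_stage": "evidence_relevance"}
    if not is_correct(post, note, evidence):
        return {"passed": False, "failed_stage": "evidence_correctness"}
    # Helpfulness is judged last and sees only the post and the note.
    if not is_helpful(post, note):
        return {"passed": False, "failed_stage": "note_helpfulness"}
    return {"passed": True, "failed_stage": None}
```

Recording the failed stage makes it easier for a human reviewer to see why a note was rejected.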

Limitations and Safety

HealthJudge is a decision-support model for research and human-in-the-loop workflows. Important limitations include:

- Not a factuality checker: A note may sound helpful but still contain unsupported or inaccurate information. Use separate evidence relevance and correctness checks.
- Health-domain scope: The model was developed for English health-related Community Notes. Performance may degrade outside this domain.
- Potential automation bias: Users may over-trust model outputs. Human review is required before making moderation or public-facing decisions.
- No medical advice: The model does not provide diagnosis, treatment, prevention advice, or clinical recommendations.
- Data and platform context: The model reflects patterns in Community Notes-style annotations and may not generalize to all social-media platforms or communities.

For high-stakes use cases, HealthJudge should be paired with expert oversight, transparent evidence review, and domain-specific validation.

Citation

If you use HealthJudge, please cite:

bibtex
@misc{wu2026beyondcrowd,
  title        = {Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation},
  author       = {Jiaying Wu and Zihang Fu and Haonan Wang and Fanxiao Li and Jiafeng Guo and Preslav Nakov and Min-Yen Kan},
  year         = {2026},
  eprint       = {2510.11423},
  archivePrefix = {arXiv},
  primaryClass = {cs.SI},
  url          = {https://arxiv.org/abs/2510.11423}
}

Model provider: Eculid
Model tree: base model lingshu-medical-mllm/Lingshu-7B; fine-tuned: this model
Modalities: input Text, Image; output Text
