OpenRubrics/RubricARROW-8B-Judge API & Inference Endpoint

This is an 8B RubricARROW-Judge model, fine-tuned from Qwen/Qwen3-8B and introduced in the paper RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains.

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "OpenRubrics/RubricARROW-8B-Judge"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

To evaluate a response with the model, build the input message in the following format.

Here, rubric_item should be generated with a RubricARROW-Rubric model.

Python
JUDGE_PROMPT_TEMPLATE = """
Your job is to look at a conversation and a set of rubric items, and score the last turn (i.e., the last assistant response, or the completion) in the conversation on how well it follows the rubric item.

# Conversation
<<conversation>>

# Rubric item
<<rubric_item>>

# Instructions
Return a json object. For each rubric item i (starting from 1), keys must be exactly "explanation_i" and "criteria_met_i" for each i and it includes two top-level fields in the JSON object:
- The "explanation_i" field should be a string explaining why the response does or does not meet the criteria of the rubric item.
- The "criteria_met_i" field should be a boolean indicating (true/false) whether the response meets the criteria of the rubric item. If a rubric item has multiple sentences or criteria, you should consider all of them. If any of the criteria is not met, the answer should be false. Only return true is all of the criteria are met.
- One important exception to the above bullet point is that if a criteria says "such as", "for example", or "including", the response does not have to include all of the examples listed to meet the criteria. 

# Final Output Format (a single JSON object, not an array)
{
  "explanation_1": "...",
  "criteria_met_1": true/false,
  "explanation_2": "...",
  "criteria_met_2": true/false,
  ... repeat this pattern for every rubric item i in order (i = 1, 2, 3, ...)
}

# Final instruction
Return just the json object. Do not include any other text in the response.
""".strip()

# `instruction` is the user prompt; `response` is the assistant reply being judged.
conversation = f"user: {instruction}\n\nassistant: {response}"

user_text = (
    JUDGE_PROMPT_TEMPLATE
    .replace("<<conversation>>", conversation)
    .replace("<<rubric_item>>", rubric_item)
)

messages_list = [
    {"role": "user", "content": user_text},
]
message = tok.apply_chat_template(
    messages_list, 
    tokenize=False, 
    add_generation_prompt=True,
    enable_thinking=False
)

# Remaining step: Use either HF or vLLM for evaluation.
# ...
# ...
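Once the judge has generated its reply, the JSON object described in the prompt has to be turned back into per-item verdicts. The following is a minimal sketch of that step, not the authors' evaluation code: `parse_judge_output` is a hypothetical helper name, and it assumes the reply contains a single JSON object whose keys follow the `explanation_i` / `criteria_met_i` pattern from the template.

```python
import json

def parse_judge_output(raw_text):
    """Parse the judge's JSON reply into a list of per-item verdicts (hypothetical helper)."""
    # Tolerate stray text around the object by taking the outermost {...} span.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(raw_text[start:end + 1])

    results = []
    i = 1
    # Walk items 1, 2, 3, ... until the next "criteria_met_i" key is missing.
    while f"criteria_met_{i}" in data:
        results.append({
            "index": i,
            # Only a real JSON `true` counts as met; anything else is treated as not met.
            "criteria_met": data[f"criteria_met_{i}"] is True,
            "explanation": str(data.get(f"explanation_{i}", "")),
        })
        i += 1
    return results

# Made-up reply for illustration only.
sample = (
    '{"explanation_1": "Cites a source.", "criteria_met_1": true, '
    '"explanation_2": "Too long.", "criteria_met_2": false}'
)
verdicts = parse_judge_output(sample)
print([v["criteria_met"] for v in verdicts])  # [True, False]
```

Raising on a missing object, rather than returning an empty list, keeps malformed generations from being silently scored as "no criteria met".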

For probability-based scoring, we compute the final score as follows:

Python
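def weight(tags):
    # Tags are compared case-insensitively: "hard rule" items weigh 3.0,
    # "principle" items weigh 1.0, and anything else contributes nothing.
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0

def group_score(rubric_outputs):
    # Weighted sum of (true_prob - false_prob) over all rubric items of one response.
    return sum((x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
               for x in rubric_outputs)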
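A small worked example may make the weighting concrete. The two functions are repeated so the snippet runs on its own; the `rubric_outputs` values are invented for illustration, and how `true_prob` and `false_prob` are obtained from the judge is not specified here.

```python
def weight(tags):
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0

def group_score(rubric_outputs):
    return sum((x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
               for x in rubric_outputs)

# Hypothetical judge outputs for one response (probabilities are made up).
rubric_outputs = [
    {"true_prob": 0.75, "false_prob": 0.25, "tags": ["Hard Rule"]},  # (0.75 - 0.25) * 3.0 = 1.5
    {"true_prob": 0.25, "false_prob": 0.75, "tags": ["Principle"]},  # (0.25 - 0.75) * 1.0 = -0.5
    {"true_prob": 1.0, "false_prob": 0.0, "tags": ["Other"]},        # unrecognized tag, weight 0.0
]

score = group_score(rubric_outputs)
print(score)  # 1.0
```

Because each term is a signed margin, a confidently failed item lowers the score rather than just adding nothing, and an item tagged with neither "hard rule" nor "principle" is ignored entirely.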

Citation

If you find our work helpful, please consider citing our paper:

bibtex
@misc{jiang2026rubric,
      title={RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains}, 
      author={Haoxiang Jiang and Zihan Dong and Tianci Liu and Wanying Wang and Ran Xu and Tony Yu and Linjun Zhang and Haoyu Wang},
      year={2026},
      eprint={2605.29156},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.29156}, 
}

