README

OpenRubrics/RubricARROW-8B-Judge

This is the 8B RubricARROW-Judge model, fine-tuned from Qwen/Qwen3-8B and introduced in the paper RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains.

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "OpenRubrics/RubricARROW-8B-Judge"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

To evaluate the model, build the message in the following format.

The rubric_item should be generated with a RubricARROW-Rubric model.

Python

JUDGE_PROMPT_TEMPLATE = """
Your job is to look at a conversation and a set of rubric items, and score the last turn (i.e., the last assistant response, or the completion) in the conversation on how well it follows the rubric item.
# Conversation
<<conversation>>
# Rubric item
<<rubric_item>>
# Instructions
Return a json object. For each rubric item i (starting from 1), keys must be exactly "explanation_i" and "criteria_met_i" for each i and it includes two top-level fields in the JSON object:
- The "explanation_i" field should be a string explaining why the response does or does not meet the criteria of the rubric item.
- The "criteria_met_i" field should be a boolean indicating (true/false) whether the response meets the criteria of the rubric item. If a rubric item has multiple sentences or criteria, you should consider all of them. If any of the criteria is not met, the answer should be false. Only return true is all of the criteria are met.
- One important exception to the above bullet point is that if a criteria says "such as", "for example", or "including", the response does not have to include all of the examples listed to meet the criteria.
# Final Output Format (a single JSON object, not an array)
{
"explanation_1": "...",
"criteria_met_1": true/false,
"explanation_2": "...",
"criteria_met_2": true/false,
... repeat this pattern for every rubric item i in order (i = 1, 2, 3, ...)
}
# Final instruction
Return just the json object. Do not include any other text in the response.
""".strip()
# `instruction`, `response`, and `rubric_item` are strings you supply.
conversation = f"user: {instruction}\nassistant: {response}"
user_text = (
    JUDGE_PROMPT_TEMPLATE
    .replace("<<conversation>>", conversation)
    .replace("<<rubric_item>>", rubric_item)
)
messages_list = [
    {"role": "user", "content": user_text},
]
message = tok.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
# Remaining step: Use either HF or vLLM for evaluation.
# ...
# ...
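
Once the judge has produced its reply, the JSON object described in the prompt still has to be read back into per-item verdicts. The following is a minimal sketch of that step, not part of the original model card: the helper name `parse_judge_output`, the brace-trimming fallback, and the handling of string-valued booleans are all assumptions made for illustration.

```python
import json


def _as_bool(value):
    # Assumption: accept a real JSON boolean, or a "true"/"false" string as a fallback.
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def parse_judge_output(text):
    """Turn the judge's JSON reply into a list of per-rubric-item verdicts.

    Hypothetical helper: it expects the "explanation_i" / "criteria_met_i"
    keys requested by JUDGE_PROMPT_TEMPLATE, numbered from 1.
    """
    # Tolerate stray text around the object by keeping only the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    obj = json.loads(text[start:end + 1])

    results = []
    i = 1
    # Walk the items in order until the next "criteria_met_i" key is missing.
    while f"criteria_met_{i}" in obj:
        results.append({
            "index": i,
            "criteria_met": _as_bool(obj[f"criteria_met_{i}"]),
            "explanation": str(obj.get(f"explanation_{i}", "")),
        })
        i += 1
    return results


# Hypothetical judge reply covering two rubric items.
raw = (
    '{"explanation_1": "Cites a source.", "criteria_met_1": true, '
    '"explanation_2": "Too long.", "criteria_met_2": false}'
)
parsed = parse_judge_output(raw)
```

Here `parsed` holds one dictionary per rubric item, in the order the items were scored.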

For probability-based scoring, we compute the final score as follows:

Python

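def weight(tags):
    # Normalize tags to lowercase strings; "hard rule" outweighs "principle".
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0


def group_score(rubric_outputs):
    # Weighted sum of (true_prob - false_prob) over all rubric items.
    return sum(
        (x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
        for x in rubric_outputs
    )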
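
As a worked example of the scoring above, the sketch below repeats the two functions so that it runs on its own and applies them to three rubric items. The probabilities and tags are hypothetical values chosen for illustration; how `true_prob` and `false_prob` are obtained from the judge is not specified here and is left to the evaluation code.

```python
def weight(tags):
    # "hard rule" items count three times as much as "principle" items;
    # items with neither tag contribute nothing.
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0


def group_score(rubric_outputs):
    # Weighted sum of (true_prob - false_prob) over all rubric items.
    return sum(
        (x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
        for x in rubric_outputs
    )


# Hypothetical judge outputs for one response.
rubric_outputs = [
    {"tags": ["Hard Rule"], "true_prob": 0.75, "false_prob": 0.25},  # (0.75 - 0.25) * 3.0 = 1.5
    {"tags": ["Principle"], "true_prob": 0.25, "false_prob": 0.75},  # (0.25 - 0.75) * 1.0 = -0.5
    {"tags": ["other"], "true_prob": 1.0, "false_prob": 0.0},        # weight 0.0, ignored
]

score = group_score(rubric_outputs)
print(score)  # 1.0
```

Tag matching is case-insensitive, so `"Hard Rule"` and `"hard rule"` receive the same weight, and an item whose tags include neither label does not affect the total.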

Citation

If you find our work helpful, please consider citing our paper:

bibtex

@misc{jiang2026rubric,
  title={RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains},
  author={Haoxiang Jiang and Zihan Dong and Tianci Liu and Wanying Wang and Ran Xu and Tony Yu and Linjun Zhang and Haoyu Wang},
  year={2026},
  eprint={2605.29156},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.29156},
}

Model provider: OpenRubrics

Model tree: base Qwen/Qwen3-8B; fine-tuned: this model

Modalities: input Text; output Text
