OpenRubrics/RubricARROW-8B-Judge
This is the 8B RubricARROW-Judge model, fine-tuned from Qwen/Qwen3-8B and introduced in the paper RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenRubrics/RubricARROW-8B-Judge"

# Load the tokenizer and the judge model from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```
To evaluate with the model, build the message in the following format.
Here, rubric_item should be generated with a RubricARROW-Rubric model.
```python
JUDGE_PROMPT_TEMPLATE = """Your job is to look at a conversation and a set of rubric items, and score the last turn (i.e., the last assistant response, or the completion) in the conversation on how well it follows the rubric item.

# Conversation
<<conversation>>

# Rubric item
<<rubric_item>>

# Instructions
Return a json object. For each rubric item i (starting from 1), keys must be exactly "explanation_i" and "criteria_met_i" for each i and it includes two top-level fields in the JSON object:
- The "explanation_i" field should be a string explaining why the response does or does not meet the criteria of the rubric item.
- The "criteria_met_i" field should be a boolean indicating (true/false) whether the response meets the criteria of the rubric item. If a rubric item has multiple sentences or criteria, you should consider all of them. If any of the criteria is not met, the answer should be false. Only return true is all of the criteria are met.
- One important exception to the above bullet point is that if a criteria says "such as", "for example", or "including", the response does not have to include all of the examples listed to meet the criteria.

# Final Output Format (a single JSON object, not an array)
{
  "explanation_1": "...",
  "criteria_met_1": true/false,
  "explanation_2": "...",
  "criteria_met_2": true/false,
  ... repeat this pattern for every rubric item i in order (i = 1, 2, 3, ...)
}

# Final instruction
Return just the json object. Do not include any other text in the response.
""".strip()

# `instruction`, `response`, and `rubric_item` are strings supplied by the caller.
# The turns are joined with a blank line here (assumed separator).
conversation = f"user: {instruction}\n\nassistant: {response}"

user_text = (
    JUDGE_PROMPT_TEMPLATE
    .replace("<<conversation>>", conversation)
    .replace("<<rubric_item>>", rubric_item)
)

messages_list = [
    {"role": "user", "content": user_text},
]

# Render the chat prompt as text, with thinking mode disabled.
message = tok.apply_chat_template(
    messages_list,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Remaining step: Use either HF or vLLM for evaluation.
# ...
# ...
```
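Once the judge has generated its reply, the JSON object requested by the prompt still has to be read back into per-rubric verdicts. The sketch below shows one way this might be done with the standard library only. The helper name `parse_judge_output` and the sample reply are hypothetical and not part of the released code; the key names `explanation_i` and `criteria_met_i` come from the prompt template above.

```python
import json


def parse_judge_output(text):
    """Return a list of (explanation, criteria_met) pairs, ordered by rubric index.

    Hypothetical helper: assumes the reply contains one JSON object with the
    keys "explanation_i" and "criteria_met_i" described in the judge prompt.
    """
    # Tolerate stray text around the object by slicing from the first "{"
    # to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    obj = json.loads(text[start:end + 1])

    results = []
    i = 1
    # Walk rubric items 1, 2, 3, ... until an index is missing.
    while f"criteria_met_{i}" in obj:
        met = obj[f"criteria_met_{i}"]
        if not isinstance(met, bool):
            raise ValueError(f"criteria_met_{i} is not a boolean: {met!r}")
        results.append((obj.get(f"explanation_{i}", ""), met))
        i += 1
    return results


# Illustrative reply, written by hand for this example.
sample = (
    '{"explanation_1": "Cites a source.", "criteria_met_1": true, '
    '"explanation_2": "No units given.", "criteria_met_2": false}'
)
parsed = parse_judge_output(sample)
print(parsed)
```

Rejecting non-boolean values, rather than coercing them, avoids silently treating a string such as "false" as true.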
For probability-based scoring, we compute the final score as follows:
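```python
def weight(tags):
    # Normalize tags; "hard rule" items count 3x, "principle" items 1x, others 0.
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0


def group_score(rubric_outputs):
    # Weighted sum of (true_prob - false_prob) over all rubric items.
    return sum(
        (x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
        for x in rubric_outputs
    )
```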
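As a worked example of the scoring rule above, the sketch below restates the two functions so that it runs on its own, then scores three rubric items. The probabilities and tags are made-up illustrative values, not model outputs; the field names `true_prob`, `false_prob`, and `tags` are the ones used by `group_score`.

```python
def weight(tags):
    t = {str(x).strip().lower() for x in (tags or [])}
    return 3.0 if "hard rule" in t else 1.0 if "principle" in t else 0.0


def group_score(rubric_outputs):
    return sum(
        (x.get("true_prob", 0.0) - x.get("false_prob", 0.0)) * weight(x.get("tags"))
        for x in rubric_outputs
    )


# Illustrative inputs only (assumed values, chosen to keep the arithmetic exact).
rubric_outputs = [
    {"true_prob": 0.75, "false_prob": 0.25, "tags": ["Hard Rule"]},  # (0.50) * 3.0 = 1.5
    {"true_prob": 0.25, "false_prob": 0.75, "tags": ["principle"]},  # (-0.50) * 1.0 = -0.5
    {"true_prob": 1.0, "false_prob": 0.0, "tags": []},               # untagged: weight 0.0
]

score = group_score(rubric_outputs)
print(score)  # → 1.0
```

Tags are matched case-insensitively, an item tagged with neither "hard rule" nor "principle" contributes nothing, and an item carrying both tags takes the hard-rule weight of 3.0.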
Citation
If you find our work helpful, please consider citing our paper:
```bibtex
@misc{jiang2026rubric,
  title={RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains},
  author={Haoxiang Jiang and Zihan Dong and Tianci Liu and Wanying Wang and Ran Xu and Tony Yu and Linjun Zhang and Haoyu Wang},
  year={2026},
  eprint={2605.29156},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.29156},
}
```