eagle0504/multireward-grpo-fintech-single-qwen2.5-1.5b API & Inference Endpoint

Advantage formulation

SINGLE = Single-reward GRPO: ignore the other channels, train on correctness only (multi-reward-ablation baseline)

This is the difference vs the AN baseline:

	AN (baseline)	NA (paper's recommendation)
Step 1	aggregate channels: `s_j = w^T r_j`	per-channel normalize: `z_jl = (r_jl - mean_l) / std_l`
Step 2	group-normalize: `A_j = (s_j - mean) / std`	aggregate: `A_j = sum_l w_l z_jl`
Influence	dominated by high-σ channel (Prop 1)	weight-proportional (Prop 1)
MSE floor	sensitive to scalarization	`(τ²/m) w^T C w` (Thm 3)

Training data

huggingface.co/datasets/eagle0504/multireward-grpo-fintech-customer-comms — synthetic fintech customer-service conversations, 300 scenarios, with reward channels:

compliance (binary): the harder gate
politeness_gated (continuous): gated by compliance
action (binary): clear next-step indicator

How to use

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-1.5B-Instruct"
adapter = "eagle0504/multireward-grpo-fintech-single-qwen2.5-1.5b"

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16")
model = PeftModel.from_pretrained(model, adapter)
model.eval()
# ... generate ...

Hyperparameters (from training)

See metrics.json in the repo for the full training trajectory (loss, PG loss, KL, per-step reward).

Citation

bibtex
@misc{yin2026multireward,
  title={Conditioned Multi-Reward Advantage Estimation: A Finite-Sample Analysis},
  author={Yin, Yiqiao},
  year={2026},
}

License

Apache-2.0 (matches the Qwen base model).

multireward-grpo-fintech-single-qwen2.5-1.5b

Get help setting up a custom Dedicated Endpoints.

README