hugruby/mathstral-7b-mismatched-correct-drafts API & Inference Endpoint

How to use

This is a LoRA adapter — load it on top of the base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE    = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-correct-drafts"

# The tokenizer files ship with the adapter repo; the weights come from the base model.
tok   = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)  # greedy decoding

# Canonical: the plain, draft-free prompt the model was trained and evaluated on (no [INST]).
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)
ids = tok(PROMPT, return_tensors="pt").to(model.device)
out = model.generate(**ids, **gen)
# Decode only the newly generated tokens, skipping the prompt prefix.
print(tok.decode(out[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```
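
Because the prompt asks for the final answer inside `\boxed{}`, it is often handy to pull that answer out of the decoded text. The helper below is a minimal sketch, not part of the released code: `extract_boxed` is a hypothetical name, and it assumes the last `\boxed{...}` in the output holds the final answer.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:      # matching close of \boxed{ found
                return "".join(chars)
        chars.append(ch)
    return None        # braces never balanced (e.g. truncated generation)


# Nested braces such as \frac{1}{2} are kept intact.
print(extract_boxed("so the answer is $\\boxed{\\frac{1}{2}}$."))
```

A `None` result usually means the generation was cut off before the answer was written, which is worth counting separately from a wrong answer.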

Optional: the `[INST]` chat format (out-of-distribution)

The shipped `chat_template.jinja` is Mathstral's original `[INST]` chat template. This adapter was not trained in that format, so `apply_chat_template(...)` is out-of-distribution and generally underperforms the plain prompt above. It is included only so you can compare the two formats side by side:

```python
chat = tok.apply_chat_template(
    [{"role": "user",
      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True,
    return_dict=True,  # return input_ids plus attention_mask rather than a bare tensor
    return_tensors="pt").to(model.device)
out = model.generate(**chat, **gen)
print(tok.decode(out[0][chat["input_ids"].shape[1]:], skip_special_tokens=True))
```

How it was trained

Trained with Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) using TRL's `GRPOTrainer` on top of Unsloth's `FastLanguageModel`, on the `mismatched_correct` data config. The reward is purely binary (`mathematically_quasi_correct`): the correction-bonus, copy-penalty, and corrupt-penalty terms are all set to 0.
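
To make the effect of `scale_rewards=False` with a binary reward concrete, the sketch below computes group-relative advantages for one prompt. It assumes the usual reading of that setting: rewards are centered on the group mean and not divided by the group standard deviation. The function name `centered_advantages` is illustrative and is not taken from TRL or the training script.

```python
def centered_advantages(rewards):
    """Group-mean-centered advantages, with no division by the group std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# One prompt, 16 sampled completions, 4 of them judged correct (reward 1.0).
group = [1.0] * 4 + [0.0] * 12
adv = centered_advantages(group)

print(adv[0], adv[-1])  # advantage of a correct vs. an incorrect completion

# A group that is all correct (or all wrong) carries no learning signal.
print(centered_advantages([1.0] * 16))
```

Under this reading, only prompts whose 16 completions disagree on correctness contribute a nonzero advantage.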

Training command:

```bash
python scripts/train.py \
  --model mistralai/Mathstral-7B-v0.1 \
  --dataset-path data/mismatched_correct \
  --output-dir outputs/mismatched_correct \
  --max-steps 2222 \
  --gradient-accumulation-steps 4 \
  --max-completion-length 4096 \
  --max-seq-length 8192 \
  --learning-rate 5e-6 --lr-scheduler-type constant \
  --beta 0 \
  --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
  --adam-beta2 0.99 \
  --save-steps 50 --gpu-mem-util 0.5
```

| Hyperparameter | Value |
| --- | --- |
| Base model | `mistralai/Mathstral-7B-v0.1` |
| Method | Dr. GRPO (`loss_type=dr_grpo`, `scale_rewards=False`) |
| LoRA rank / alpha | r = 16, α = 32 → scaling γ = α/r = 2 |
| LoRA targets / dropout | `q,k,v,o,gate,up,down` (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, constant schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length * | 8192 |
| Max prompt tokens * | disabled (no truncation; longest prompt 3,317 tokens < 8,192 − 4,096 = 4,096, so the 4,096 max completion length is respected) |
| Max steps | 2222 |
| Released checkpoint | global step 2000 (epoch 0.900) |
| Random seed | 42 |
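
The derived figures in the table can be cross-checked with a few lines of arithmetic. This is a consistency check under two assumptions that the card does not state outright: training ran on a single device, and the 8,888 prompts mentioned in the length-budget note make up the full training set.

```python
# Values taken from the hyperparameter table.
per_device_batch = 1      # problems per device per micro-step
grad_accum = 4
num_generations = 16
lora_r, lora_alpha = 16, 32
released_step = 2000

# Assumption: 8,888 is the size of the training set (figure from the length-budget note).
num_prompts = 8_888

problems_per_step = per_device_batch * grad_accum
completions_per_step = problems_per_step * num_generations
steps_per_epoch = num_prompts / problems_per_step
epoch_at_release = released_step / steps_per_epoch
lora_scaling = lora_alpha / lora_r

print(completions_per_step, steps_per_epoch, round(epoch_at_release, 3), lora_scaling)
```

Under these assumptions, `--max-steps 2222` corresponds to exactly one pass over the data, and the released step-2000 checkpoint sits at about 0.9 of that pass.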

* Length budgets across all four variants:

| Variant | max-seq-length | max-completion | max-prompt-tokens |
| --- | --- | --- | --- |
| mismatched-wrong | 7168 | 4096 | 3072 |
| matched-wrong | 7168 | 4096 | 3072 |
| no-draft | 7168 | 4096 | disabled (equivalent to 3,072, as all prompts are short) |
| mismatched-correct | 8192 | 4096 | disabled |

For a strict apples-to-apples comparison, mismatched-correct should have used --max-seq-length 7168 and --max-prompt-tokens 3072 like the other three variants; using the larger 8,192 with the prompt cap left off was an oversight. The effect should be negligible: only 6 of 8,888 prompts exceed 3,072 tokens (longest 3,317), so for the other 8,882 the run is identical to a 7,168 / 3,072 setup. For those 6, the prompt is left untruncated, but the 4,096 max completion length is still respected and the sequence runs only slightly past 7,168 (at most 3,317 + 4,096 = 7,413, well under 8,192). To train an exactly matched version yourself, change --max-seq-length 8192 to 7168 and add --max-prompt-tokens 3072.
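
The length-budget argument can be checked numerically. The helper below is an illustrative sketch: `worst_case_sequence_tokens` is a hypothetical name, and it models a prompt cap as simply clipping the prompt to at most `max_prompt_tokens` tokens, which may differ in detail from how the training script truncates.

```python
def worst_case_sequence_tokens(prompt_tokens, max_completion, max_prompt_tokens=None):
    """Longest possible sequence: (optionally capped) prompt plus a full-length completion."""
    if max_prompt_tokens is not None:
        prompt_tokens = min(prompt_tokens, max_prompt_tokens)
    return prompt_tokens + max_completion


LONGEST_PROMPT = 3317   # longest prompt reported above
MAX_COMPLETION = 4096

# Released run: no prompt cap, 8,192-token sequence budget.
released = worst_case_sequence_tokens(LONGEST_PROMPT, MAX_COMPLETION)

# Strictly matched setup: prompts capped at 3,072, 7,168-token sequence budget.
matched = worst_case_sequence_tokens(LONGEST_PROMPT, MAX_COMPLETION, max_prompt_tokens=3072)

print(released, released <= 8192)
print(matched, matched <= 7168)
```

Both setups leave room for a full 4,096-token completion; they differ only in whether the 6 longest prompts are seen whole or clipped.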

Files

adapter_model.safetensors, adapter_config.json — the LoRA adapter (load with PEFT on the base model)
tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json — tokenizer
chat_template.jinja — Mathstral's [INST] template (see the out-of-distribution note above)

Citation

```bibtex
@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}
```

License

Apache-2.0. The base model (Mathstral-7B-v0.1) and the draft model (Qwen2.5-Math-1.5B) are both Apache-2.0.
