License: apache-2.0

How to use

This is a LoRA adapter — load it on top of the base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-correct-drafts"

tok = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)

# CANONICAL: the plain draft-free prompt the model was trained and evaluated on (no [INST]).
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)

ids = tok(PROMPT, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```
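
Because the prompt asks for the final answer inside \boxed{}, a small helper can pull that answer out of the decoded text for scoring. The sketch below is not part of the released code: `extract_boxed` is a hypothetical name, and it assumes the last \boxed{...} in the output holds the final answer.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the brace opened by \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never closed, so there is no complete answer


# Hand-written sample output for the example problem above.
sample = "x^2+y^2 = (x+y)^2 - 2xy = 36 - 10 = \\boxed{26}"
print(extract_boxed(sample))  # prints 26
```

Counting brace depth, rather than using a simple regular expression, keeps nested LaTeX such as \boxed{\frac{1}{2}} intact.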

Optional: the [INST] chat format (out-of-distribution)

The shipped chat_template.jinja is Mathstral's original [INST] chat template. This adapter was not trained in that format, so apply_chat_template(...) is out-of-distribution and generally underperforms the plain prompt above — it is included only so you can A/B both:

```python
# Reuses tok, model, problem, and gen from the snippet above.
ids = tok.apply_chat_template(
    [{"role": "user",
      "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}."}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
```

How it was trained

Trained with Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) using TRL's GRPOTrainer on top of Unsloth's FastLanguageModel, on the mismatched_correct data config. The reward is the binary mathematically_quasi_correct check; the correction-bonus, copy-penalty, and corrupt-penalty terms are all set to 0, so the reward is purely binary.
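
To make the objective concrete: with scale_rewards=False, the group-relative advantage of a completion is its reward minus the mean reward of the completions sampled for the same prompt, with no division by the group's standard deviation. The sketch below illustrates this for one prompt with a binary reward. `group_advantages` is a hypothetical helper, not TRL's code, and the sample rewards are made up for illustration.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one prompt's group of completions (no std scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Illustrative group of 16 completions: 4 correct (reward 1), 12 incorrect (reward 0).
rewards = [1.0] * 4 + [0.0] * 12
adv = group_advantages(rewards)
print(adv[0], adv[-1])  # prints 0.75 -0.25

# A group where every completion gets the same reward yields all-zero advantages.
print(group_advantages([1.0] * 16)[0])  # prints 0.0
```

With β = 0 there is no KL term, so a group whose completions are all correct or all wrong contributes no gradient signal under this scheme. Separately, TRL documents loss_type=dr_grpo as normalizing token-level losses by a constant tied to the max completion length rather than by each sequence's own length.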

Training command:

```bash
python scripts/train.py \
    --model mistralai/Mathstral-7B-v0.1 \
    --dataset-path data/mismatched_correct \
    --output-dir outputs/mismatched_correct \
    --max-steps 2222 \
    --gradient-accumulation-steps 4 \
    --max-completion-length 4096 \
    --max-seq-length 8192 \
    --learning-rate 5e-6 --lr-scheduler-type constant \
    --beta 0 \
    --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
    --adam-beta2 0.99 \
    --save-steps 50 --gpu-mem-util 0.5
```

| Hyperparameter | Value |
|---|---|
| Base model | mistralai/Mathstral-7B-v0.1 |
| Method | Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) |
| LoRA rank / alpha | r = 16, α = 32 → scaling γ = α/r = 2 |
| LoRA targets / dropout | q, k, v, o, gate, up, down (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, constant schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length * | 8192 |
| Max prompt tokens * | disabled (no truncation; longest prompt 3,317 tok < 8,192 − 4,096, so the 4,096 max completion length is respected) |
| Max steps | 2222 |
| Released checkpoint | global step 2000 (epoch 0.900) |
| Random seed | 42 |

* Length budgets across all four variants:

| Variant | max-seq-length | max-completion | max-prompt-tokens |
|---|---|---|---|
| mismatched-wrong | 7168 | 4096 | 3072 |
| matched-wrong | 7168 | 4096 | 3072 |
| no-draft | 7168 | 4096 | disabled (equivalent to 3,072, as all prompts are short) |
| mismatched-correct | 8192 | 4096 | disabled |

For a strict apples-to-apples comparison, mismatched-correct should have used --max-seq-length 7168 and --max-prompt-tokens 3072 like the other three variants; using 8,192 with the prompt cap left off was an oversight. The effect should be negligible: only 6 of 8,888 prompts exceed 3,072 tokens (longest 3,317), so for the other 8,882 the run is identical to a 7,168 / 3,072 setup. For those 6, the prompt is left untruncated, the 4,096 max completion length is still respected, and the sequence runs only slightly past 7,168 (at most 3,317 + 4,096 = 7,413, well under 8,192). To train an exactly matched version yourself, change --max-seq-length 8192 to 7168 and add --max-prompt-tokens 3072.
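
The figures in the hyperparameter table and in the note above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: it assumes a single device (so per-device batch 1 with gradient accumulation 4 gives 4 problems per optimizer step) and that the 8,888 prompts mentioned above make up the full training set, which is consistent with --max-steps 2222 but not stated outright.

```python
# Figures copied from the hyperparameter table and the length-budget note.
lora_r, lora_alpha = 16, 32
generations_per_prompt = 16
problems_per_step = 4          # per-device batch 1 x gradient accumulation 4 (one device assumed)
num_prompts = 8_888            # assumed to be the full training set
max_seq_len, max_completion, longest_prompt = 8_192, 4_096, 3_317

lora_scaling = lora_alpha / lora_r                                  # 2.0
completions_per_step = problems_per_step * generations_per_prompt   # 64
steps_per_epoch = num_prompts / problems_per_step                   # 2222.0
epoch_at_release = 2_000 / steps_per_epoch                          # about 0.900
longest_sequence = longest_prompt + max_completion                  # 7413

# The untruncated longest prompt still leaves room for a full-length completion.
assert longest_sequence <= max_seq_len

print(lora_scaling, completions_per_step, steps_per_epoch,
      round(epoch_at_release, 3), longest_sequence)
```

Under these assumptions, one pass over the data takes 2,222 steps, so the released step-2000 checkpoint sits at roughly 0.9 of an epoch, matching the table.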

Files

  • adapter_model.safetensors, adapter_config.json — the LoRA adapter (load with PEFT on the base model)
  • tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json — tokenizer
  • chat_template.jinja — Mathstral's [INST] template (see the out-of-distribution note above)

Citation

```bibtex
@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}
```

License

Apache-2.0. The base model (Mathstral-7B-v0.1) and the draft model (Qwen2.5-Math-1.5B) are both Apache-2.0.

Model provider

hugruby

Model tree

  • Base: mistralai/Mathstral-7B-v0.1
  • Adapter: this model

Modalities

  • Input: text
  • Output: text
