License: apache-2.0

How to use
This is a LoRA adapter — load it on top of the base model.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "mistralai/Mathstral-7B-v0.1"
ADAPTER = "hugruby/mathstral-7b-mismatched-wrong-drafts"

tok = AutoTokenizer.from_pretrained(ADAPTER)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto"),
    ADAPTER,
).eval()

problem = "If $x+y=6$ and $xy=5$, find $x^2+y^2$."
gen = dict(max_new_tokens=4096, do_sample=False)

# CANONICAL: the plain draft-free prompt the model was trained and evaluated on (no [INST]).
PROMPT = (
    "Problem: " + problem + "\n\n"
    "Thinking: N/A\n\n"
    "The thinking section may contain errors. Solve the math problem step by step. "
    "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
    "Correct Solution:"
)

ids = tok(PROMPT, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, **gen)[0][ids.input_ids.shape[1]:], skip_special_tokens=True))
```
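To run more than one problem, the canonical prompt can be wrapped in a small helper. This is a minimal sketch: the function name `build_prompt` and the second problem are illustrative and not part of the released code; only the prompt text itself comes from the snippet above.

```python
def build_prompt(problem: str) -> str:
    """Wrap a problem statement in the plain, draft-free prompt format."""
    return (
        "Problem: " + problem + "\n\n"
        "Thinking: N/A\n\n"
        "The thinking section may contain errors. Solve the math problem step by step. "
        "Write your own correct solution. Put your final answer within \\boxed{}.\n\n"
        "Correct Solution:"
    )


problems = [
    "If $x+y=6$ and $xy=5$, find $x^2+y^2$.",
    "What is $2^{10}$?",  # hypothetical extra problem, for illustration only
]
prompts = [build_prompt(p) for p in problems]
print(prompts[0])
```

Each string in `prompts` can then be tokenized and passed to `model.generate` exactly as `PROMPT` is above.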
Optional: the [INST] chat format (out-of-distribution)
The shipped `chat_template.jinja` is Mathstral's original [INST] chat template. This adapter was not trained in that format, so `apply_chat_template(...)` is out-of-distribution and generally underperforms the plain prompt above. It is included only so you can A/B test both formats:
```python
ids = tok.apply_chat_template(
    [{
        "role": "user",
        "content": problem + "\n\nPlease reason step by step, and put your final answer within \\boxed{}.",
    }],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
print(tok.decode(model.generate(ids, **gen)[0][ids.shape[1]:], skip_special_tokens=True))
```
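When comparing the two prompt formats, it helps to pull the final `\boxed{...}` answer out of each completion. The sketch below is one possible way to do that; `extract_boxed` is a hypothetical helper, and the two output strings are hand-written placeholders, not real model generations. It compares answers as raw strings, which is stricter than the `mathematically_quasi_correct` check used for the training reward.

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never balanced


# Placeholder completions for illustration only (not real model output).
plain_out = "We have x^2+y^2 = (x+y)^2 - 2xy = 36 - 10, so \\boxed{26}"
chat_out = "The answer is \\boxed{26}."

same_answer = extract_boxed(plain_out) == extract_boxed(chat_out)
print(extract_boxed(plain_out), extract_boxed(chat_out), same_answer)
```

Tracking brace depth rather than using a simple regex keeps nested answers such as `\boxed{\frac{1}{2}}` intact.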
How it was trained
Trained with Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) using TRL GRPOTrainer on top of Unsloth FastLanguageModel, on the mismatched_wrong data config. The reward is the binary mathematically_quasi_correct check. The correction-bonus, copy-penalty, and corrupt-penalty terms are all set to 0, so the reward is purely binary.
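To make the effect of `scale_rewards=False` concrete, here is a minimal sketch of group-relative advantages with a binary reward. It assumes the usual reading of that setting (rewards are centered on the group mean but not divided by the group standard deviation); it is an illustration, not the TRL implementation, and the reward values are made up.

```python
def group_advantages(rewards):
    """Mean-centered advantages for one prompt's group of completions.

    No division by the group standard deviation (the scale_rewards=False case).
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Illustrative group: 16 completions for one prompt, 4 judged correct (reward 1.0).
rewards = [1.0] * 4 + [0.0] * 12
adv = group_advantages(rewards)
print(adv[0], adv[-1])

# A group where every completion gets the same reward yields all-zero advantages.
uniform = group_advantages([1.0] * 16)
print(set(uniform))
```

With a purely binary reward, correct completions in a mixed group receive a positive advantage and incorrect ones a negative advantage, while groups that are all correct or all incorrect contribute no learning signal.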
Training command:
```bash
python scripts/train.py \
  --model mistralai/Mathstral-7B-v0.1 \
  --dataset-path data/mismatched_wrong \
  --output-dir outputs/mismatched_wrong \
  --max-steps 2222 \
  --gradient-accumulation-steps 4 \
  --max-completion-length 4096 \
  --max-seq-length 7168 \
  --max-prompt-tokens 3072 \
  --learning-rate 5e-6 --lr-scheduler-type constant \
  --beta 0 \
  --correction-bonus 0.0 --copy-penalty 0.0 --corrupt-penalty 0.0 \
  --adam-beta2 0.99 \
  --save-steps 50 --gpu-mem-util 0.5
```
| Hyperparameter | Value |
|---|---|
| Base model | mistralai/Mathstral-7B-v0.1 |
| Method | Dr. GRPO (loss_type=dr_grpo, scale_rewards=False) |
| LoRA rank / alpha | r = 16, α = 32 → scaling γ = α/r = 2 |
| LoRA targets / dropout | q,k,v,o,gate,up,down (7 projections) / 0.0 |
| KL coefficient β | 0 |
| Reward bonuses | correction 0, copy-penalty 0, corrupt-penalty 0 |
| Generations per prompt | 16 |
| Per-device batch | 1 |
| Gradient accumulation | 4 → 4 problems × 16 = 64 completions/step |
| Learning rate | 5e-6, constant schedule |
| Adam β₂ | 0.99 |
| Max completion length | 4096 |
| Max sequence length | 7168 |
| Max prompt tokens | 3072 |
| Max steps | 2222 |
| Released checkpoint | global step 2000 (epoch 0.900) |
| Random seed | 42 |
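The derived quantities in the table can be re-checked with a few lines of arithmetic. The totals at the released checkpoint are computed here from the table values and are not stated in the original; they assume every step processed a full batch of 4 problems.

```python
# Values copied from the hyperparameter table.
lora_r, lora_alpha = 16, 32
generations_per_prompt = 16
problems_per_step = 4          # per-device batch 1 x gradient accumulation 4
released_step = 2000

lora_scaling = lora_alpha / lora_r
completions_per_step = problems_per_step * generations_per_prompt

# Derived totals (assumption: every step used a full batch).
problems_seen = problems_per_step * released_step
completions_seen = completions_per_step * released_step

print(lora_scaling, completions_per_step, problems_seen, completions_seen)
```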
Files
- adapter_model.safetensors, adapter_config.json: the LoRA adapter (load with PEFT on the base model)
- tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json: tokenizer
- chat_template.jinja: Mathstral's [INST] template (see the out-of-distribution note above)
Citation
```bibtex
@article{deng2026mismatched,
  title   = {Weak-to-Strong Elicitation via Mismatched Wrong Drafts},
  author  = {Deng, Wei},
  journal = {arXiv preprint arXiv:2605.17314},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.17314}
}
```
License
Apache-2.0. The base model (Mathstral-7B-v0.1) and the draft model (Qwen2.5-Math-1.5B) are both Apache-2.0.