Model details
[!NOTE]
The model was trained on data derived from allenai/Dolci-Think-SFT-32B, released under the ODC-BY-1.0 license.
This model is part of a French specialist trio designed to study the native reasoning gap:
Evaluation
All scores are mean accuracy (%) on the French version of each benchmark, with sample standard deviation across runs. AIME 24/25 is averaged over 30 runs; the others over 10 runs, using the recommended generation parameters.
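As a minimal sketch of the aggregation described above, the snippet below computes the mean accuracy and the sample standard deviation (dividing by n - 1) over a set of runs. The per-run values and the helper name `summarize_runs` are hypothetical and only illustrate the arithmetic; they are not results from this evaluation.

```python
from statistics import mean, stdev


def summarize_runs(run_accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies in %."""
    # statistics.stdev is the sample standard deviation: it divides by n - 1.
    return mean(run_accuracies), stdev(run_accuracies)


# Hypothetical per-run accuracies (%) for one benchmark; not real results.
example_runs = [55.0, 56.7, 53.3, 58.3, 55.0]

avg, spread = summarize_runs(example_runs)
print(f"{avg:.2f} ± {spread:.2f}")
```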
| Model | MGSM-Rev2 | Global-MMLU-Lite | GPQA-Diamond | AIME 24/25 | HumanEvalPlus | Average |
|---|---|---|---|---|---|---|
| Qwen3-8B-FR | 92.80 | 76.45 | 53.59 | 55.67 | 83.31 | 72.36 |
| Qwen3-8B-FR-Swap | 97.40 | 76.57 | 54.55 | 59.11 | 86.06 | 74.74 |
| Qwen3-8B-FR-Pivot-EN | 94.52 | 78.37 | 54.65 | 62.78 | 84.88 | 75.04 |
| Qwen3-8B-EN | 95.72 | 77.50 | 52.53 | 61.39 | 84.19 | 74.27 |
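The Average column appears to be the unweighted mean of the five benchmark scores in each row. This is an inference from the numbers rather than something the card states, and the short check below, with scores copied from the table, is one way to confirm it.

```python
# Scores copied from the table above, in column order:
# MGSM-Rev2, Global-MMLU-Lite, GPQA-Diamond, AIME 24/25, HumanEvalPlus.
scores = {
    "Qwen3-8B-FR": [92.80, 76.45, 53.59, 55.67, 83.31],
    "Qwen3-8B-FR-Swap": [97.40, 76.57, 54.55, 59.11, 86.06],
    "Qwen3-8B-FR-Pivot-EN": [94.52, 78.37, 54.65, 62.78, 84.88],
    "Qwen3-8B-EN": [95.72, 77.50, 52.53, 61.39, 84.19],
}


def unweighted_average(values):
    """Unweighted mean, rounded to two decimals as in the table."""
    return round(sum(values) / len(values), 2)


averages = {name: unweighted_average(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```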
Benchmarks used:
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "lightonai/Qwen3-8B-FR"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Résous : 24 × 17 = ?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=1.0, top_p=0.95, top_k=20, min_p=0.0)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Recommended sampling: temperature=1.0, top_p=0.95, top_k=20, min_p=0.
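One way to keep these settings in a single place is a plain keyword-argument dictionary, sketched below. The name `SAMPLING_KWARGS` is hypothetical, and `do_sample=True` is an addition not listed above: in `transformers`, `generate()` only applies temperature, top-p, top-k, and min-p when sampling is enabled.

```python
# Recommended sampling parameters from this card, as keyword arguments.
# do_sample=True is an assumption added here so that generate() actually
# samples instead of falling back to greedy decoding.
SAMPLING_KWARGS = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
}

# Hypothetical usage with an already loaded model and tokenized prompt:
# outputs = model.generate(input_ids, max_new_tokens=32768, **SAMPLING_KWARGS)
print(SAMPLING_KWARGS)
```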
Citation
If you find our work helpful, please consider citing it:
@misc{lasbordes2026rethinking,
title = {Rethinking the Multilingual Reasoning Gap with Layer Swap},
author = {Lasbordes, Maxence and Chatelain, Amélie and Seddah, Djamé},
year = {2026},
eprint = {2605.26735},
archivePrefix= {arXiv},
primaryClass = {cs.CL}
}