Model details
[!NOTE]
The model was trained on data derived from allenai/Dolci-Think-SFT-32B, released under the ODC-BY-1.0 license.
This model is part of a Chinese specialist trio designed to study the native reasoning gap:
Evaluation
All scores are mean accuracy (%) on the Chinese version of each benchmark, with sample standard deviation across runs. AIME 24/25 is averaged over 30 runs; the others over 10 runs, using the recommended generation parameters.
| Model | MGSM-Rev2 | Global-MMLU-Lite | GPQA-Diamond | AIME 24/25 | HumanEvalPlus | Average |
|---|---|---|---|---|---|---|
| Qwen3-8B-ZH | 88.92 | 74.85 | 50.71 | 53.89 | 85.62 | 70.80 |
| Qwen3-8B-ZH-Swap | 88.24 | 76.42 | 52.58 | 55.17 | 85.69 | 71.62 |
| Qwen3-8B-ZH-Pivot-EN | 94.84 | 76.15 | 54.19 | 59.06 | 85.19 | 73.89 |
| Qwen3-8B-EN | 76.04 | 75.00 | 47.53 | 50.00 | 83.88 | 66.49 |
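The aggregation behind these numbers can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the per-run accuracies in `example_runs` are hypothetical, and `summarize` is a helper name introduced here. The last lines check that the Average column is consistent with an unweighted mean of the five benchmark scores, using the Qwen3-8B-ZH row from the table.

```python
from statistics import mean, stdev


def summarize(run_scores):
    """Return (mean, sample standard deviation) of per-run accuracies in %."""
    # statistics.stdev uses the n-1 denominator, i.e. the sample standard deviation.
    return mean(run_scores), stdev(run_scores)


# Hypothetical per-run accuracies for one benchmark (illustrative values only).
example_runs = [50.0, 52.0, 51.0, 49.0, 53.0]
avg, sd = summarize(example_runs)
print(f"{avg:.2f} ± {sd:.2f}")

# Per-benchmark means for Qwen3-8B-ZH, copied from the table above.
zh_scores = [88.92, 74.85, 50.71, 53.89, 85.62]
zh_average = round(mean(zh_scores), 2)
print(zh_average)
```

The same unweighted mean reproduces the Average column for the other three rows as well.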
Benchmarks used:
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "lightonai/Qwen3-8B-ZH"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "计算:24 × 17 = ?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=1.0, top_p=0.95, top_k=20, min_p=0.0)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Recommended sampling: temperature=1.0, top_p=0.95, top_k=20, min_p=0.
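With a budget of 32768 new tokens, the decoded text may contain a long reasoning trace before the final answer. The sketch below separates the two, assuming the model follows the Qwen3 convention of closing its reasoning with a `</think>` tag; this card does not state that, so treat the tag names as an assumption. `split_reasoning` is a hypothetical helper, not part of `transformers`.

```python
def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer) at the last closing tag."""
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Drop the opening tag (assumed to be "<think>") if it is present.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Illustrative string standing in for the decoded model output.
reasoning, answer = split_reasoning("<think>24 × 17 = 408</think>\n\n408")
print(answer)
```

If no closing tag is present, the helper returns an empty reasoning string and the full text as the answer, so it is safe to call on output that contains no reasoning trace.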
Citation
If you find our work helpful, please consider citing it:
@misc{lasbordes2026rethinking,
title = {Rethinking the Multilingual Reasoning Gap with Layer Swap},
author = {Lasbordes, Maxence and Chatelain, Amélie and Seddah, Djamé},
year = {2026},
eprint = {2605.26735},
archivePrefix= {arXiv},
primaryClass = {cs.CL}
}