Model details
> [!NOTE]
> The model was trained on data derived from allenai/Dolci-Think-SFT-32B, released under the ODC-BY-1.0 license.
This model is part of a trio of Swahili specialist models designed to study the native reasoning gap.
Evaluation
All scores are mean accuracy (%) on the Swahili version of each benchmark, with sample standard deviation across runs. AIME 24/25 is averaged over 30 runs; the others over 10 runs, using the recommended generation parameters.
| Model | MGSM-Rev2 | Global-MMLU-Lite | GPQA-Diamond | AIME 24/25 | HumanEvalPlus | Average |
|---|---|---|---|---|---|---|
| Qwen3-8B-SW | 93.16 | 61.98 | 49.39 | 47.67 | 82.69 | 66.98 |
| Qwen3-8B-SW-Swap | 96.12 | 64.10 | 49.29 | 50.33 | 85.62 | 69.09 |
| Qwen3-8B-SW-Pivot-EN | 89.68 | 66.00 | 52.73 | 59.67 | 84.50 | 70.52 |
| Qwen3-8B-EN | 35.88 | 33.88 | 36.82 | 24.78 | 58.44 | 37.96 |
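
The aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: `summarize` and `example_runs` are hypothetical names and values introduced here, while the per-benchmark scores in `table` are copied from the table. The check at the end confirms that the Average column matches the unweighted mean of the five benchmark columns to two decimals.

```python
from statistics import mean, stdev


def summarize(run_accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies in %."""
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    return mean(run_accuracies), stdev(run_accuracies)


# Hypothetical per-run accuracies for one benchmark (illustrative, not real results).
example_runs = [92.8, 93.4, 93.0, 93.6, 93.2]
run_mean, run_std = summarize(example_runs)

# Per-benchmark scores and reported averages, copied from the table.
# Column order: MGSM-Rev2, Global-MMLU-Lite, GPQA-Diamond, AIME 24/25, HumanEvalPlus.
table = {
    "Qwen3-8B-SW": ([93.16, 61.98, 49.39, 47.67, 82.69], 66.98),
    "Qwen3-8B-SW-Swap": ([96.12, 64.10, 49.29, 50.33, 85.62], 69.09),
    "Qwen3-8B-SW-Pivot-EN": ([89.68, 66.00, 52.73, 59.67, 84.50], 70.52),
    "Qwen3-8B-EN": ([35.88, 33.88, 36.82, 24.78, 58.44], 37.96),
}

# Recompute the Average column as the unweighted mean of the five benchmarks.
recomputed = {name: round(mean(scores), 2) for name, (scores, _) in table.items()}
for name, (_, reported_average) in table.items():
    assert abs(recomputed[name] - reported_average) < 0.005, name
```

Because the average is unweighted, each benchmark contributes equally regardless of how many questions it contains or how many runs it was averaged over.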
Benchmarks used: MGSM-Rev2, Global-MMLU-Lite, GPQA-Diamond, AIME 24/25, and HumanEvalPlus.
Usage
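```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lightonai/Qwen3-8B-SW"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Swahili prompt, roughly "Solve: 24 × 17 = ?"
messages = [{"role": "user", "content": "Suluhisha: 24 × 17 = ?"}]
# return_dict=True yields input_ids and attention_mask, which generate() accepts as keyword arguments.
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
# Pass do_sample=True explicitly so the sampling parameters are not ignored under greedy decoding.
outputs = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=1.0, top_p=0.95, top_k=20, min_p=0.0)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```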
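
The decoded output of a reasoning model usually contains a long reasoning trace before the final answer. The helper below is a minimal sketch for separating the two, under the assumption (not confirmed by this card) that the model follows the Qwen3 convention of wrapping its reasoning in `<think>` and `</think>` tags and that those tags survive decoding. `split_reasoning` and the `demo` string are hypothetical and introduced only for illustration.

```python
def split_reasoning(text, close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumption: the reasoning trace ends with `close_tag`. If the tag is
    absent, the whole text is treated as the answer.
    """
    reasoning, separator, answer = text.rpartition(close_tag)
    if not separator:
        # No closing tag found: nothing to strip.
        return "", text.strip()
    # Drop the opening tag, if present, and surrounding whitespace.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Hand-written demo string, not actual model output.
demo = "<think>24 × 17 = 24 × 10 + 24 × 7 = 240 + 168</think>\n408"
reasoning, answer = split_reasoning(demo)
```

If the model emits a different delimiter, pass it through the `close_tag` argument instead of editing the function.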
Recommended sampling: temperature=1.0, top_p=0.95, top_k=20, min_p=0.
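
To make the effect of these parameters concrete, here is a toy, pure-Python sketch of how temperature, top-k, top-p, and min-p narrow a next-token distribution. It is an illustration under simplifying assumptions, not the `transformers` implementation: `filter_distribution` and `toy_logits` are hypothetical, the filters are applied in the order temperature, top-k, top-p, min-p, and `temperature` is assumed to be positive.

```python
import math


def _renormalize(pairs):
    """Rescale (token, prob) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def filter_distribution(logits, temperature=1.0, top_k=20, top_p=0.95, min_p=0.0):
    """Return {token: prob} after temperature scaling and top-k / top-p / min-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())
    pairs = _renormalize([(token, math.exp(v - peak)) for token, v in scaled.items()])

    # Rank tokens from most to least likely.
    pairs.sort(key=lambda pair: pair[1], reverse=True)

    # top-k: keep only the k most likely tokens.
    pairs = _renormalize(pairs[:top_k])

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in pairs:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    kept = _renormalize(kept)

    # min-p: drop tokens below min_p times the top probability (min_p=0 keeps all).
    threshold = min_p * kept[0][1]
    kept = _renormalize([(token, p) for token, p in kept if p >= threshold])
    return dict(kept)


# Toy logits over four candidate tokens (made up for illustration).
toy_logits = {"408": 5.0, "418": 2.0, "398": 1.0, "48": -3.0}
filtered = filter_distribution(toy_logits)  # uses the recommended values as defaults
```

With the recommended settings, `temperature=1.0` leaves the logits unscaled and `min_p=0` disables the min-p filter, so only top-k and top-p actually prune candidates.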
Citation
If you find our work helpful, please cite:
```bibtex
@misc{lasbordes2026rethinking,
  title         = {Rethinking the Multilingual Reasoning Gap with Layer Swap},
  author        = {Lasbordes, Maxence and Chatelain, Amélie and Seddah, Djamé},
  year          = {2026},
  eprint        = {2605.26735},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```