Benchmark Scores
| Benchmark | Score | Correct / Total | Target (%) |
|---|---|---|---|
| ThaiExam (cons@8) | 56.99% | 322/565 | 85 |
| xCOPA-TH (cons@8) | 79.00% | 79/100 | 80 |
| Belebele-TH (cons@4) | 84.33% ✅ | 759/900 | 80 |
| MATH500 (cons@4) | 64.80% ✅ | 324/500 | 60 |
| AIME24 (cons@8) | 13.33% | 4/30 | 25 (route to base R1) |
Two-model routing is recommended: use this model for Thai-language tasks (Belebele, ThaiExam, OpenThaiEval, IFEval-TH) and the base DeepSeek-R1-Distill-Qwen-14B for AIME24-style hard math, where the base scores 30% on AIME24 (cons@8).
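The routing rule can be sketched as a small lookup. This is a minimal illustration, not part of the released code: the task labels and the `pick_model` helper are hypothetical, and only the two model IDs come from this card.

```python
# Model IDs as given in this card.
THAI_MODEL = "Jnx03/kanitakorn-r1d-qwen14b-v11-dare-edu"
BASE_MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"

# Hypothetical task labels that should go to the base model.
HARD_MATH_TASKS = {"aime24"}


def pick_model(task: str) -> str:
    """Return the model ID for a task label (labels are illustrative)."""
    if task.strip().lower() in HARD_MATH_TASKS:
        return BASE_MODEL
    # Everything else, including Thai-language tasks, uses the merged model.
    return THAI_MODEL
```

For example, `pick_model("AIME24")` returns the base model ID, while `pick_model("thaiexam")` returns the merged model ID.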
Recipe
merge_method: dare_linear
dtype: bfloat16
tokenizer_source: base
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
slices:
  - sources:
      - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
        layer_range: [0, 32]
        parameters: {weight: 0.75, density: 0.7}
      - model: Jnx03/kanitakorn-r1d-qwen14b-v10-merged-fp16-edu
        layer_range: [0, 32]
        parameters: {weight: 0.25, density: 0.7}
  - sources:
      - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
        layer_range: [32, 48]
        parameters: {weight: 0.4, density: 0.7}
      - model: Jnx03/kanitakorn-r1d-qwen14b-v10-merged-fp16-edu
        layer_range: [32, 48]
        parameters: {weight: 0.6, density: 0.7}
The v10-merged-fp16-edu source is R1-Distill-14B with the v10 LoRA merged into its dense weights. The v10 LoRA is a Thai SFT adapter trained on 6,209 records covering the P1-M6 curriculum, ThaiExam-style questions, and OpenThaiEval.
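To give a feel for what the weight and density parameters control, here is a toy sketch of the drop-and-rescale idea behind DARE followed by a linear combination. It is an illustration on plain Python lists under simplifying assumptions, not mergekit's implementation: the function names are hypothetical, and details such as weight normalization or per-tensor handling are not modeled.

```python
import random


def dare_delta(base, tuned, density, rng):
    """Drop each delta entry with probability 1 - density, rescale kept entries by 1/density."""
    out = []
    for b, t in zip(base, tuned):
        delta = t - b  # task vector entry relative to the base model
        keep = rng.random() < density
        out.append(delta / density if keep else 0.0)
    return out


def dare_linear(base, tuned_models, weights, density, seed=0):
    """Add the weighted, sparsified deltas of each tuned model onto the base."""
    rng = random.Random(seed)
    merged = list(base)
    for tuned, w in zip(tuned_models, weights):
        for i, d in enumerate(dare_delta(base, tuned, density, rng)):
            merged[i] += w * d
    return merged
```

In this sketch a source identical to the base has an all-zero delta, so only the fine-tuned source changes the result. With density 0.7, roughly 30% of that source's delta entries are zeroed and the rest are scaled by 1/0.7.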
Use case
Primary audience: Thai students (P1-M6, grades 1-12) and teachers.
Capabilities:
- Multiple-choice exam reasoning in Thai (ThaiExam, O-NET, TGAT, A-Level styles)
- Reading comprehension (Belebele-TH 84%)
- Math reasoning (MATH500 65%)
- Causal commonsense reasoning in Thai (xCOPA-TH 79%)
- Curriculum-aligned Q&A for Thai education topics
Inference
from vllm import LLM, SamplingParams

# Load the merged model in bfloat16 with a 4096-token context window.
llm = LLM(
    model="Jnx03/kanitakorn-r1d-qwen14b-v11-dare-edu",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)
# Draw 8 samples per prompt (e.g. for cons@8-style voting).
sampling = SamplingParams(n=8, temperature=0.6, top_p=0.95, max_tokens=2048)
prompt = "..."
outputs = llm.generate([prompt], sampling)
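With n=8 samples per prompt, a cons@8-style answer can be obtained by majority vote over the per-sample answers. The sketch below is a hedged illustration: `majority_vote` is a hypothetical helper, and it assumes the final answer (for example a choice letter) has already been extracted from each completion, since the extraction format is not specified in this card.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none."""
    # Normalize whitespace and skip samples where no answer was extracted.
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]
```

In vLLM, the sampled texts for the first prompt are available as `[o.text for o in outputs[0].outputs]`; each text would need to go through an answer extractor of your choice before being passed to `majority_vote`.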
Limitations
- ThaiExam at 56.99% remains below the campaign target of 85% and about 16 points below published Thai LLM SOTA (~73%)
- AIME24 regressed from the base model's 30% to 13.33% after the DARE-linear merge; for the hardest math, prefer the base model
- IFEval-TH not yet measured
- May produce English-mixed chain-of-thought on some inputs despite Thai-prefix prompting
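One simple way to flag English-mixed outputs is to measure the share of Thai letters in a response. This is a rough heuristic sketch, not something used in the campaign: `thai_ratio` is a hypothetical helper, and any cutoff applied to its result would be an assumption to tune on your own data.

```python
def thai_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Thai Unicode block."""
    # Only alphabetic characters are counted, so digits, punctuation,
    # whitespace, and Thai combining marks are ignored.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    thai = sum(1 for ch in letters if "\u0e00" <= ch <= "\u0e7f")
    return thai / len(letters)
```

A response whose ratio falls well below that of a typical Thai answer could be regenerated or sent for review.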
Citation
If you use this model, please cite the SCB10X recipe:
Pipatanakul, K., et al. Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging. arXiv:2502.09056, 2025.
And the campaign log: https://github.com/JNX03/kanitakorn-research
License: MIT (inherited from DeepSeek-R1-Distill-Qwen-14B)