dogukanvzr
Mergen-TR-Qwen3.5-9B
License: apache-2.0
Highlights
| Benchmark | Score | Where it lands |
|---|---|---|
| Turkish-MMLU (leaderboard bank, 6,200 q) | 74.3 | Above Llama 3.1 70B (70.4), Qwen3 14B (71.7), Gemma 2 27B (72.1); level with Gemma 3 27B (75.1) |
| TurkishMMLU (Yüksel et al. 2024) · CoT | 84.4 | Above Claude 3 Opus (81.8) and GPT-4 Turbo (79.2) |
| GSM8K-TR (grade-school math) | 87.5 | +1.2 over the base model |


The model sits in the accuracy band of models 3–7× its size. The headline result is a 9B model matching ~30B-class open models.


Mergen-TR-Qwen3.5-9B is consistently strong across the science and humanities curriculum — math, physics, geography and chemistry sit in the high-80s/90s. Turkish language & literature is the weakest single area and the main target for future work.
Evaluation methodology (full transparency)
Thinking-style models are systematically under-scored by the short-generation
protocols used in older harnesses: the reasoning chain does not fit the token
budget, and answer parsers cannot handle <think> blocks. Every number here is
produced with the open protocol below, which anyone can re-run:
- Thinking ON (`enable_thinking=True`), greedy decoding (`do_sample=False`, `repetition_penalty=1.05`).
- Generation budget: `max_new_tokens=8192` for Turkish-MMLU and `16384` for academic TurkishMMLU / GSM8K-TR. This matters: about 17% (Turkish-MMLU) and 25% (academic) of generations exceed 4,096 tokens, so a tight budget measures a lower score than the model actually achieves.
- Answer parser (thinking-aware, 3 layers): an explicit marker after `</think>` first, then an explicit marker anywhere, then a bare option letter after `</think>`. Across every run `no_answer = 0` (no generation was unparseable).
- Samples: Turkish-MMLU, section-stratified n=1,895; academic TurkishMMLU, full 900-question set; GSM8K-TR, n=256. Precision bf16.
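
The three parser layers can be sketched roughly as follows. This is an illustration, not the evaluation code itself: the card does not say what the explicit marker looks like, so the `Cevap:` / `Answer:` pattern, the A–E option range, the choice of the last match, and the names `MARKER`, `BARE_LETTER` and `parse_answer` are all assumptions.

```python
import re
from typing import Optional

# ASSUMPTION: the marker wording is not given in the card; "Cevap"/"Answer"
# followed by an option letter A-E is a hypothetical stand-in.
MARKER = re.compile(r"(?:[Cc]evap|[Aa]nswer)\s*:?\s*\(?([A-E])\b")
BARE_LETTER = re.compile(r"\b([A-E])\b")


def _last(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the last captured option letter in text, or None."""
    matches = pattern.findall(text)
    return matches[-1] if matches else None


def parse_answer(generation: str) -> Optional[str]:
    """Apply the three layers in order; None means 'no_answer'."""
    _, closed, tail = generation.rpartition("</think>")
    after_think = tail if closed else ""  # empty if the chain never closed

    # Layer 1: explicit marker after the closing </think> tag.
    # Layer 2: explicit marker anywhere in the generation.
    # Layer 3: bare option letter after </think>.
    return (
        _last(MARKER, after_think)
        or _last(MARKER, generation)
        or _last(BARE_LETTER, after_think)
    )
```

A generation that is cut off before `</think>` and contains no marker falls through all three layers and returns `None`, which is the case the generation budget above is meant to avoid.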

Comparison rows are not same-harness. Turkish-MMLU competitor scores are taken from the official leaderboard (those runs are mostly Q4 GGUF and use the board's own protocol); academic TurkishMMLU closed-model scores are from the source paper's CoT table. Rows therefore compare against the best publicly reported values, not a single shared harness.
Quickstart
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dogukanvzr/Mergen-TR-Qwen3.5-9B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

# Prompt: "Explain the differences between the kaside and the gazel in Divan literature."
messages = [{
    "role": "user",
    "content": "Divan edebiyatında kaside ile gazel arasındaki farkları açıkla.",
}]
text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,  # hybrid thinking mode
)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=8192, do_sample=False, repetition_penalty=1.05
)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
vLLM (serving)
```bash
vllm serve dogukanvzr/Mergen-TR-Qwen3.5-9B --dtype bfloat16 --max-model-len 16384
```
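Once the server is running it exposes vLLM's OpenAI-compatible HTTP API. The client below is a sketch that is not part of the model card: it assumes the default `http://localhost:8000` address, and it assumes the installed vLLM version accepts `chat_template_kwargs` and `repetition_penalty` as extra request fields (both are vLLM extensions to the OpenAI schema). `build_chat_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "dogukanvzr/Mergen-TR-Qwen3.5-9B"


def build_chat_payload(prompt, thinking=True, max_tokens=8192):
    """Build a chat-completions request body mirroring the card's settings."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,          # greedy decoding
        "repetition_penalty": 1.05,  # ASSUMPTION: vLLM-specific extra field
        # ASSUMPTION: forwarded by vLLM to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def chat(prompt, base_url="http://localhost:8000", **kwargs):
    """POST the payload to a running vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Whether the reasoning chain arrives inside `content` or in a separate field depends on how the server's reasoning parser is configured, so it is worth inspecting one raw response before relying on either. Note that the prompt plus `max_tokens` has to fit inside the `--max-model-len 16384` window.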
Tips
- For knowledge / reasoning tasks use `enable_thinking=True` with `max_new_tokens ≥ 8192`; a tight budget truncates the reasoning chain and lowers quality.
- For short, direct answers you can set `enable_thinking=False`.
- If you parse answers programmatically, read after the `</think>` tag.
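
For the last tip, a small helper along these lines separates the reasoning chain from the final answer. It is a sketch under assumptions: `split_thinking` is a hypothetical name, and it assumes the decoded text still contains the literal `<think>` / `</think>` tags, which depends on the tokenizer and chat template in use.

```python
def split_thinking(generation: str):
    """Return (reasoning, answer) from a decoded generation.

    ASSUMPTION: the tags appear literally in the decoded text.
    """
    reasoning, tag, answer = generation.rpartition("</think>")
    if not tag:
        if "<think>" in generation:
            # Chain opened but never closed: most likely truncated by the
            # token budget, so there is no final answer to read.
            return generation.replace("<think>", "", 1).strip(), ""
        # No tags at all: treat the whole text as the answer.
        return "", generation.strip()
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()
```

An empty answer paired with non-empty reasoning is a useful signal that `max_new_tokens` was too small for that prompt.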
Model details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B |
| Parameters | ~9.5B |
| Languages | Turkish (primary) |
| Context length | up to 16K used in evaluation |
| Precision | bfloat16 |
| Modes | hybrid (thinking / non-thinking) |
| Post-training | supervised fine-tuning + preference optimization |
| License | Apache-2.0 |
Training data
Mergen-TR-Qwen3.5-9B was post-trained on a private Turkish corpus built for this project — ~277,000 supervised examples and ~15,000 preference pairs (SFT + preference optimization), constructed natively for Turkish (not bulk machine-translated). Approximate composition:
| Area | ~Examples |
|---|---|
| Mathematics & word problems | 64,000 |
| Knowledge / multiple-choice (incl. humanities) | 64,000 |
| Culture | 20,000 |
| Extractive QA | 15,000 |
| Classification | 13,000 |
| Grammar correction | 13,000 |
| Translation | 12,000 |
| Natural-language inference | 12,000 |
| Summarization | 11,000 |
| Commonsense (HellaSwag-style) | 10,000 |
| Instruction-following, NER / POS & others | 43,000 |
Quality and factuality filtering is applied throughout. The corpus and its construction pipeline are not released. Overlap with evaluation sets is prevented by hash-based decontamination covering all benchmark sources (0 leakage).
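Since the pipeline itself is not released, the following is only a generic illustration of what hash-based decontamination can look like, not the project's actual implementation. The normalization steps, the use of SHA-256, exact-match comparison, and the names `fingerprint` and `decontaminate` are all assumptions.

```python
import hashlib
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Hash a normalized form of the text so trivial edits do not hide overlap."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation
    text = " ".join(text.split())         # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def decontaminate(train_examples, benchmark_questions):
    """Keep only training examples whose fingerprint is not in any benchmark."""
    blocked = {fingerprint(q) for q in benchmark_questions}
    return [ex for ex in train_examples if fingerprint(ex) not in blocked]
```

Exact-match hashing of this kind catches verbatim and near-verbatim copies only; hashing overlapping n-grams is a common stricter variant, and the card does not say which was used.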
Limitations
- The focus is Turkish; in other languages the base model's behavior dominates.
- On multiple-choice factual recall (Turkish-MMLU), Mergen-TR-Qwen3.5-9B essentially matches its strong Qwen3.5-9B base at a fair generation budget — the leaderboard standing reflects the base's knowledge as much as the post-training. The post-training's measurable gains are concentrated in mathematics (+1.2 GSM8K-TR), chain-of-thought reasoning, and Turkish generation, not raw factual recall.
- Thinking-mode generations are long; for latency-sensitive use, prefer the non-thinking mode.
- Benchmark comparisons mix values reported by different harnesses (see Evaluation methodology); they are not strictly same-run comparisons.
References
- Turkish-MMLU leaderboard (alibayram), 6,200-question bank.
- A. Yüksel et al., TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish, 2024 (arXiv:2407.12402).
- GSM8K-TR (malhajar, v0.2).
Model provider
dogukanvzr
Model tree
- Base: Qwen/Qwen3.5-9B
- Fine-tuned: this model
Modalities
- Input: Video, Text, Image
- Output: Text