dogukanvzr/Mergen-TR-Qwen3.5-9B API & Inference Endpoint

Highlights

Benchmark	Score	Where it lands
Turkish-MMLU (leaderboard bank, 6,200 q)	74.3	Above Llama 3.1 70B (70.4), Qwen3 14B (71.7), Gemma 2 27B (72.1); within 0.8 points of Gemma 3 27B (75.1)
TurkishMMLU (Yüksel et al. 2024) · CoT	84.4	Above Claude 3 Opus (81.8) and GPT-4 Turbo (79.2)
GSM8K-TR (grade-school math)	87.5	+1.2 over the base model

Turkish-MMLU leaderboard ranking

Accuracy vs parameter count

The model sits on the accuracy band of models 3–7× its size. A 9B model matching ~30B-class open models is the headline result.

Academic TurkishMMLU, chain-of-thought

Per-subject accuracy

Mergen-TR-Qwen3.5-9B is consistently strong across the science and humanities curriculum — math, physics, geography and chemistry sit in the high-80s/90s. Turkish language & literature is the weakest single area and the main target for future work.

Evaluation methodology (full transparency)

Thinking-style models are systematically under-scored by the short-generation protocols used in older harnesses: the reasoning chain does not fit the token budget, and answer parsers cannot handle <think> blocks. Every number here is produced with the open protocol below, which anyone can re-run:

Thinking ON (enable_thinking=True), greedy decoding (do_sample=False, repetition_penalty=1.05).
Generation budget: max_new_tokens=8192 for Turkish-MMLU and 16384 for academic TurkishMMLU / GSM8K-TR. This matters — about 17% (Turkish-MMLU) and 25% (academic) of generations exceed 4,096 tokens, so a tight budget measures a lower score than the model actually achieves.
Answer parser (thinking-aware, 3 layers): an explicit marker after </think> first, then an explicit marker anywhere, then a bare option letter after </think>. Across every run no_answer = 0 (no generation was unparseable).
Samples: Turkish-MMLU — section-stratified n=1,895; academic TurkishMMLU — full 900-question set; GSM8K-TR — n=256. Precision bf16.
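
The three parser layers described above can be sketched as follows. This is a minimal illustration, not the evaluation harness itself: the card does not give the exact marker strings, so the marker pattern (`Cevap` or `Answer` followed by a letter), the option set A to E, and the function name `parse_choice` are all assumptions.

```python
import re
from typing import Optional

THINK_END = "</think>"
# Hypothetical explicit marker, e.g. "Cevap: B", "Answer: (B)", "Cevap: **B**".
# The keyword is case-insensitive; the option letter must be uppercase.
MARKER = re.compile(r"(?i:cevap|answer)[\s:=*(]*([A-E])\b")
# A bare option letter standing on its own, e.g. "B" or "B)".
BARE = re.compile(r"\b([A-E])\b")


def parse_choice(generation: str) -> Optional[str]:
    """Return the chosen option letter, or None if nothing is parseable."""
    _, sep, tail = generation.partition(THINK_END)
    after_think = tail if sep else ""   # empty when the tag never appeared

    # Layer 1: explicit marker after the closing </think> tag.
    hits = MARKER.findall(after_think)
    if hits:
        return hits[-1]

    # Layer 2: explicit marker anywhere in the generation.
    hits = MARKER.findall(generation)
    if hits:
        return hits[-1]

    # Layer 3: bare option letter after </think>.
    hit = BARE.search(after_think)
    return hit.group(1) if hit else None
```

A generation that returns `None` here would count toward `no_answer`; a generation cut off before `</think>` can only be rescued by layer 2.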

Effect of the generation budget

Comparison rows are not same-harness. Turkish-MMLU competitor scores are taken from the official leaderboard (those runs are mostly Q4 GGUF and use the board's own protocol); academic TurkishMMLU closed-model scores are from the source paper's CoT table. Rows therefore compare against the best publicly reported values, not a single shared harness.

Quickstart

Transformers

python
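from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dogukanvzr/Mergen-TR-Qwen3.5-9B"
tok = AutoTokenizer.from_pretrained(model_id)
# Load in bf16 (the evaluation precision); device_map="auto" places the
# weights on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto")

# Prompt (Turkish): "Explain the differences between kaside and gazel
# in Divan literature."
messages = [{"role": "user",
             "content": "Divan edebiyatında kaside ile gazel arasındaki farkları açıkla."}]
# Render the chat template to a prompt string with thinking enabled.
text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True)               # hybrid thinking mode
inputs = tok(text, return_tensors="pt").to(model.device)
# Greedy decoding with the same settings as the evaluation protocol.
out = model.generate(**inputs, max_new_tokens=8192, do_sample=False,
                     repetition_penalty=1.05)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))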

vLLM (serving)

bash
vllm serve dogukanvzr/Mergen-TR-Qwen3.5-9B --dtype bfloat16 --max-model-len 16384
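
Once the server is up, it exposes vLLM's OpenAI-compatible HTTP API. The sketch below builds and sends one chat request using only the standard library. It assumes the default address `http://localhost:8000`, and it assumes that thinking mode can be toggled per request through `chat_template_kwargs`, the mechanism vLLM offers for passing arguments to a chat template; whether this model's template honours it has not been verified here. `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumed default `vllm serve` address
MODEL_ID = "dogukanvzr/Mergen-TR-Qwen3.5-9B"


def build_chat_request(prompt: str, thinking: bool = True,
                       max_tokens: int = 8192) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.0,              # greedy, as in the evaluation
        "repetition_penalty": 1.05,      # vLLM-specific sampling extension
        # Assumption: vLLM forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def ask(prompt: str, **kwargs) -> str:
    """Send one chat request and return the assistant message text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `ask(...)` requires the server started by the command above to be running; the prompt plus `max_tokens` must fit inside the `--max-model-len` window.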

Tips

For knowledge / reasoning tasks use enable_thinking=True with max_new_tokens ≥ 8192; a tight budget truncates the reasoning chain and lowers quality.
For short, direct answers you can set enable_thinking=False.
If you parse answers programmatically, read after the </think> tag.
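
For the last tip, a small helper can separate the reasoning chain from the final answer. This is a sketch that assumes the decoded text contains a literal `</think>` tag; `split_thinking` is a hypothetical name. When thinking is enabled and the tag is missing, the generation most likely ran out of budget before finishing, so the helper reports no answer rather than guessing.

```python
from typing import Optional, Tuple


def split_thinking(text: str) -> Tuple[str, Optional[str]]:
    """Split a generation into (reasoning, answer).

    The answer is None when no closing </think> tag is present.
    """
    head, sep, tail = text.rpartition("</think>")
    if not sep:
        # No closing tag: everything is (possibly truncated) reasoning.
        return text.replace("<think>", "").strip(), None
    reasoning = head.replace("<think>", "").strip()
    return reasoning, tail.strip()
```

The opening `<think>` tag is treated as optional, since some chat templates place it in the prompt rather than in the generated text.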

Model details


Base model	Qwen/Qwen3.5-9B
Parameters	~9.5B
Languages	Turkish (primary)
Context length	up to 16K used in evaluation
Precision	bfloat16
Modes	hybrid — thinking / non-thinking
Post-training	supervised fine-tuning + preference optimization
License	Apache-2.0

Training data

Mergen-TR-Qwen3.5-9B was post-trained on a private Turkish corpus built for this project — ~277,000 supervised examples and ~15,000 preference pairs (SFT + preference optimization), constructed natively for Turkish (not bulk machine-translated). Approximate composition:

Area	Examples (approx.)
Mathematics & word problems	64,000
Knowledge / multiple-choice (incl. humanities)	64,000
Culture	20,000
Extractive QA	15,000
Classification	13,000
Grammar correction	13,000
Translation	12,000
Natural-language inference	12,000
Summarization	11,000
Commonsense (HellaSwag-style)	10,000
Instruction-following, NER / POS & others	43,000

Quality and factuality filtering is applied throughout. The corpus and its construction pipeline are not released. Overlap with evaluation sets is prevented by hash-based decontamination covering all benchmark sources (0 leakage).
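
Because the construction pipeline is not released, the following is only a generic sketch of what hash-based decontamination commonly looks like, not the project's implementation. The normalization rules, the choice of SHA-256, and the names `normalize`, `fingerprint`, and `decontaminate` are all assumptions.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Canonical form: NFKC, casefolded, punctuation dropped, single spaces."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation becomes whitespace
    return " ".join(text.split())           # collapse runs of whitespace


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def decontaminate(train: Iterable[str], benchmark: Iterable[str]) -> List[str]:
    """Drop every training example whose fingerprint matches a benchmark item."""
    blocked: Set[str] = {fingerprint(q) for q in benchmark}
    return [ex for ex in train if fingerprint(ex) not in blocked]
```

Exact-hash matching of this kind only catches items that are identical after normalization; paraphrased or partially overlapping questions would need an n-gram or similarity check on top.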

Limitations

The focus is Turkish; in other languages the base model's behavior dominates.
On multiple-choice factual recall (Turkish-MMLU), Mergen-TR-Qwen3.5-9B essentially matches its strong Qwen3.5-9B base at a fair generation budget — the leaderboard standing reflects the base's knowledge as much as the post-training. The post-training's measurable gains are concentrated in mathematics (+1.2 GSM8K-TR), chain-of-thought reasoning, and Turkish generation, not raw factual recall.
Thinking-mode generations are long; for latency-sensitive use, prefer the non-thinking mode.
Benchmark comparisons mix values reported by different harnesses (see Evaluation methodology); they are not strictly same-run comparisons.

References

Turkish-MMLU leaderboard (alibayram), 6,200-question bank.
A. Yüksel et al., TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish, 2024 (arXiv:2407.12402).
GSM8K-TR (malhajar, v0.2).
