dogukanvzr

Mergen-TR-Qwen3.5-9B

README

License: apache-2.0

Highlights

| Benchmark | Score | Where it lands |
|---|---|---|
| Turkish-MMLU (leaderboard bank, 6,200 q) | 74.3 | Above Llama 3.1 70B (70.4), Qwen3 14B (71.7), Gemma 2 27B (72.1); level with Gemma 3 27B (75.1) |
| TurkishMMLU (Yüksel et al. 2024) · CoT | 84.4 | Above Claude 3 Opus (81.8) and GPT-4 Turbo (79.2) |
| GSM8K-TR (grade-school math) | 87.5 | +1.2 over the base model |

Turkish-MMLU leaderboard ranking

Accuracy vs parameter count

The model sits in the accuracy band of models 3–7× its size. The headline result is a 9B model matching ~30B-class open models.

Academic TurkishMMLU, chain-of-thought

Per-subject accuracy

Mergen-TR-Qwen3.5-9B is consistently strong across the science and humanities curriculum — math, physics, geography and chemistry sit in the high-80s/90s. Turkish language & literature is the weakest single area and the main target for future work.


Evaluation methodology (full transparency)

Thinking-style models are systematically under-scored by the short-generation protocols used in older harnesses: the reasoning chain does not fit the token budget, and answer parsers cannot handle <think> blocks. Every number here is produced with the open protocol below, which anyone can re-run:

  • Thinking ON (enable_thinking=True), greedy decoding (do_sample=False, repetition_penalty=1.05).
  • Generation budget: max_new_tokens=8192 for Turkish-MMLU and 16384 for academic TurkishMMLU / GSM8K-TR. This matters — about 17% (Turkish-MMLU) and 25% (academic) of generations exceed 4,096 tokens, so a tight budget measures a lower score than the model actually achieves.
  • Answer parser (thinking-aware, 3 layers): an explicit marker after </think> first, then an explicit marker anywhere, then a bare option letter after </think>. Across every run no_answer = 0 (no generation was unparseable).
  • Samples: Turkish-MMLU — section-stratified n=1,895; academic TurkishMMLU — full 900-question set; GSM8K-TR — n=256. Precision bf16.
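
The three-layer parser described above could be sketched roughly as follows. This is a minimal illustration, not the parser used to produce the reported scores: the marker words (`Cevap`, `Answer`), the option set A–E, and the function name `parse_answer` are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumed explicit markers ("Cevap: B", "Answer: (C)"); the real marker list is not published.
MARKER = re.compile(r"(?i:cevap|answer)\s*:?\s*\(?([A-E])\)?\b")
# A standalone option letter such as "B" or "B)".
BARE = re.compile(r"\b([A-E])\b")


def parse_answer(generation: str) -> Optional[str]:
    """Return the chosen option letter, or None if nothing parseable is found."""
    # Text after the last </think>; empty if the tag never appeared (e.g. truncation).
    _, sep, tail = generation.rpartition("</think>")
    after = tail if sep else ""

    # Layer 1: explicit marker after </think>.
    if after:
        match = MARKER.search(after)
        if match:
            return match.group(1)

    # Layer 2: explicit marker anywhere; prefer the last one the model wrote.
    markers = MARKER.findall(generation)
    if markers:
        return markers[-1]

    # Layer 3: bare option letter after </think>.
    if after:
        match = BARE.search(after)
        if match:
            return match.group(1)

    return None
```

A generation that returns `None` here corresponds to the `no_answer` count mentioned above.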

Effect of the generation budget

Comparison rows are not same-harness. Turkish-MMLU competitor scores are taken from the official leaderboard (those runs are mostly Q4 GGUF and use the board's own protocol); academic TurkishMMLU closed-model scores are from the source paper's CoT table. Rows therefore compare against the best publicly reported values, not a single shared harness.


Quickstart

Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dogukanvzr/Mergen-TR-Qwen3.5-9B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto")

# Prompt: "Explain the differences between the kaside and the gazel in Divan literature."
messages = [{"role": "user",
             "content": "Divan edebiyatında kaside ile gazel arasındaki farkları açıkla."}]
# Render the chat template as a string; enable_thinking=True selects the hybrid thinking mode.
text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True)
inputs = tok(text, return_tensors="pt").to(model.device)
# Greedy decoding with the same settings as the evaluation protocol.
out = model.generate(**inputs, max_new_tokens=8192, do_sample=False,
                     repetition_penalty=1.05)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

vLLM (serving)

bash

vllm serve dogukanvzr/Mergen-TR-Qwen3.5-9B --dtype bfloat16 --max-model-len 16384
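
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `chat` are made up for this example. Thinking mode is toggled per request through `chat_template_kwargs`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "dogukanvzr/Mergen-TR-Qwen3.5-9B"


def build_chat_request(prompt: str, thinking: bool = True, max_tokens: int = 8192) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,      # prompt + max_tokens must fit in --max-model-len
        "temperature": 0.0,            # greedy decoding
        "repetition_penalty": 1.05,    # vLLM-specific sampling parameter
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def chat(prompt: str, **kwargs) -> str:
    """Send one user message to the running server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("Türkiye'nin başkenti neresidir?", thinking=False, max_tokens=256)` would request a short non-thinking answer. Whether the returned content still contains the reasoning block depends on how the server is configured.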

Tips

  • For knowledge / reasoning tasks use enable_thinking=True with max_new_tokens ≥ 8192; a tight budget truncates the reasoning chain and lowers quality.
  • For short, direct answers you can set enable_thinking=False.
  • If you parse answers programmatically, read after the </think> tag.
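
Reading after the `</think>` tag, as the last tip suggests, can be done with a small helper like the one below. It is a sketch under assumptions: the name `split_thinking` and the `thinking_enabled` flag are hypothetical, and a missing closing tag in thinking mode is treated as a reasoning chain that was cut off by the token budget.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_thinking(generation: str, thinking_enabled: bool = True) -> Tuple[str, str]:
    """Split a decoded generation into (reasoning, answer)."""
    reasoning, sep, answer = generation.rpartition(THINK_CLOSE)
    if sep:
        # Normal case: everything after the last </think> is the final answer.
        return reasoning.replace(THINK_OPEN, "", 1).strip(), answer.strip()
    if thinking_enabled:
        # No closing tag in thinking mode: the reasoning was truncated, so no answer exists.
        return generation.replace(THINK_OPEN, "", 1).strip(), ""
    # Non-thinking mode: the whole output is the answer.
    return "", generation.strip()
```

An empty answer in thinking mode is a signal to retry with a larger `max_new_tokens` rather than to parse the reasoning text.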

Model details

| Field | Value |
|---|---|
| Base model | Qwen/Qwen3.5-9B |
| Parameters | ~9.5B |
| Languages | Turkish (primary) |
| Context length | up to 16K used in evaluation |
| Precision | bfloat16 |
| Modes | hybrid (thinking / non-thinking) |
| Post-training | supervised fine-tuning + preference optimization |
| License | Apache-2.0 |

Training data

Mergen-TR-Qwen3.5-9B was post-trained on a private Turkish corpus built for this project: ~277,000 supervised fine-tuning (SFT) examples and ~15,000 preference pairs for preference optimization. The data was constructed natively in Turkish rather than bulk machine-translated. Approximate composition:

| Area | Examples (approx.) |
|---|---|
| Mathematics & word problems | 64,000 |
| Knowledge / multiple-choice (incl. humanities) | 64,000 |
| Culture | 20,000 |
| Extractive QA | 15,000 |
| Classification | 13,000 |
| Grammar correction | 13,000 |
| Translation | 12,000 |
| Natural-language inference | 12,000 |
| Summarization | 11,000 |
| Commonsense (HellaSwag-style) | 10,000 |
| Instruction-following, NER / POS & others | 43,000 |

Quality and factuality filtering is applied throughout. The corpus and its construction pipeline are not released. Overlap with evaluation sets is prevented by hash-based decontamination covering all benchmark sources (0 leakage).
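
Since the construction pipeline is not released, the following is only an illustration of what hash-based decontamination generally looks like, not the project's actual procedure. It assumes exact matching after light normalization (Unicode NFKC, case folding, punctuation and whitespace collapsing); the function names are hypothetical, and real pipelines often add n-gram or fuzzy matching on top.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Canonicalize text so trivial formatting differences do not change the hash."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse runs of whitespace


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def decontaminate(train: Iterable[str], benchmark: Iterable[str]) -> List[str]:
    """Drop every training example whose fingerprint matches a benchmark item."""
    blocked: Set[str] = {fingerprint(item) for item in benchmark}
    return [example for example in train if fingerprint(example) not in blocked]
```

In this scheme, the number of dropped examples is the measured overlap, and a leakage count of zero means no training fingerprint appears in the benchmark set.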


Limitations

  • The focus is Turkish; in other languages the base model's behavior dominates.
  • On multiple-choice factual recall (Turkish-MMLU), Mergen-TR-Qwen3.5-9B essentially matches its strong Qwen3.5-9B base at a fair generation budget — the leaderboard standing reflects the base's knowledge as much as the post-training. The post-training's measurable gains are concentrated in mathematics (+1.2 GSM8K-TR), chain-of-thought reasoning, and Turkish generation, not raw factual recall.
  • Thinking-mode generations are long; for latency-sensitive use, prefer the non-thinking mode.
  • Benchmark comparisons mix values reported by different harnesses (see Evaluation methodology); they are not strictly same-run comparisons.

References

  • Turkish-MMLU leaderboard (alibayram), 6,200-question bank.
  • A. Yüksel et al., TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish, 2024 (arXiv:2407.12402).
  • GSM8K-TR (malhajar, v0.2).

Model provider: dogukanvzr

Model tree: Qwen/Qwen3.5-9B (base) → this model (fine-tuned)

Modalities: Video, Text, Image (input); Text (output)
