AbDhumal/orpheus-3b-turkish-tts-v2

Results & MLflow

Post-training benchmark: pretrained baseline vs. the merged `final/` checkpoint · transcripts from Whisper-small ASR · generation with temperature=0.3, top_p=0.9, repetition_penalty=1.15.

Metric	Baseline	Finetuned	Δ (Finetuned − Baseline)
WER mean	1.690	0.879	−0.811
CER mean	1.317	0.398	−0.919
RTF mean	2.0	2.3	—
eval_loss (training)	9.50 →	4.35	@ step 9,400

MLflow: experiment orpheus-turkish-tts · train 6804b44335f347849f26da2736aa73df · eval-v3 f10b12f80a014ba88bf77cf874789dad

Chart	What it shows
dashboard	4-panel: loss, in-training WER/CER, post-train means, CER Δ
eval_wer_cer_bars	Per-sentence WER & CER (10 phrases)
eval_cer_delta	CER improvement per sentence
training_loss	train/eval loss curve
wer_cer_progress	In-training mean WER/CER (4 airline prompts)
wer_per_prompt	Per-prompt WER during training
cer_per_prompt	Per-prompt CER during training

Per-sentence benchmark (eval/eval_results.json):

Phrase	B-WER	F-WER	B-CER	F-CER	Δ CER (B − F)
welcome	1.00	0.67	0.88	0.42	+0.46
directions	1.00	0.75	0.93	0.60	+0.33
news_intro	1.00	0.88	0.89	0.70	+0.19
emergency	1.00	1.00	0.93	0.74	+0.20
weather	1.00	1.00	1.00	0.87	+0.13
tech	1.00	1.00	0.92	0.87	+0.05
farewell	1.00	1.00	0.90	0.88	+0.02
flight_announce	1.00	1.00	0.90	0.90	0.00
safety	1.33	1.00	0.81	0.90	−0.09
question	1.75	1.00	0.84	1.00	−0.16

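WER and CER are proxies computed from Whisper transcripts, so listen to the audio samples below before drawing conclusions. The best in-training scores on the 4 airline prompts (steps 6k–8.4k) were welcome CER 0.042 and flight_announce WER 0.29. These were measured under a different protocol from the post-train table and are not directly comparable to it.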
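
Both metrics are edit distances divided by the reference length, so extra words or characters in the ASR transcript can push them above 1.0. This is consistent with baseline WER values such as 1.33 and 1.75 in the table. The sketch below shows the usual definition. The benchmark's own scoring code and text normalisation are not reproduced here, so the function names and the whitespace tokenisation are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference item
                cur[j - 1] + 1,          # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes plain whitespace tokenisation."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over the raw strings."""
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution plus two insertions against a two-word reference.
print(wer("hoş geldiniz", "hoş gel diniz lütfen"))  # 1.5
```

A transcript that splits or hallucinates words is penalised once per inserted word, which is why a short reference phrase can score well above 1.0.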

Audio samples

8 curated phrases on HF (ΔCER ≥ 5pp, finetuned CER ≤ 0.85, duration ≥ 1.5s). Excluded: safety, question. Full benchmark: eval/eval_results.json.
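
The three curation thresholds can be read as a simple filter. The sketch below assumes that "5pp" means an absolute CER improvement of 0.05 computed as baseline minus finetuned. The record keys (`b_cer`, `f_cer`, `duration_s`) are hypothetical and may not match the schema of `eval/samples_manifest.json`.

```python
def is_curated(sample, min_delta=0.05, max_f_cer=0.85, min_duration_s=1.5):
    """Return True when a sample passes all three curation thresholds."""
    delta_cer = sample["b_cer"] - sample["f_cer"]  # positive = finetuned is better
    return (
        delta_cer >= min_delta
        and sample["f_cer"] <= max_f_cer
        and sample["duration_s"] >= min_duration_s
    )


# Toy records for illustration only; these are not benchmark values.
toy = [
    {"name": "a", "b_cer": 0.90, "f_cer": 0.30, "duration_s": 2.4},  # passes
    {"name": "b", "b_cer": 0.90, "f_cer": 0.88, "duration_s": 2.0},  # small gain, high CER
    {"name": "c", "b_cer": 0.80, "f_cer": 0.40, "duration_s": 1.0},  # too short
]
kept = [s["name"] for s in toy if is_curated(s)]
print(kept)  # ['a']
```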

B = baseline · F = finetuned

Welcome: istanbul'a hoş geldiniz. · F-CER 0.21

Flight: sayın yolcularımız, uçuşumuz yaklaşık iki saat sürecektir. · F-CER 0.24

Directions: düz gidin, sonra sağa dönün ve köprüyü geçin. · F-CER 0.27

News intro · Farewell · Weather · Tech · Emergency — see eval/samples_manifest.json for all 8 with metrics.

Quick Start

bash
pip install torch transformers peft soundfile librosa snac

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

MODEL = "AbDhumal/orpheus-3b-turkish-tts-v2"
V = 128_256
TOK_SOH, TOK_EOH, TOK_SOA, TOK_SOS, TOK_EOA = V+3, V+4, V+5, V+1, V+6
CODE_OFFSET, N_PER_FRAME = V+10, 7

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval().cuda()
snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()

text = "istanbul'a hoş geldiniz."
ids = tokenizer.encode(text, add_special_tokens=False) + [V+9]  # V+9: end-of-text marker (TOK_EOT in the training snippet)
prompt = [TOK_SOH] + ids + [TOK_EOH, TOK_SOA, TOK_SOS]
out = model.generate(torch.tensor([prompt]).cuda(), max_new_tokens=1500, min_new_tokens=80,
    do_sample=True, temperature=0.3, top_p=0.9, repetition_penalty=1.15, eos_token_id=TOK_EOA)
# Decode SNAC tokens → 24 kHz WAV (see repo scripts/evaluate_orpheus.py)
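
The decoding step is left to the repo script. The sketch below covers only the de-interleaving part: it turns the flat stream of generated ids into the three SNAC code layers. It assumes the 7-token frame layout used by upstream Orpheus checkpoints (one coarse code, two mid codes, four fine codes per frame, with position `p` shifted by `p * 4096`). That layout is not confirmed for this checkpoint, so `scripts/evaluate_orpheus.py` remains the reference. `split_snac_layers` and `CODEBOOK_SIZE` are names introduced here for illustration.

```python
CODE_OFFSET = 128_256 + 10  # same value as CODE_OFFSET in the Quick Start
N_PER_FRAME = 7
CODEBOOK_SIZE = 4096        # assumption: per-position shift used by upstream Orpheus


def split_snac_layers(generated_ids):
    """Split flat generated token ids into three SNAC code layers."""
    # Keep audio-code tokens only; control tokens such as EOA sit below the offset.
    codes = [t - CODE_OFFSET for t in generated_ids if t >= CODE_OFFSET]
    n_frames = len(codes) // N_PER_FRAME  # a trailing partial frame is dropped
    layer1, layer2, layer3 = [], [], []
    for f in range(n_frames):
        frame = codes[f * N_PER_FRAME:(f + 1) * N_PER_FRAME]
        # Undo the per-position shift so every code is a plain codebook index.
        c = [frame[p] - p * CODEBOOK_SIZE for p in range(N_PER_FRAME)]
        layer1.append(c[0])                      # coarse: 1 code per frame
        layer2.extend([c[1], c[4]])              # mid: 2 codes per frame
        layer3.extend([c[2], c[3], c[5], c[6]])  # fine: 4 codes per frame
    return layer1, layer2, layer3
```

With the Quick Start objects, the generated ids would be `out[0, len(prompt):].tolist()`. Each returned list can then be wrapped as a `torch.long` tensor of shape `(1, n)` on the codec's device and the three tensors passed as a list to `snac.decode(...)`. The resulting waveform can be saved at 24 kHz with `soundfile.write`. Codes outside the range `0..4095` after the shift indicate a misaligned generation and should be discarded before decoding.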

Specs & training

	Model		Training
Arch	Llama-3 3B + SNAC head · ~185M LoRA params	LR	2e-5
LoRA	r=32, α=64, dropout 0.05	Epochs / steps	8 / 9,504
Codec	SNAC 24 kHz · 7 tok/frame	Batch (effective)	2 × 2 GPU × accum 4 = 16
Data	20k WAV+transcript · max seq 4,096	Precision	bfloat16 + FA2
Loss	Audio tokens only (labels masked up to and including `<|start_of_speech|>`)	Runtime	Kubeflow TrainJob · 2×A100

python
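# Prompt: [SOH] + text_tokens + [EOH, SOA, SOS] → generate SNAC until EOA
TOK_EOT = V + 9  # end-of-text marker, the same id the Quick Start appends
ids = tokenizer.encode(text, add_special_tokens=False) + [TOK_EOT]
prompt = [TOK_SOH] + ids + [TOK_EOH, TOK_SOA, TOK_SOS]
# Training sequence: prompt, then target audio tokens, then EOA (assumed layout).
# `audio_tokens` is a placeholder for the SNAC-encoded target; see the repo scripts.
input_ids = prompt + audio_tokens + [TOK_EOA]
sos_idx = input_ids.index(TOK_SOS)
labels = [-100] * (sos_idx + 1) + input_ids[sos_idx + 1:]  # audio-only loss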

Reproduction & artifacts

Resource	Link
Scripts + manifests	examples/tts-finetuning/orpheus-tts
Eval data	`eval/results.json` · `eval/mlflow/metrics_export.json` · `eval/samples_manifest.json`

bash
oc kustomize examples/tts-finetuning/orpheus-tts | oc apply -f - -n <ns>
oc apply -f manifests/trainjob-orpheus-v2.yaml
oc apply -f manifests/trainjob-orpheus-eval.yaml   # after training

Limitations: PoC checkpoint · Whisper metrics are noisy on Turkish · SNAC artifacts possible · TrainJob reported Failed post-train but final/ weights are valid · Not production-certified.

License: follows unsloth/orpheus-3b-0.1-pretrained.

bibtex
@misc{orpheus-turkish-tts-v2, title={Orpheus-3B Turkish TTS (OpenShift AI PoC)},
  author={Abhijeet Dhumal}, year={2026}, howpublished={\url{https://huggingface.co/AbDhumal/orpheus-3b-turkish-tts-v2}}}
