HossamRizk/Temsah-TTS

Model details

Base model	`unsloth/Spark-TTS-0.5B` (Qwen2.5-0.5B LLM + BiCodec)
Task	Text-to-speech (Egyptian Arabic, single speaker)
Fine-tune type	Full fine-tune of the LLM only (BiCodec frozen)
Language	Arabic — Egyptian dialect (not MSA)
Output	16 kHz mono waveform
Cloning	Zero-shot — speaker timbre comes from a reference clip at inference time

Training

Framework: Unsloth + TRL SFTTrainer, full fine-tuning in float32.
Data: single-speaker Egyptian Arabic, filtered to 1–45 s clips → 52,495 clips (51,682 train / 813 validation), leakage-safe split by source video.
Text normalization: diacritics (tashkeel) stripped, numbers verbalized, non-speech / laugh / sigh clips dropped, vocal pause/breath markers stripped. The same cleaning must be applied to inference text for in-distribution results.
Hyperparameters: 2 epochs (3,232 steps), effective batch size 32 (per_device=4 × grad_accum=8), learning rate 1e-5, max_seq_length=3072, group_by_length=True, linear schedule, adamw_8bit.
Result: validation loss tracks training loss throughout (≈5.17 at the end, still decreasing) — no overfitting observed.
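
The step count above can be cross-checked against the dataset size and batch settings. A minimal sketch, assuming a single GPU and that a final partial accumulation window still counts as one optimizer step (the rounding behaviour is an assumption here, not something stated on this card):

```python
import math

train_clips = 51_682      # training split size from the Data line
per_device_batch = 4
grad_accum = 8
epochs = 2
num_devices = 1           # assumption: single-GPU run

# Effective batch size = per-device batch x accumulation steps x devices.
effective_batch = per_device_batch * grad_accum * num_devices

# Round up so the last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_clips / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 32 1616 3232
```

Under those assumptions the numbers agree with the reported 3,232 steps.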

Usage

This repository contains only the LLM stage of the Spark-TTS pipeline; BiCodec, wav2vec2, and config.yaml come from the base model. Assemble a full Spark model directory, then run inference with the Spark-TTS code.

```bash
git clone https://github.com/SparkAudio/Spark-TTS
pip install -r Spark-TTS/requirements.txt soundfile
```

```python
import os
import shutil
import sys

import soundfile as sf
import torch
from huggingface_hub import snapshot_download

# 1. base model -> BiCodec + wav2vec2 + config.yaml
base = snapshot_download("unsloth/Spark-TTS-0.5B")
# 2. this fine-tuned LLM
llm = snapshot_download("HossamRizk/Temsah-TTS")

# 3. assemble a Spark-servable model dir (base layout, our LLM swapped in)
root = "Temsah-TTS-spark"
shutil.rmtree(root, ignore_errors=True)  # start from a clean directory
os.makedirs(root)
shutil.copy(f"{base}/config.yaml", f"{root}/config.yaml")
shutil.copytree(f"{base}/BiCodec", f"{root}/BiCodec")
shutil.copytree(f"{base}/wav2vec2-large-xlsr-53", f"{root}/wav2vec2-large-xlsr-53")
shutil.copytree(llm, f"{root}/LLM")

# 4. synthesize (needs a reference clip of the target speaker for the voice)
sys.path.append("Spark-TTS")  # path to the cloned Spark-TTS repo
from cli.SparkTTS import SparkTTS

tts = SparkTTS(root, device="cuda:0" if torch.cuda.is_available() else "cpu")
wav = tts.inference(
    "النهارده هنتكلم عن موضوع مهم جدا يخص كل واحد فينا",
    prompt_speech_path="reference_clip.wav",  # a few seconds of the target speaker
)
sf.write("output.wav", wav, 16000)  # 16 kHz mono output
```
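
Before loading the model, it can help to confirm that the assembled directory has the four entries the assembly step copies in. A small sketch; `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names introduced here for illustration, not part of Spark-TTS:

```python
from pathlib import Path

# Entries that the assembly step copies into the model directory.
EXPECTED_ENTRIES = ("config.yaml", "BiCodec", "wav2vec2-large-xlsr-53", "LLM")


def missing_entries(model_dir):
    """Return the expected entries that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("Temsah-TTS-spark")` after assembly should return an empty list; any name it returns points at a copy step that did not complete.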

Tip: clean the input text the same way training did (strip diacritics, verbalize numbers, drop non-speech tags) so it matches the training distribution.
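
A rough sketch of that cleaning, covering diacritics and tags only. The tag patterns (`[laugh]`, `<breath>`) are an assumption, since the exact marker format used in training is not documented here, and number verbalization is left out because the original procedure is not described:

```python
import re

# Arabic tashkeel: fathatan through sukun (U+064B-U+0652) plus superscript alef (U+0670).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")
# Assumption: non-speech markers look like [laugh] or <breath>.
TAGS = re.compile(r"\[[^\]]*\]|<[^>]*>")


def clean_text(text: str) -> str:
    """Strip diacritics and bracketed tags, then collapse whitespace."""
    text = TASHKEEL.sub("", text)
    text = TAGS.sub(" ", text)
    return " ".join(text.split())
```

Passing the result of `clean_text(...)` to `tts.inference` keeps the input closer to what the model saw in training; digits would still need to be written out as words beforehand.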

Limitations & intended use

Single voice. Designed to reproduce one speaker. The reference clip supplies the timbre, so output conditioned on a different speaker's clip will not sound like the trained voice.
Egyptian dialect. Trained on Egyptian Arabic; MSA or other dialects are out of scope.
Audio band. Source audio is 16 kHz and band-limited (web sources), so output is clear but not studio-bright.
Ethical use / consent. This is a clone of a specific person's voice. Only use it with the speaker's consent and in line with the source dataset's terms. Do not use it to impersonate, deceive, or generate misleading content.

License

Released under apache-2.0 (inherited from the base model). Check this against the source dataset's terms and the speaker's wishes before redistributing.

Acknowledgements

Spark-TTS (SparkAudio) and unsloth/Spark-TTS-0.5B
Unsloth for fast fine-tuning
Source data: oddadmix/arabic-audio-collection-mohamed-khairy
