HossamRizk

Temsah-TTS

README

License: apache-2.0

Model details

  • Base model: unsloth/Spark-TTS-0.5B (Qwen2.5-0.5B LLM + BiCodec)
  • Task: Text-to-speech (Egyptian Arabic, single speaker)
  • Fine-tune type: Full fine-tune of the LLM only (BiCodec frozen)
  • Language: Arabic, Egyptian dialect (not MSA)
  • Output: 16 kHz mono waveform
  • Cloning: Zero-shot; speaker timbre comes from a reference clip at inference time

Training

  • Framework: Unsloth + TRL SFTTrainer, full fine-tuning in float32.
  • Data: single-speaker Egyptian Arabic, filtered to clips of 1–45 s, giving 52,495 clips (51,682 train / 813 validation). The split is made by source video so that no video contributes clips to both sets (leakage-safe).
  • Text normalization: diacritics (tashkeel) stripped, numbers verbalized, non-speech / laugh / sigh clips dropped, vocal pause/breath markers stripped. The same cleaning must be applied to inference text for in-distribution results.
  • Hyperparameters: 2 epochs (3,232 steps), effective batch size 32 (per_device=4 × grad_accum=8), learning rate 1e-5, max_seq_length=3072, group_by_length=True, linear schedule, adamw_8bit.
  • Result: validation loss tracks training loss throughout (≈5.17 at the end, still decreasing) — no overfitting observed.
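
The reported step count can be cross-checked against the dataset size and batch settings. The sketch below assumes a single device and that the final partial batch of each epoch still counts as one optimizer step; neither assumption is stated in this card, and `optimizer_steps` is a hypothetical helper, not part of Unsloth or TRL.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int, num_devices: int = 1) -> int:
    """Estimate optimizer steps, rounding the last partial batch up (assumption)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


train_clips, val_clips = 51_682, 813
total_clips = train_clips + val_clips  # should match the 52,495 clips reported
steps = optimizer_steps(train_clips, per_device_batch=4, grad_accum=8, epochs=2)
print(total_clips, steps)
```

Under these assumptions the estimate agrees with the 3,232 steps listed in the hyperparameters.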

Usage

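This repository contains only the fine-tuned LLM stage of a Spark-TTS pipeline. To synthesize speech, assemble a full Spark model directory (BiCodec, wav2vec2, and config.yaml from the base model, plus this LLM), then run inference with the Spark-TTS code.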

bash

git clone https://github.com/SparkAudio/Spark-TTS
pip install -r Spark-TTS/requirements.txt soundfile

python

import os
import shutil
import sys

import soundfile as sf
import torch
from huggingface_hub import snapshot_download

# 1. Base model: provides BiCodec, wav2vec2, and config.yaml
base = snapshot_download("unsloth/Spark-TTS-0.5B")

# 2. This fine-tuned LLM
llm = snapshot_download("HossamRizk/Temsah-TTS")

# 3. Assemble a Spark-servable model dir (base layout, our LLM swapped in)
root = "Temsah-TTS-spark"
shutil.rmtree(root, ignore_errors=True)
os.makedirs(root)
shutil.copy(f"{base}/config.yaml", f"{root}/config.yaml")
shutil.copytree(f"{base}/BiCodec", f"{root}/BiCodec")
shutil.copytree(f"{base}/wav2vec2-large-xlsr-53", f"{root}/wav2vec2-large-xlsr-53")
shutil.copytree(llm, f"{root}/LLM")

# 4. Synthesize (needs a reference clip of the target speaker for the voice)
sys.path.append("Spark-TTS")  # make the cloned repo's `cli` package importable
from cli.SparkTTS import SparkTTS

tts = SparkTTS(root, device="cuda:0" if torch.cuda.is_available() else "cpu")
wav = tts.inference(
    "النهارده هنتكلم عن موضوع مهم جدا يخص كل واحد فينا",
    prompt_speech_path="reference_clip.wav",  # a few seconds of the target speaker
)
sf.write("output.wav", wav, 16000)  # 16 kHz mono output
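
Before loading the model, it can help to confirm that the assembled directory has the four entries copied in step 3. This is a minimal sketch using only the standard library; `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names introduced here, and the check only looks at top-level names, not at the files inside each folder.

```python
from pathlib import Path

# Top-level entries that step 3 of the usage script copies into the model dir.
EXPECTED_ENTRIES = ("config.yaml", "BiCodec", "wav2vec2-large-xlsr-53", "LLM")


def missing_entries(root: str) -> list[str]:
    """Return the expected entries that are absent from `root`, in order."""
    base = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (base / name).exists()]
```

For example, calling `missing_entries("Temsah-TTS-spark")` after assembly should return an empty list; a non-empty result names what still needs to be copied.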

Tip: clean the input text the same way training did (strip diacritics, verbalize numbers, drop non-speech tags) so it matches the training distribution.
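
Part of that cleaning can be sketched with the standard library. The exact rules used in training are not published in this card, so everything below is an assumption: the diacritic set is the common tashkeel range plus the superscript alef, and the square-bracket tag syntax (for example `[breath]`) is hypothetical. Number verbalization is not covered and would need a separate step.

```python
import re

# Arabic tashkeel: fathatan through sukun (U+064B-U+0652) plus superscript alef (U+0670).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")
# Hypothetical marker syntax: non-speech tags written in square brackets.
BRACKET_TAG = re.compile(r"\[[^\]]*\]")


def clean_text(text: str) -> str:
    """Strip diacritics and bracketed tags, then collapse runs of whitespace."""
    text = TASHKEEL.sub("", text)
    text = BRACKET_TAG.sub(" ", text)
    return " ".join(text.split())
```

The cleaned string would then be passed to `tts.inference` in place of the raw input.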

Limitations & intended use

  • Single voice. Designed to reproduce one speaker. The reference clip supplies the timbre; using a different speaker's clip will not sound like the trained voice.
  • Egyptian dialect. Trained on Egyptian Arabic; MSA or other dialects are out of scope.
  • Audio band. The source audio is 16 kHz and band-limited (web sources), so the output is clear but not studio-bright.
  • Ethical use / consent. This is a clone of a specific person's voice. Only use it with the speaker's consent and in line with the source dataset's terms. Do not use it to impersonate, deceive, or generate misleading content.

License

Released under apache-2.0 (inherited from the base model). Review this against the source dataset's terms and the speaker's wishes before redistribution — change if needed.
