Flix-AI

flix-swissgerman-full

License: apache-2.0

Model Description

Base model: openai/whisper-large-v3 (1.55B parameters)
Fine-tuning: Full fine-tune (all parameters trainable)
Training data: 1,367 hours of Swiss German speech from broadcast subtitles, parliamentary proceedings, YouTube, and Swiss film
Task: Swiss German speech → Standard German text (dialect-to-standard translation + transcription)
Hardware: NVIDIA DGX Spark GB10 (128 GB unified memory), single desktop workstation

Performance

Metric	Value	Notes
WER (measured)	25.60%	ASGDTS, 5,750 samples, honest evaluation
cWER (content errors only)	13.8%	Excludes style/convention differences
sWER (style component)	11.3%	Valid alternative translations penalized by WER
bWER (bias-corrected)	8.5%	Estimated true error rate
Whisper large-v3 baseline	28.56%	Zero-shot, no fine-tuning
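
To put the fine-tuned WER in relation to the zero-shot baseline, the gap can be expressed in absolute points and as a relative reduction. The snippet below is only arithmetic on the two WER values from the table; the variable names are illustrative.

```python
baseline_wer = 28.56   # Whisper large-v3, zero-shot (from the table)
finetuned_wer = 25.60  # this model (from the table)

# Absolute gap in percentage points, and the same gap relative to the baseline.
absolute_reduction = baseline_wer - finetuned_wer
relative_reduction = absolute_reduction / baseline_wer

print(f"absolute: {absolute_reduction:.2f} points")
print(f"relative: {relative_reduction:.1%}")
```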

Important Context on WER

Our WER of 25.60% should be interpreted carefully:

About 64% of evaluation samples are semantically correct (categories KORREKT, "correct", and STIL, "style") but are still penalized by WER because of transcription convention differences such as tense and reformulation style
The genuine content error rate is 13.8% cWER; bias-corrected estimation yields 8.5% bWER
Lower published WER scores (Michaud 17.5%, ZHAW 17.1%) look better than they are because of benchmark contamination; see our paper for details
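
As an illustration of why a semantically correct output can still score poorly, here is a minimal sketch of standard word error rate: word-level edit distance divided by the number of reference words. The normalization (lowercasing, whitespace splitting) is an assumption, not necessarily what the paper uses, and the two sentences are invented for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: same meaning, different tense convention.
reference = "ich bin gestern nach hause gegangen"
hypothesis = "ich ging gestern nach hause"
print(f"WER = {word_error_rate(reference, hypothesis):.1%}")
```

Both sentences mean "I went home yesterday", yet the perfect-tense reference and the preterite hypothesis differ by one substitution and one deletion over six reference words.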

Usage

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "Flix-AI/flix-swissgerman-full"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Transcribe Swiss German audio
audio_array = ...  # numpy array, 16kHz mono
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device, dtype=torch.bfloat16)

predicted_ids = model.generate(input_features, language="de", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
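
The usage code expects 16 kHz mono audio. One way to obtain it from a WAV file with only the Python standard library is sketched below. The helper name `read_wav_mono_16k` is hypothetical, the sketch assumes 16-bit PCM input, and it rejects other sample rates rather than resampling.

```python
import struct
import wave


def read_wav_mono_16k(path):
    """Read a 16-bit PCM WAV file and return mono samples as floats in [-1, 1].

    Raises ValueError if the file is not 16 kHz, because this sketch does
    not resample.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.readframes(wav.getnframes())

    if width != 2:
        raise ValueError(f"expected 16-bit PCM, got {8 * width}-bit")
    if rate != 16000:
        raise ValueError(f"expected 16 kHz audio, got {rate} Hz; resample first")

    # Unpack little-endian signed 16-bit integers (interleaved by channel).
    values = struct.unpack(f"<{len(frames) // 2}h", frames)

    # Average the channels of each frame to downmix to mono, then scale.
    return [
        sum(values[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(values), channels)
    ]
```

The returned list can stand in for `audio_array` in the usage code; converting it with `numpy.asarray(samples, dtype="float32")` first matches the numpy array that code expects.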

Training Details

Data Sources

Source	Hours	License	Content
SRF Mediathek	848h	Research use (Art. 24d URG)	Broadcast subtitles (news, entertainment, documentary)
Swiss Parliament (SPC v2)	202h	CC BY 4.0	Parliamentary speeches (Grosser Rat BE)
YouTube	151h	Research use (Art. 24d URG)	25 institutional channels (cantons, police, podcasts)
PlaySuisse	165h	Research use (Art. 24d URG)	Swiss film and series

No training data is redistributed with this model. The model was trained under the Swiss text and data mining research exception (Art. 24d URG).

Training Configuration

Parameter	Value
Trainable parameters	1,543,490,560 (100%)
Optimizer	AdamW
Learning rate	1×10⁻⁵ (cosine decay)
Warmup steps	500
Effective batch size	32
Precision	bfloat16
Gradient checkpointing	Enabled
SpecAugment	Enabled
Training time	~73 hours (2 epochs)
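
The learning-rate rows above can be read as a warmup-then-cosine schedule. The sketch below shows one common shape for it, under assumptions the table does not confirm: warmup is linear from zero, decay ends at zero, and `total_steps` is a hypothetical value because the step count is not reported.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-5, warmup_steps=500):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps=10000 is a placeholder; the source does not state the real value.
for step in (0, 250, 500, 5000, 10000):
    print(step, learning_rate(step, total_steps=10000))
```

The rate peaks at step 500 and falls to zero at the final step. The effective batch size of 32 is typically the product of the per-device batch size and the number of gradient accumulation steps; the table does not state that split.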

Dialect Coverage

The training data covers all major Swiss German dialect regions:

Dialect	Primary Source
Züridütsch	SRF, YouTube
Berndeutsch	SPC v2 (dominant), SRF
Luzernerdeutsch	SRF, YouTube
Baseldeutsch	SRF, YouTube
St. Gallerdeutsch	SRF, YouTube
Walliserdeutsch	SRF, PlaySuisse
Bündnerdeutsch	YouTube
Appenzellerdeutsch	SRF

Limitations

Proper nouns: The model may misspell names and places it hasn't encountered during training
Word order: Swiss German sentence structure sometimes differs from Standard German; the model may produce valid but differently ordered translations
Convention mismatch: Trained on broadcast subtitles (editorial style), which may differ from verbatim transcription expectations
No context: The model processes segments independently; it cannot use broader conversation context for disambiguation
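
Because segments are processed independently, longer recordings have to be split before transcription. Whisper's encoder works on windows of up to 30 seconds, so a simple approach is fixed-length slicing, sketched below. The function name is hypothetical, and cutting at fixed boundaries is an assumption made for illustration: it can split words, and it is not necessarily how the training or evaluation data was segmented.

```python
def segment_bounds(num_samples, sample_rate=16000, window_seconds=30.0):
    """Return (start, end) sample indices of consecutive, non-overlapping windows."""
    window = int(sample_rate * window_seconds)
    if window <= 0:
        raise ValueError("window must cover at least one sample")
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 65-second recording at 16 kHz yields two full windows and one 5-second remainder.
for start, end in segment_bounds(16000 * 65):
    print(start, end)
```

Each slice `audio_array[start:end]` would then be transcribed on its own, which is the situation the "No context" limitation describes: nothing from one window informs the next.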

Citation

```bibtex
@article{akeret2026whisper-swiss-german,
  title={Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6\% WER (13.8\% cWER)},
  author={Akeret, Felix},
  year={2026},
  url={https://arxiv.org/abs/2606.07608},
  eprint={2606.07608},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Acknowledgments

OpenAI for the Whisper model
FHNW/i4ds for the Swiss Parliament Corpus (SPC v2) and ASGDTS benchmark
SRF for publicly accessible broadcast content
PlaySuisse for Swiss film and series content
