Model Description
- Base model: openai/whisper-large-v3 (1.55B parameters)
- Fine-tuning: Full fine-tune (all parameters trainable)
- Training data: 1,367 hours of Swiss German speech from broadcast subtitles, parliamentary proceedings, YouTube, and Swiss film
- Task: Swiss German speech → Standard German text (dialect-to-standard translation + transcription)
- Hardware: NVIDIA DGX Spark GB10 (128 GB unified memory), single desktop workstation
| Metric | Value | Notes |
|---|---|---|
| WER (measured) | 25.60% | ASGDTS, 5,750 samples, honest evaluation |
| cWER (content errors only) | 13.8% | Excludes style/convention differences |
| sWER (style component) | 11.3% | Valid alternative translations penalized by WER |
| bWER (bias-corrected) | 8.5% | Estimated true error rate |
| Whisper large-v3 baseline | 28.56% | Zero-shot, no fine-tuning |
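To put the fine-tuned and baseline rows in relation, the short sketch below works out the absolute and relative WER reduction from the two figures in the table. The variable names are illustrative only; the numbers are taken from the table above.

```python
# WER values copied from the results table (in percent).
baseline_wer = 28.56   # Whisper large-v3, zero-shot
finetuned_wer = 25.60  # this model, measured on ASGDTS

# Absolute reduction in percentage points.
absolute_reduction = baseline_wer - finetuned_wer

# Relative reduction as a fraction of the baseline WER.
relative_reduction = absolute_reduction / baseline_wer

print(f"absolute reduction: {absolute_reduction:.2f} points")
print(f"relative reduction: {relative_reduction:.1%}")
```

With these inputs the reduction is 2.96 percentage points, or about 10.4% relative to the baseline.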
Important Context on WER
Our WER of 25.60% should be interpreted carefully:
- ~64% of evaluation samples are semantically correct (KORREKT + STIL categories) but penalized by WER due to transcription convention differences (tense, reformulation style)
- The genuine content error rate is 13.8% cWER; bias-corrected estimation yields 8.5% bWER
- Lower published WER scores (Michaud 17.5%, ZHAW 17.1%) are affected by benchmark contamination, which makes them look better than they are; see our paper for details
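The points above rest on how WER is defined: the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for the reported numbers. The function name `word_error_rate` and the example sentences are made up for illustration, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: same meaning, different tense (Perfekt vs. Präteritum).
reference = "er ist nach Hause gegangen"
hypothesis = "er ging nach Hause"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}")
```

In this invented pair the hypothesis is a valid rendering of the same sentence, yet it needs one substitution and one deletion against a five-word reference, so WER charges it 40%. This is the kind of style penalty that the sWER component is meant to separate out.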
Usage
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "Flix-AI/flix-swissgerman-full"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

audio_array = ...  # placeholder: 1-D mono float waveform sampled at 16 kHz
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device, dtype=torch.bfloat16)  # match the model's device and dtype

predicted_ids = model.generate(input_features, language="de", task="transcribe")  # Standard German output
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
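Whisper's feature extractor works on 30-second windows and by default pads or truncates its input to that length, so a longer recording has to be split before it is passed to the processor. Below is a minimal sketch of a fixed-length split. The helper `split_into_chunks` is hypothetical and not part of this repository, and the split is deliberately naive: it does not overlap chunks and can cut in the middle of a word.

```python
from typing import List, Sequence


def split_into_chunks(
    waveform: Sequence[float],
    sample_rate: int = 16_000,
    chunk_s: float = 30.0,
) -> List[Sequence[float]]:
    """Split a 1-D waveform into consecutive chunks of at most `chunk_s` seconds."""
    chunk_len = int(sample_rate * chunk_s)
    if chunk_len <= 0:
        raise ValueError("sample_rate * chunk_s must be at least one sample")
    # Slicing works for lists and NumPy arrays alike; the last chunk may be shorter.
    return [
        waveform[start:start + chunk_len]
        for start in range(0, len(waveform), chunk_len)
    ]


# Toy example: 70 "seconds" of silence at 1 sample per second.
chunks = split_into_chunks([0.0] * 70, sample_rate=1, chunk_s=30)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be fed through the processor and `generate` call shown above, one at a time, and the decoded strings concatenated. Splitting at silences instead of fixed offsets would likely give cleaner segment boundaries.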
Training Details
Data Sources
| Source | Hours | License | Content |
|---|---|---|---|
| SRF Mediathek | 848h | Research use (Art. 24d URG) | Broadcast subtitles (news, entertainment, documentary) |
| Swiss Parliament (SPC v2) | 202h | CC BY 4.0 | Parliamentary speeches (Grosser Rat BE) |
| YouTube | 151h | Research use (Art. 24d URG) | 25 institutional channels (cantons, police, podcasts) |
| PlaySuisse | 165h | Research use (Art. 24d URG) | Swiss film and series |
No training data is redistributed with this model. The model was trained under the Swiss text and data mining research exception (Art. 24d URG).
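To make the composition of the training set easier to see, the sketch below sums the per-source hours from the table and computes each source's share. The figures are the rounded values from the table; the dictionary and variable names are illustrative.

```python
# Hours per source, as listed in the data sources table (rounded to whole hours).
hours_by_source = {
    "SRF Mediathek": 848,
    "Swiss Parliament (SPC v2)": 202,
    "YouTube": 151,
    "PlaySuisse": 165,
}

total_hours = sum(hours_by_source.values())

# Share of each source in percent, rounded to one decimal place.
share_by_source = {
    name: round(100 * hours / total_hours, 1)
    for name, hours in hours_by_source.items()
}

print(f"total: {total_hours} h")
for name, share in share_by_source.items():
    print(f"{name}: {share}%")
```

The rounded rows sum to 1,366 hours, within rounding of the 1,367 hours stated in the model description. SRF broadcast subtitles account for roughly 62% of the data.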
Training Configuration
| Parameter | Value |
|---|---|
| Trainable parameters | 1,543,490,560 (100%) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁵ (cosine decay) |
| Warmup steps | 500 |
| Effective batch size | 32 |
| Precision | bfloat16 |
| Gradient checkpointing | Enabled |
| SpecAugment | Enabled |
| Training time | ~73 hours (2 epochs) |
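The learning-rate row can be read as a warmup phase followed by cosine decay. The sketch below illustrates one common form of that schedule, assuming linear warmup and decay to zero; neither detail is stated in the table. The function `lr_at` and the total step count are hypothetical and only serve the illustration.

```python
import math

PEAK_LR = 1e-5      # from the configuration table
WARMUP_STEPS = 500  # from the configuration table


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    Assumption: the warmup is linear and the decay ends at zero.
    """
    if total_steps <= WARMUP_STEPS:
        raise ValueError("total_steps must exceed the warmup length")
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS), 1.0)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 10_500  # hypothetical step count, not the actual training length
    for step in (0, 250, 500, 5_500, 10_500):
        print(step, f"{lr_at(step, total):.2e}")
```

Under these assumptions the rate climbs to 1×10⁻⁵ at step 500, falls to half that value midway through the decay phase, and reaches zero at the final step.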
Dialect Coverage
The training data covers all major Swiss German dialect regions:
| Dialect | Primary Source |
|---|---|
| Züridütsch | SRF, YouTube |
| Berndeutsch | SPC v2 (dominant), SRF |
| Luzernerdeutsch | SRF, YouTube |
| Baseldeutsch | SRF, YouTube |
| St. Gallerdeutsch | SRF, YouTube |
| Walliserdeutsch | SRF, PlaySuisse |
| Bündnerdeutsch | YouTube |
| Appenzellerdeutsch | SRF |
Limitations
- Proper nouns: The model may misspell names and places it hasn't encountered during training
- Word order: Swiss German sentence structure sometimes differs from Standard German; the model may produce valid but differently ordered translations
- Convention mismatch: Trained on broadcast subtitles (editorial style), which may differ from verbatim transcription expectations
- No context: The model processes segments independently; it cannot use broader conversation context for disambiguation
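Part of the convention mismatch is purely orthographic and can be removed before comparing model output with a reference. The sketch below shows one possible normalization step, offered as an assumption rather than as the procedure used for the reported scores. The function `normalize_for_scoring` is hypothetical; it lowercases, drops punctuation, and maps "ß" to "ss", which is the spelling used in Swiss Standard German.

```python
import re
import unicodedata


def normalize_for_scoring(text: str) -> str:
    """Reduce purely orthographic differences before comparing two transcripts."""
    # Compose characters so that umlauts have a single representation.
    text = unicodedata.normalize("NFC", text).lower()
    # Swiss Standard German writes "ss" where German Standard German writes "ß".
    text = text.replace("ß", "ss")
    # Replace punctuation with spaces; \w keeps letters (including umlauts) and digits.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


print(normalize_for_scoring("Die Straße ist gesperrt."))
print(normalize_for_scoring("die Strasse ist gesperrt"))
```

Both calls print the same string, so the pair would no longer count as an error. Differences in tense, word order, or reformulation are untouched by this kind of normalization and remain a matter of convention.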
Citation
@article{akeret2026whisper-swiss-german,
  title={Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6\% WER (13.8\% cWER)},
  author={Akeret, Felix},
  year={2026},
  url={https://arxiv.org/abs/2606.07608},
  eprint={2606.07608},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Acknowledgments
- OpenAI for the Whisper model
- FHNW/i4ds for the Swiss Parliament Corpus (SPC v2) and ASGDTS benchmark
- SRF for publicly accessible broadcast content
- PlaySuisse for Swiss film and series content