License: apache-2.0
Model Description
- Base model: openai/whisper-large-v3 (1.55B parameters)
- Fine-tuning: LoRA (r=160, α=32, dropout=0.05)
- Trainable parameters: ~1.1B (LoRA weights across q, k, v, out, fc1, fc2)
- Training data: 1,092 hours of Swiss German speech from broadcast subtitles, parliamentary proceedings, and YouTube
- Task: Swiss German speech → Standard German text (dialect-to-standard translation + transcription)
- Hardware: NVIDIA DGX Spark GB10 (128 GB unified memory), single desktop workstation
Performance
| Metric | Value | Notes |
|---|---|---|
| WER (measured) | 25.32% | ASGDTS, 200 samples (seed=42), honest evaluation |
| cWER (content errors only) | 13.9% | Excludes style/convention differences |
| sWER (style component) | 11.3% | Valid alternative translations penalized by WER |
| bWER (bias-corrected) | 8.5% | Estimated true error rate |
| Whisper large-v3 baseline | 28.56% | Zero-shot, no fine-tuning |
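As a reading aid, the gap to the zero-shot baseline can be expressed as a relative reduction. This is a minimal sketch that uses only the two WER values from the table above; the helper name `relative_reduction` is our own and not part of any evaluation toolkit.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` improves on `baseline` (lower is better)."""
    return (baseline - value) / baseline


baseline_wer = 28.56   # Whisper large-v3, zero-shot (from the table)
finetuned_wer = 25.32  # this adapter, measured WER (from the table)

rel = relative_reduction(baseline_wer, finetuned_wer)
print(f"Relative WER reduction vs. baseline: {rel:.1%}")
```

The same helper can be applied to cWER or bWER, but only against a baseline measured under the same error definition.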
Important Context on WER
Our WER of 25.32% should be interpreted carefully:
- ~64% of evaluation samples are semantically correct (KORREKT + STIL categories) but penalized by WER due to transcription convention differences (tense, reformulation style)
- The genuine content error rate is 13.9% cWER; bias-corrected estimation yields 8.5% bWER
- Published lower WER scores (Michaud 17.5%, ZHAW 17.1%) look better than they are because of benchmark contamination; see our paper for details
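To illustrate why a semantically correct output can still be penalized, here is a minimal sketch of word-level WER computed with edit distance. The normalization (lowercasing, whitespace tokenization) and the example sentence pair are assumptions for illustration; the card does not state which normalization was applied on ASGDTS.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical pair: same meaning, different tense (perfect vs. preterite).
reference = "er ist nach hause gegangen"
hypothesis = "er ging nach hause"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

Both sentences say the same thing, yet one substitution and one deletion against five reference words are counted as errors. Differences of this kind are what the STIL category and the sWER component are meant to capture.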
Usage
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel
import torch

base_model_id = "openai/whisper-large-v3"
adapter_id = "Flix-AI/flix-swissgerman-lora"

# Load the base model, then attach the LoRA adapter on top of it
processor = WhisperProcessor.from_pretrained(base_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float32, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

# Transcribe Swiss German audio
audio_array = ...  # numpy array, 16kHz mono
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device)
predicted_ids = model.generate(input_features, language="de", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 160 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Task type | SEQ_2_SEQ_LM |
| PEFT version | 0.18.1 |
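The table above maps onto a PEFT `LoraConfig` roughly as follows. This is a sketch under assumptions, not the author's training script: it assumes standard LoRA scaling (α / r, without rank-stabilized scaling, which the card does not mention) and only builds the config object when `peft` is installed.

```python
# Values copied from the LoRA configuration table.
lora_kwargs = {
    "r": 160,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    "task_type": "SEQ_2_SEQ_LM",
}

# Standard LoRA multiplies the low-rank update by alpha / r (assumed here).
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor (alpha / r): {scaling}")

try:
    from peft import LoraConfig
except ImportError:
    LoraConfig = None  # PEFT not installed; the dict above is still informative

if LoraConfig is not None:
    lora_config = LoraConfig(**lora_kwargs)
    print(lora_config)
```

With α smaller than r, the adapter update is scaled down rather than up, which is worth keeping in mind when comparing against configurations that set α ≥ r.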
Training Details
Data Sources
| Source | Hours | License | Content |
|---|---|---|---|
| SRF Mediathek | 690h | Research use (Art. 24d URG) | Broadcast subtitles (news, entertainment, documentary) |
| Swiss Parliament (SPC v2) | 202h | CC BY 4.0 | Parliamentary speeches (Grosser Rat BE) |
| YouTube | 151h | Research use (Art. 24d URG) | 25 institutional channels (cantons, police, podcasts) |
| PlaySuisse | 49h | Research use (Art. 24d URG) | Swiss films and series |
| Total | 1,092h | | |
No training data is redistributed with this model. The model was trained under the Swiss text and data mining research exception (Art. 24d URG).
Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ (cosine decay) |
| Warmup steps | 500 |
| Effective batch size | 32 |
| Precision | float32 |
| SpecAugment | Enabled |
| Training time | ~60 hours |
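The learning-rate schedule implied by the table (500 warmup steps, peak 2×10⁻⁴, cosine decay) can be sketched as below. The linear shape of the warmup, decay to zero, and the total step count `TOTAL_STEPS` are assumptions for illustration; the card does not state them.

```python
import math

PEAK_LR = 2e-4        # from the table
WARMUP_STEPS = 500    # from the table
TOTAL_STEPS = 10_000  # hypothetical; the real step count is not given


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5_250, 10_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```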
Dialect Coverage
The training data covers all major Swiss German dialect regions:
| Dialect | Primary Source |
|---|---|
| Züridütsch | SRF, YouTube |
| Berndeutsch | SPC v2 (dominant), SRF |
| Luzernerdeutsch | SRF, YouTube |
| Baseldeutsch | SRF, YouTube |
| St. Gallerdeutsch | SRF, YouTube |
| Walliserdeutsch | SRF, PlaySuisse |
| Bündnerdeutsch | YouTube |
| Appenzellerdeutsch | SRF |
Limitations
- Proper nouns: The model may misspell names and places it hasn't encountered during training
- Word order: Swiss German sentence structure sometimes differs from Standard German; the model may produce valid but differently ordered translations
- Convention mismatch: Trained on broadcast subtitles (editorial style), which may differ from verbatim transcription expectations
- No context: The model processes segments independently; it cannot use broader conversation context for disambiguation
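Because segments are processed independently, longer recordings have to be split before transcription. A minimal sketch follows, assuming fixed 30-second windows (Whisper's input length) at 16 kHz with no overlap; the helper `chunk_bounds` is hypothetical and not part of `transformers`.

```python
SAMPLE_RATE = 16_000  # Hz, matching the usage example
CHUNK_SECONDS = 30    # Whisper's fixed input window


def chunk_bounds(num_samples: int,
                 chunk_seconds: int = CHUNK_SECONDS,
                 sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive non-overlapping chunks."""
    chunk = chunk_seconds * sample_rate
    return [(start, min(start + chunk, num_samples))
            for start in range(0, num_samples, chunk)]


# A 70-second recording yields two full chunks and one 10-second remainder.
bounds = chunk_bounds(70 * SAMPLE_RATE)
for start, end in bounds:
    print(f"samples {start:>8} to {end:>8} ({(end - start) / SAMPLE_RATE:.0f} s)")
```

Each `audio_array[start:end]` slice can then be passed through the processor and `generate` separately. Hard cuts can split words at chunk boundaries, so cutting at pauses is likely to give cleaner results.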
Citation
```bibtex
@article{akeret2026whisper-swiss-german,
  title={Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6\% WER (13.8\% cWER)},
  author={Akeret, Felix},
  year={2026},
  url={https://huggingface.co/Flix-AI/flix-swissgerman-lora}
}
```
Acknowledgments
- OpenAI for the Whisper model
- FHNW/i4ds for the Swiss Parliament Corpus (SPC v2) and ASGDTS benchmark
- SRF for publicly accessible broadcast content
- PlaySuisse for Swiss film and series content