License: apache-2.0

Model Description

  • Base model: openai/whisper-large-v3 (1.55B parameters)
  • Fine-tuning: LoRA (r=160, α=32, dropout=0.05)
  • Trainable parameters: ~1.1B (LoRA weights across q, k, v, out, fc1, fc2)
  • Training data: 1,092 hours of Swiss German speech from broadcast subtitles, parliamentary proceedings, and YouTube
  • Task: Swiss German speech → Standard German text (dialect-to-standard translation + transcription)
  • Hardware: NVIDIA DGX Spark GB10 (128 GB unified memory), single desktop workstation

Performance

| Metric | Value | Notes |
| --- | --- | --- |
| WER (measured) | 25.32% | ASGDTS, 200 samples (seed=42), honest evaluation |
| cWER (content errors only) | 13.9% | Excludes style/convention differences |
| sWER (style component) | 11.3% | Valid alternative translations penalized by WER |
| bWER (bias-corrected) | 8.5% | Estimated true error rate |
| Whisper large-v3 baseline | 28.56% | Zero-shot, no fine-tuning |
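
The figures in the table can be related with a little arithmetic. The sketch below computes the improvement over the zero-shot baseline and checks how closely cWER and sWER add up to the measured WER. It assumes the two components are meant to be additive, which the table suggests but does not state.

```python
# Figures copied from the performance table (percent).
wer_finetuned = 25.32
wer_baseline = 28.56
cwer = 13.9
swer = 11.3

# Improvement over zero-shot Whisper large-v3.
absolute_gain = wer_baseline - wer_finetuned
relative_gain = absolute_gain / wer_baseline

# Assumption: content and style components are additive parts of WER.
component_sum = cwer + swer
residual = wer_finetuned - component_sum

print(f"absolute WER reduction: {absolute_gain:.2f} points")
print(f"relative WER reduction: {relative_gain:.1%}")
print(f"cWER + sWER = {component_sum:.1f}% (measured WER minus sum: {residual:.2f} points)")
```

The small residual is consistent with rounding in the reported components.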

Important Context on WER

Our WER of 25.32% should be interpreted carefully:

  • ~64% of evaluation samples are semantically correct (the KORREKT and STIL categories, i.e. correct and style-only differences) but are still penalized by WER because of transcription convention differences (tense, reformulation style)
  • The genuine content error rate is 13.9% cWER; bias-corrected estimation yields 8.5% bWER
  • Published lower WER scores (Michaud 17.5%, ZHAW 17.1%) are affected by benchmark contamination, which makes them look better than they are; see our paper for details
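
To illustrate why a semantically correct output can still score poorly, here is a minimal sketch of word-level WER (edit distance divided by reference length). It is a plain textbook implementation, not the evaluation code used for the numbers above, and the sentence pair is a hypothetical example of a tense convention difference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical pair: same meaning, preterite vs. perfect tense.
reference = "er ging gestern nach hause"
hypothesis = "er ist gestern nach hause gegangen"
print(f"WER = {word_error_rate(reference, hypothesis):.0%}")  # prints "WER = 40%"
```

Both sentences mean "he went home yesterday", yet one substitution and one insertion against five reference words give a 40% WER for this segment.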

Usage

python

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel
import torch

base_model_id = "openai/whisper-large-v3"
adapter_id = "Flix-AI/flix-swissgerman-lora"

# Load the processor and base model, then attach the LoRA adapter.
# device_map="auto" requires the accelerate package.
processor = WhisperProcessor.from_pretrained(base_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float32, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

# Transcribe Swiss German audio
audio_array = ...  # numpy array, 16 kHz mono
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device)

# Pass inputs by keyword: some PEFT seq2seq wrappers accept keyword arguments only.
# language="de" makes the decoder emit Standard German text.
predicted_ids = model.generate(
    input_features=input_features, language="de", task="transcribe"
)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

LoRA Configuration

| Parameter | Value |
| --- | --- |
| Rank (r) | 160 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
| Task type | SEQ_2_SEQ_LM |
| PEFT version | 0.18.1 |
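
As a rough sanity check on what this configuration implies, the sketch below counts the LoRA weights added to the listed target modules and computes the standard LoRA scaling factor α/r. The Whisper large-v3 dimensions (hidden size 1280, feed-forward size 5120, 32 encoder and 32 decoder layers) are assumptions not stated in this card, as is the use of plain α/r scaling rather than a rank-stabilized variant.

```python
# Assumed dimensions for openai/whisper-large-v3 (not stated in this card).
D_MODEL = 1280
FFN_DIM = 5120
ENCODER_LAYERS = 32
DECODER_LAYERS = 32

# From the LoRA configuration table.
RANK = 160
ALPHA = 32


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """Weights in A (rank x in) plus B (out x rank) for one adapted linear layer."""
    return rank * (in_features + out_features)


attn_proj = lora_params(D_MODEL, D_MODEL)  # q_proj, k_proj, v_proj, out_proj
fc1 = lora_params(D_MODEL, FFN_DIM)
fc2 = lora_params(FFN_DIM, D_MODEL)

# Encoder layer: one self-attention block (4 projections) plus the MLP.
encoder_layer = 4 * attn_proj + fc1 + fc2
# Decoder layer: self-attention and cross-attention (8 projections) plus the MLP.
decoder_layer = 8 * attn_proj + fc1 + fc2

total = ENCODER_LAYERS * encoder_layer + DECODER_LAYERS * decoder_layer
scaling = ALPHA / RANK

print(f"LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the count comes to about 288M adapter weights, which is lower than the ~1.1B trainable parameters quoted in the model description, so that figure presumably follows a different counting convention.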

Training Details

Data Sources

| Source | Hours | License | Content |
| --- | --- | --- | --- |
| SRF Mediathek | 690h | Research use (Art. 24d URG) | Broadcast subtitles (news, entertainment, documentary) |
| Swiss Parliament (SPC v2) | 202h | CC BY 4.0 | Parliamentary speeches (Grosser Rat BE) |
| YouTube | 151h | Research use (Art. 24d URG) | 25 institutional channels (cantons, police, podcasts) |
| PlaySuisse | 49h | Research use (Art. 24d URG) | Swiss films and series |
| Total | 1,092h | | |

No training data is redistributed with this model. The model was trained under the Swiss text and data mining research exception (Art. 24d URG).

Training Configuration

| Parameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 2×10⁻⁴ (cosine decay) |
| Warmup steps | 500 |
| Effective batch size | 32 |
| Precision | float32 |
| SpecAugment | Enabled |
| Training time | ~60 hours |
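
The learning-rate schedule in the table can be sketched as follows. The peak rate and warmup length come from the table; the linear warmup shape, the decay to zero, and the total step count are assumptions made only for illustration, since the card does not state them.

```python
import math

PEAK_LR = 2e-4        # from the table
WARMUP_STEPS = 500    # from the table
TOTAL_STEPS = 10_000  # hypothetical; the card does not state the step count


def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup (assumed) followed by cosine decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{learning_rate(step):.2e}")
```

With these assumed values the rate reaches its peak at step 500, is back at half the peak midway through the decay phase, and ends at zero.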

Dialect Coverage

The training data covers all major Swiss German dialect regions:

| Dialect | Primary Source |
| --- | --- |
| Züridütsch | SRF, YouTube |
| Berndeutsch | SPC v2 (dominant), SRF |
| Luzernerdeutsch | SRF, YouTube |
| Baseldeutsch | SRF, YouTube |
| St. Gallerdeutsch | SRF, YouTube |
| Walliserdeutsch | SRF, PlaySuisse |
| Bündnerdeutsch | YouTube |
| Appenzellerdeutsch | SRF |

Limitations

  1. Proper nouns: The model may misspell names and places it hasn't encountered during training
  2. Word order: Swiss German sentence structure sometimes differs from Standard German; the model may produce valid but differently ordered translations
  3. Convention mismatch: Trained on broadcast subtitles (editorial style), which may differ from verbatim transcription expectations
  4. No context: The model processes segments independently; it cannot use broader conversation context for disambiguation
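
Because segments are processed independently, longer recordings have to be split before transcription. The sketch below shows one simple way to do that: fixed, non-overlapping 30-second chunks, matching Whisper's input window. The helper name and the chunking strategy are illustrative assumptions; the card does not describe how segments were produced, and cutting at fixed offsets can split words at chunk boundaries.

```python
SAMPLE_RATE = 16_000  # matches the usage example
CHUNK_SECONDS = 30    # Whisper's fixed input window


def split_into_chunks(audio, chunk_seconds=CHUNK_SECONDS, sample_rate=SAMPLE_RATE):
    """Split a mono waveform into consecutive, non-overlapping chunks.

    Works on any sliceable sequence, including a 1-D numpy array.
    """
    chunk_len = chunk_seconds * sample_rate
    return [audio[start:start + chunk_len] for start in range(0, len(audio), chunk_len)]


# 70 seconds of silence as stand-in audio.
audio = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(audio)
print([len(chunk) / SAMPLE_RATE for chunk in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then be passed through the processor and `generate` call from the usage section, and the resulting texts concatenated in order.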

Citation

bibtex

@article{akeret2026whisper-swiss-german,
  title={Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6\% WER (13.8\% cWER)},
  author={Akeret, Felix},
  year={2026},
  url={https://huggingface.co/Flix-AI/flix-swissgerman-lora}
}

Acknowledgments

  • OpenAI for the Whisper model
  • FHNW/i4ds for the Swiss Parliament Corpus (SPC v2) and ASGDTS benchmark
  • SRF for publicly accessible broadcast content
  • PlaySuisse for Swiss film and series content
