License: apache-2.0

TL;DR
A single fine-tune on ~74.4h of Dutch Common Voice takes WER from ~44.03% (base Whisper-tiny) to ~22.41% (49.1% relative drop) on a held-out, speaker- and sentence-disjoint test split.
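The relative figure is the absolute WER reduction divided by the base model's WER. A minimal check of that arithmetic, using the two WERs quoted in the TL;DR (the function name `relative_reduction` is illustrative, not part of any library):

```python
def relative_reduction(base_wer: float, new_wer: float) -> float:
    """Fraction of the base model's WER removed by the fine-tune."""
    return (base_wer - new_wer) / base_wer


# WERs (in %) on the internal held-out split: base Whisper-tiny vs. this fine-tune.
drop = relative_reduction(44.03, 22.41)
print(f"{drop:.1%}")  # → 49.1%
```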
3-axis evaluation (accuracy / footprint / speed)
All systems are scored on the same held-out panel using one shared text normalizer
(BasicTextNormalizer). RTF (real-time factor) = CPU compute seconds per second of audio (lower is faster).
| Model | Params | Size (fp32, MB) | RTF (CPU) | WER cv17-test (%) | WER fleurs-test (%) | Mean WER (%) |
|---|---|---|---|---|---|---|
| LokaalHub/whisper-klein-nl (ours) | 58M | 230.7 | 0.161 | 28.63 | 40.13 | 34.38 |
| openai/whisper-tiny | 38M | 151.0 | 0.091 | 46.15 | 49.14 | 47.64 |
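To make the three axes concrete, here is a minimal standard-library sketch of how each column can be computed. It rests on several assumptions: the panel itself uses Whisper's `BasicTextNormalizer` from `transformers`, and `simple_normalize` below is only a hypothetical, simplified stand-in (lowercase, punctuation and symbols to spaces, collapsed whitespace) that omits any further rules `BasicTextNormalizer` may apply. WER is aggregated corpus-level (total word edits over total reference words), which is an assumption about how the panel scores each test set. Sizes assume 4 bytes per fp32 parameter and 1 MB = 10^6 bytes. The card does not say how compute seconds were timed, so `real_time_factor` just takes the ratio. All function names are hypothetical.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Hypothetical, simplified stand-in for Whisper's BasicTextNormalizer.

    Lowercases, maps marks/symbols/punctuation to spaces, collapses whitespace.
    """
    text = unicodedata.normalize("NFKC", text.lower())
    text = "".join(" " if unicodedata.category(ch)[0] in "MSP" else ch for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def word_edits(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))  # row for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete reference word r
                curr[j - 1] + 1,         # insert hypothesis word h
                prev[j - 1] + (r != h),  # substitute (free when the words match)
            )
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Accuracy axis: total word edits / total reference words, after normalization."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be the same length")
    errors = words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = simple_normalize(ref_text).split()
        hyp = simple_normalize(hyp_text).split()
        errors += word_edits(ref, hyp)
        words += len(ref)
    return errors / max(words, 1)


def fp32_size_mb(num_params: int) -> float:
    """Footprint axis: 4 bytes per fp32 parameter, 1 MB = 10**6 bytes."""
    return num_params * 4 / 1e6


def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Speed axis: compute seconds per second of audio (lower is faster)."""
    return compute_seconds / audio_seconds


refs = ["Hallo, wereld!", "Dit is een test."]
hyps = ["hallo wereld", "dit is de test"]
print(f"WER: {corpus_wer(refs, hyps):.2%}")  # → WER: 16.67%
print(fp32_size_mb(58_000_000))  # → 232.0
print(round(real_time_factor(1.61, 10.0), 3))  # hypothetical timings → 0.161
```

Under these conventions, 230.7 MB / 4 bytes ≈ 57.7M parameters, consistent with the rounded 58M in the table, and the mean WER column is the plain average of the two test sets ((28.63 + 40.13) / 2 = 34.38).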
Usage
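```python
from transformers import pipeline

# Load the fine-tuned Dutch Whisper checkpoint as an ASR pipeline.
asr = pipeline("automatic-speech-recognition", model="LokaalHub/whisper-klein-nl")

# Force Dutch transcription (not translation) during decoding.
result = asr("audio.wav", generate_kwargs={"language": "nl", "task": "transcribe"})
print(result["text"])
```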
Training
Standard Hugging Face Seq2SeqTrainer fine-tune (bf16), built and verified by the
tiny-asr-loop pipeline.
Limitations
This is a tiny-model fine-tune on read speech (Common Voice). The internal test split is small and speaker-disjoint; see the evaluation table above for FLEURS (out-of-domain) numbers.
Model provider: LokaalHub
Base model: openai/whisper-tiny (this model is fine-tuned from it)
Modalities: audio input, text output