Introduction
This model is OpenAI Whisper large-v3-turbo, finetuned on 1400 hours of audio with manually created verbatim transcriptions from the TalTech Estonian Speech Dataset 1.0 (https://cs.taltech.ee/staff/tanel.alumae/data/est-pub-asr-data/),
around 4000 hours of automatically transcribed Estonian broadcast news data (with contextual biasing, using the accompanying texts, see https://huggingface.co/datasets/TalTechNLP/err-video-news-transcribed) and around 500 hours of English data
from podcasts and YouTube (automatically transcribed using whisper-large-v3-turbo). The model should be especially strong at transcribing Estonian speech in which
English terms are used inside Estonian sentences (e.g. tech podcasts).
[!Note]
2026-06-17 update: model in GGML format uploaded (see ggml subdirectory under files). Use it with tools that depend on whisper.cpp.
Usage
It is a finetuned version of Whisper large-v3-turbo and can therefore be used via Hugging Face 🤗 Transformers. To run the model, first install the Transformers
library. For this example, we'll also install 🤗 Accelerate to reduce the model loading time:
pip install --upgrade pip
pip install --upgrade transformers accelerate
The model can be used with the pipeline class to transcribe audio of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise fall back to the CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "TalTechNLP/whisper-large-v3-turbo-et-verbatim-2604"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Path to the audio file to transcribe.
audio = "demo/etteütlus2024.wav"
result = pipe(audio, generate_kwargs={"task": "transcribe", "language": "et"})
print(result)
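If segment-level timestamps are needed, the Transformers pipeline can be called with return_timestamps=True, in which case the result also contains a "chunks" list of {"timestamp": (start, end), "text": ...} entries. The following is a minimal sketch of turning those chunks into timestamped lines; format_timestamp and format_chunks are hypothetical helper names introduced here for illustration, not part of the model or the library:

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS.mmm (HH:MM:SS.mmm from one hour onwards)."""
    if seconds is None:
        # The pipeline may leave the end time of the last chunk unset.
        return "??:??.???"
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    prefix = f"{hours:02d}:" if hours else ""
    return f"{prefix}{minutes:02d}:{secs:02d}.{ms:03d}"


def format_chunks(chunks):
    """Turn pipeline chunks into '[start --> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}")
    return "\n".join(lines)


# Usage with the `pipe` and `audio` objects from the example above
# (left commented out because it needs the model to be loaded):
# result = pipe(
#     audio,
#     return_timestamps=True,
#     generate_kwargs={"task": "transcribe", "language": "et"},
# )
# print(format_chunks(result["chunks"]))
```

The chunk boundaries depend on how the pipeline segments the audio, so they will not necessarily match the segments produced by other tools.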
There is also a ct2 (CTranslate2) version of the model that can be used with tools based on faster-whisper, such as the whisper-ctranslate2 command-line program:
$ whisper-ctranslate2 --model_directory ct2 --language et --vad_filter True --threads 8 --output_dir demo demo/etteütlus2024.wav
Detected language 'Estonian' with probability 1.000000
[00:00.610 --> 00:08.710] Kas pole teps mitte kihvt, et haridus- ja teadusministeerium paikneb Tartus Munga tänaval?
[00:08.710 --> 00:23.270] Seal ülikooli peahoonest mõne kukesammu kaugusel tuleb pedagoogikaalased otsused langetada kivisse raiutud imposantsete kultuuriheeroste märksa pilgu all.
[00:23.290 --> 00:45.210] Peeter Põllu esimese haridusministri rühikas selg tuletab meelde koolmeistrite määravat osatähtsust ühiskonnas ning üksi silmi teineteist jälgivad Kreutzwald ja Kalevipoeg kõrvu Oskar Lutsuliku kaine literaadi pilguga ei lase unustada Eesti vaimuilma alusväärtusi.
[00:45.210 --> 00:52.670] Vahest peaks valitsusegi Stenbocki majast rahvusülikooli akadeemilisse mõju välja kupatama.
[00:52.670 --> 01:05.850] Nii oleks võimukandjatel ehk mahti ilmavaate turgutamiseks linnaraamatukogust kübekene tarkust nõutada või Tartu Kunstimuuseumis kultustaieseid nautida.
[01:05.850 --> 01:16.270] Piisa torni sarnane majamürakas võib tekitada muidugi äraspidise tunde, et Emajõe Ateenas on alalõpmata midagi viltu.
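The ct2 version can presumably also be loaded directly from Python with the faster-whisper library. The sketch below assumes that faster-whisper is installed and that the ct2 directory has been downloaded to a local path; transcribe_ct2 and collect_segments are hypothetical helper names, and the device and compute type settings are illustrative defaults rather than recommendations from the model authors:

```python
def collect_segments(segments):
    """Materialise segments as (start, end, text) tuples.

    Works with any iterable of objects exposing .start, .end and .text,
    which is what faster-whisper yields.
    """
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


def transcribe_ct2(model_dir, audio_path, language="et"):
    """Transcribe one file with a local CTranslate2 model directory."""
    # Imported lazily so that collect_segments stays usable on its own.
    from faster_whisper import WhisperModel

    # Assumption: let the library choose the device and compute type.
    model = WhisperModel(model_dir, device="auto", compute_type="default")
    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding only happens while the generator is consumed.
    segments, info = model.transcribe(audio_path, language=language, vad_filter=True)
    return collect_segments(segments)


# Usage (commented out because it needs the model files and the audio):
# for start, end, text in transcribe_ct2("ct2", "demo/etteütlus2024.wav"):
#     print(f"{start:.2f} {end:.2f} {text}")
```

Setting vad_filter=True mirrors the --vad_filter True option in the command above.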
Citation
@article{olev2025open,
title={Open source platform for Estonian speech transcription},
author={Olev, Aivo and Alum{\"a}e, Tanel},
journal={Language Resources and Evaluation},
volume={59},
number={4},
pages={4421--4438},
year={2025},
publisher={Springer}
}