shyngys879
kazakh-whisper-large-v3-turbo
License: apache-2.0
Key Features
- Optimized specifically for Kazakh ASR
- Trained on 841k+ speech-transcript pairs
- 1,500+ hours of speech
- Based on Whisper Large-v3 Turbo
- Ready-to-use Transformers checkpoint
- Evaluated on external and internal benchmarks
- Compatible with Hugging Face pipelines
- Suitable for production and research use
Benchmark Summary
FLEURS Kazakh Test
| Model | WER ↓ | CER ↓ |
|---|---|---|
| Kazakh Whisper Large-v3 Turbo | 11.80% | 4.98% |
| Whisper Large-v3 Turbo | 19.75% | 5.05% |
| Wav2Vec2 XLSR Kazakh | 21.75% | 6.24% |
| Whisper Large-v3 | 31.10% | 6.57% |
| Whisper Medium | 48.69% | 10.92% |
| Whisper Small | 70.45% | 21.23% |
Kazakh Whisper Large-v3 Turbo achieves the lowest WER and CER among the evaluated open-source ASR systems on this test set.
Internal Kazakh ASR Test Set
This benchmark uses a fixed 5,000-example held-out test subset from the cleaned Kazakh ASR mixture.
This evaluation is an internal benchmark and should be interpreted as a secondary result. The FLEURS Kazakh test is used as the main external benchmark.
| Model | Examples | WER ↓ | CER ↓ |
|---|---|---|---|
| Kazakh Whisper Large-v3 Turbo | 5,000 | 8.73% | 2.10% |
| Whisper Large-v3 Turbo | 5,000 | 17.37% | 3.73% |
| Wav2Vec2 XLSR Kazakh | 5,000 | 21.07% | 4.22% |
| Whisper Large-v3 | 5,000 | 40.80% | 9.09% |
| Whisper Medium | 5,000 | 59.05% | 14.58% |
| Whisper Small | 5,000 | 77.48% | 23.52% |
Inference Speed Benchmark
Inference speed was measured on 106 audio files (30.27 minutes total audio) using a Kaggle NVIDIA T4 GPU.
This benchmark is intended to provide a practical comparison of recognition speed across commonly used open-source ASR models.
| Model | WER ↓ | CER ↓ | Inference Time ↓ | RTF ↓ | Speed ↑ |
|---|---|---|---|---|---|
| Kazakh Whisper Large-v3 Turbo | 11.80% | 4.98% | 117.28 sec | 0.0646 | 15.48× |
| Whisper Large-v3 Turbo | 19.75% | 5.05% | 125.37 sec | 0.0690 | 14.48× |
| Wav2Vec2 XLSR Kazakh | 21.75% | 6.24% | 11.56 sec | 0.0064 | 157.08× |
| Whisper Large-v3 | 31.10% | 6.57% | 572.48 sec | 0.3153 | 3.17× |
| Whisper Medium | 48.69% | 10.92% | 394.08 sec | 0.2170 | 4.61× |
| Whisper Small | 70.45% | 21.23% | 225.84 sec | 0.1244 | 8.04× |
Notes
- WER (Word Error Rate): lower is better.
- CER (Character Error Rate): lower is better.
- RTF (Real-Time Factor): lower is faster.
- Speed: audio duration divided by inference time. Higher is faster.
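As a rough illustration of the four metrics in these notes, the sketch below computes WER, CER, RTF, and speed in plain Python. The function names are hypothetical, and the model card does not say which tool or text normalization produced the reported numbers, so this is not guaranteed to reproduce them exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def rtf(inference_seconds, audio_seconds):
    """Real-Time Factor: inference time divided by audio duration."""
    return inference_seconds / audio_seconds


def speed(inference_seconds, audio_seconds):
    """Speed: audio duration divided by inference time (the inverse of RTF)."""
    return audio_seconds / inference_seconds


# Figures taken from the speed benchmark: 30.27 minutes of audio, 117.28 sec.
audio_seconds = 30.27 * 60
print(f"RTF   = {rtf(117.28, audio_seconds):.4f}")
print(f"Speed = {speed(117.28, audio_seconds):.2f}x")
```

Because the total audio duration is given rounded to two decimals, values recomputed this way can differ from the table in the last digit.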
The Wav2Vec2 baseline is by far the fastest model (157.08× real time), but Kazakh Whisper Large-v3 Turbo has roughly half its WER on FLEURS (11.80% vs. 21.75%) while still running about 15× faster than real time on a T4.
Among the evaluated models, Kazakh Whisper Large-v3 Turbo has the lowest error rates and is the fastest of the Whisper variants, which gives it the best overall quality-speed tradeoff for Kazakh speech recognition workloads.
Usage
Installation
```bash
pip install transformers accelerate torch torchaudio
```
Quick Start
The model can be used directly with the Hugging Face pipeline() API:
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
)

result = asr(
    "audio.wav",
    generate_kwargs={"language": "kk", "task": "transcribe"},
)
print(result["text"])
```
GPU Inference
For faster inference, use GPU inference with FP16:
```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda",
)

result = asr(
    "audio.wav",
    generate_kwargs={"language": "kk", "task": "transcribe"},
)
print(result["text"])
```
CPU Inference
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    device="cpu",
)

result = asr(
    "audio.wav",
    generate_kwargs={"language": "kk", "task": "transcribe"},
)
print(result["text"])
```
Long Audio
For long audio files such as meetings, podcasts, interviews, and lectures, use chunked inference:
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    chunk_length_s=30,
    batch_size=8,
)

result = asr(
    "meeting.wav",
    generate_kwargs={"language": "kk", "task": "transcribe"},
)
print(result["text"])
```
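To make the idea of chunked inference concrete, here is a minimal sketch of how a long recording can be cut into overlapping fixed-length windows. It is an illustration only and not the pipeline's internal implementation; the function name and the 5-second overlap are assumptions chosen for the example.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0, overlap_seconds=5.0):
    """Return (start, end) times in seconds that cover the whole recording."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    step = chunk_seconds - overlap_seconds
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the audio
        start += step
    return windows


# Example: a hypothetical 70-second recording
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Overlapping neighbouring windows gives the decoder some shared context at each boundary, which is why chunked approaches usually avoid hard cuts.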
Batch Processing
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
)

files = ["audio1.wav", "audio2.wav", "audio3.wav"]

for file in files:
    result = asr(
        file,
        generate_kwargs={"language": "kk", "task": "transcribe"},
    )
    print(file, result["text"])
```
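The loop above transcribes one file at a time. As a possible alternative, the Hugging Face pipeline also accepts a list of inputs together with a batch_size argument, so several files can be grouped per forward pass. The helper below is a sketch: transcribe_many is a hypothetical name, the batch size of 4 is arbitrary, and asr is assumed to be a pipeline object created as in the Quick Start.

```python
def transcribe_many(asr, files, batch_size=4):
    """Transcribe several files in one pipeline call and map each path to its text."""
    results = asr(
        files,
        batch_size=batch_size,
        generate_kwargs={"language": "kk", "task": "transcribe"},
    )
    # The pipeline returns one result dict per input, in input order.
    return {path: result["text"] for path, result in zip(files, results)}
```

A call such as transcribe_many(asr, ["audio1.wav", "audio2.wav"]) would then return a dictionary keyed by file path. Whether batching actually helps depends on GPU memory and how similar the clip lengths are.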
Direct Transformers Usage
The model can also be used directly with WhisperProcessor and WhisperForConditionalGeneration:
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "shyngys879/kazakh-whisper-large-v3-turbo"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

audio_path = "audio.wav"
speech_array, sampling_rate = torchaudio.load(audio_path)

# Whisper expects 16 kHz input, so resample if needed.
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    speech_array = resampler(speech_array)

# torchaudio returns (channels, samples); average the channels so that
# stereo files are downmixed to mono instead of being passed as 2-D input.
audio_array = speech_array.mean(dim=0).numpy()

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(model.device, dtype=torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language="kk",
        task="transcribe",
        max_new_tokens=225,
    )

text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Prediction:", text)
```
Dataset Annotation
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
)

audio_files = ["sample1.wav", "sample2.wav", "sample3.wav"]
annotations = []

for audio_file in audio_files:
    result = asr(
        audio_file,
        generate_kwargs={"language": "kk", "task": "transcribe"},
    )
    annotations.append({"audio": audio_file, "transcript": result["text"]})

print(annotations)
```
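For larger annotation jobs it is usually more practical to write results to disk than to print them. The sketch below stores one JSON object per line (JSON Lines). The name annotate_to_jsonl and the output path are hypothetical, and asr is assumed to be a pipeline created as in the examples above.

```python
import json


def annotate_to_jsonl(asr, audio_files, output_path):
    """Transcribe each file and write {"audio", "transcript"} records as JSON Lines."""
    with open(output_path, "w", encoding="utf-8") as out:
        for audio_file in audio_files:
            result = asr(
                audio_file,
                generate_kwargs={"language": "kk", "task": "transcribe"},
            )
            record = {"audio": audio_file, "transcript": result["text"]}
            # ensure_ascii=False keeps Kazakh Cyrillic text readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing record by record means a partially completed run still leaves usable output if the job is interrupted.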
Best Practices
For best performance:
- Use 16 kHz audio when possible.
- Use language="kk" and task="transcribe".
- Segment very long recordings before transcription.
- Use chunked inference for meetings, podcasts, interviews, and lectures.
- Use GPU inference for large-scale workloads.
- Use clean speech recordings whenever possible.
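The 16 kHz recommendation can be checked before transcription. This sketch reads the sampling rate from a WAV header using only the Python standard library. The helper names wav_sample_rate and needs_resampling are hypothetical, and the approach is assumed to apply only to uncompressed PCM WAV files.

```python
import wave

TARGET_RATE = 16_000  # sampling rate recommended in the list above


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

Files flagged by needs_resampling could then be converted with a resampler such as the torchaudio one used in the Direct Transformers Usage example.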
Training Data
The model was trained on a cleaned and deduplicated mixture of public Kazakh ASR datasets:
- issai/Kazakh_Speech_Corpus_2
- farabi-lab/kazakh-stt
- voice-biomarkers/openslr-140-hq-Kazakh
- Flamme-VRM/kazakh-speech-dataset
- SRP-base-model-training/kazakh_speech_dataset_ksd
- Shirali/ISSAI_KSC_335RS_v_1_1
- SRP-base-model-training/kazakh_speech_corpus_2
- sarulab-speech/yodas2_sidon
Dataset Statistics
| Split | Examples |
|---|---|
| Train | 841,113 |
| Validation | 26,863 |
| Test | 32,439 |
| Total | 900,415 |
Intended Applications
This model can be used for:
- Speech transcription
- Subtitle generation
- Podcast transcription
- Interview transcription
- Meeting transcription
- Voice assistants
- Dataset annotation
- Educational applications
- Call-center analytics
- Kazakh NLP pipelines
Model Details
| Field | Value |
|---|---|
| Model Name | Kazakh Whisper Large-v3 Turbo |
| Architecture | Whisper |
| Base Model | openai/whisper-large-v3-turbo |
| Language | Kazakh |
| Parameters | 0.8B |
| Precision | FP16 |
| Training Examples | 841,113 |
| Approximate Audio | 1,500+ hours |
| Training Steps | 26,286 |
| Effective Epochs | 2 |
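The figures in this table can be cross-checked with a little arithmetic. The sketch below is an inference from the listed numbers, not something the model card states: it assumes each epoch covers all training examples once, that a final partial batch counts as a step, and that the 1,500+ hours refer to the training split. Under those assumptions, the step count is consistent with an effective batch size of 64.

```python
import math

train_examples = 841_113
training_steps = 26_286
epochs = 2
audio_hours = 1_500  # lower bound, listed as "1,500+ hours"

steps_per_epoch = training_steps // epochs

# Approximate number of examples consumed per optimizer step.
approx_batch = train_examples * epochs / training_steps

# Check a candidate effective batch size against the reported step count.
candidate_batch = 64  # assumption being tested, not a documented value
steps_if_candidate = math.ceil(train_examples / candidate_batch) * epochs

# Average clip length implied by the lower bound on total audio.
avg_clip_seconds = audio_hours * 3600 / train_examples

print("steps per epoch:", steps_per_epoch)
print("approx. examples per step:", round(approx_batch, 2))
print("steps with batch size 64:", steps_if_candidate)
print("average clip length (s), at least:", round(avg_clip_seconds, 2))
```

The actual per-device batch size and gradient accumulation settings are not documented, so only the effective value can be estimated this way.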
Limitations
Performance may degrade on:
- Heavy background noise
- Overlapping speakers
- Strong accents or dialects
- Code-switched Kazakh/Russian speech
- Music-heavy recordings
- Very long audio without segmentation
Citation
```bibtex
@misc{sovetkhan2026kazakhwhisper,
  title={Kazakh Whisper Large-v3 Turbo},
  author={Shyngys Sovetkhan},
  year={2026},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/shyngys879/kazakh-whisper-large-v3-turbo}
}
```