shyngys879

kazakh-whisper-large-v3-turbo

License: apache-2.0

Key Features

  • Optimized specifically for Kazakh ASR
  • Trained on 841k+ speech-transcript pairs
  • 1,500+ hours of speech
  • Based on Whisper Large-v3 Turbo
  • Ready-to-use Transformers checkpoint
  • Evaluated on external and internal benchmarks
  • Compatible with Hugging Face pipelines
  • Suitable for production and research use

Benchmark Summary

FLEURS Kazakh Test

| Model | WER ↓ | CER ↓ |
| --- | --- | --- |
| Kazakh Whisper Large-v3 Turbo | 11.80% | 4.98% |
| Whisper Large-v3 Turbo | 19.75% | 5.05% |
| Wav2Vec2 XLSR Kazakh | 21.75% | 6.24% |
| Whisper Large-v3 | 31.10% | 6.57% |
| Whisper Medium | 48.69% | 10.92% |
| Whisper Small | 70.45% | 21.23% |

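Kazakh Whisper Large-v3 Turbo has the lowest WER and CER of the six models evaluated on this benchmark. Relative to the base Whisper Large-v3 Turbo, fine-tuning cuts WER from 19.75% to 11.80%, while CER improves only slightly (5.05% to 4.98%).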

Internal Kazakh ASR Test Set

This benchmark uses a fixed 5,000-example held-out test subset from the cleaned Kazakh ASR mixture.

This evaluation is an internal benchmark and should be interpreted as a secondary result. The FLEURS Kazakh test is used as the main external benchmark.

| Model | Examples | WER ↓ | CER ↓ |
| --- | --- | --- | --- |
| Kazakh Whisper Large-v3 Turbo | 5,000 | 8.73% | 2.10% |
| Whisper Large-v3 Turbo | 5,000 | 17.37% | 3.73% |
| Wav2Vec2 XLSR Kazakh | 5,000 | 21.07% | 4.22% |
| Whisper Large-v3 | 5,000 | 40.80% | 9.09% |
| Whisper Medium | 5,000 | 59.05% | 14.58% |
| Whisper Small | 5,000 | 77.48% | 23.52% |

Inference Speed Benchmark

Inference speed was measured on 106 audio files (30.27 minutes total audio) using a Kaggle NVIDIA T4 GPU.

This benchmark is intended to provide a practical comparison of recognition speed across commonly used open-source ASR models.

| Model | WER ↓ | CER ↓ | Inference Time ↓ | RTF ↓ | Speed ↑ |
| --- | --- | --- | --- | --- | --- |
| Kazakh Whisper Large-v3 Turbo | 11.80% | 4.98% | 117.28 sec | 0.0646 | 15.48× |
| Whisper Large-v3 Turbo | 19.75% | 5.05% | 125.37 sec | 0.0690 | 14.48× |
| Wav2Vec2 XLSR Kazakh | 21.75% | 6.24% | 11.56 sec | 0.0064 | 157.08× |
| Whisper Large-v3 | 31.10% | 6.57% | 572.48 sec | 0.3153 | 3.17× |
| Whisper Medium | 48.69% | 10.92% | 394.08 sec | 0.2170 | 4.61× |
| Whisper Small | 70.45% | 21.23% | 225.84 sec | 0.1244 | 8.04× |

Notes

  • WER (Word Error Rate): lower is better.
  • CER (Character Error Rate): lower is better.
  • RTF (Real-Time Factor): lower is faster.
  • Speed: audio duration divided by inference time. Higher is faster.
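
For readers who want to sanity-check WER and CER on their own transcripts, the following is a minimal sketch of both metrics as edit-distance ratios. The function names are illustrative, and the text normalization used for the benchmarks above is not stated in this card, so scores from this sketch are not guaranteed to match the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example pair: one wrong word out of four.
reference = "мен кітап оқып отырмын"
hypothesis = "мен кітап оқып отырмыз"
print(f"WER {wer(reference, hypothesis):.2%}, CER {cer(reference, hypothesis):.2%}")
```

Corpus-level scores are usually computed as total edits over total reference words (or characters), not as the mean of per-utterance scores.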

The Wav2Vec2 baseline is by far the fastest model (RTF 0.0064), but its WER is nearly twice as high (21.75% vs. 11.80%). Kazakh Whisper Large-v3 Turbo still runs about 15× faster than real time on the T4 and is slightly faster than the base Whisper Large-v3 Turbo.

Among the evaluated models, Kazakh Whisper Large-v3 Turbo therefore offers the best overall quality-speed tradeoff for Kazakh speech recognition workloads.
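
The RTF and Speed columns follow directly from the inference time and the total audio duration. The sketch below recomputes them for the first row of the speed benchmark; `speed_metrics` is an illustrative helper, and the audio duration is taken from the rounded 30.27 minutes quoted above, so the last digit may differ slightly from the table.

```python
def speed_metrics(audio_seconds, inference_seconds):
    """Return (RTF, speed) for one benchmark run.

    RTF   = inference time / audio duration  (lower is faster)
    speed = audio duration / inference time  (higher is faster)
    """
    if audio_seconds <= 0 or inference_seconds <= 0:
        raise ValueError("durations must be positive")
    rtf = inference_seconds / audio_seconds
    speed = audio_seconds / inference_seconds
    return rtf, speed


# 106 files, 30.27 minutes of audio in total (rounded figure from this card).
AUDIO_SECONDS = 30.27 * 60

# Kazakh Whisper Large-v3 Turbo: 117.28 sec of inference time.
rtf, speed = speed_metrics(AUDIO_SECONDS, 117.28)
print(f"RTF {rtf:.4f}, speed {speed:.2f}x")
```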


Usage

Installation

bash

pip install transformers accelerate torch torchaudio

Quick Start

The model can be used directly with the Hugging Face pipeline() API:

python

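from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
)

# Force Kazakh transcription instead of relying on language detection.
result = asr(
    "audio.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe",
    },
)
print(result["text"])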

GPU Inference

For faster inference, run the model on a GPU with FP16 weights:

python

import torch
from transformers import pipeline

# Load the model in half precision on the first CUDA device.
asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda",
)

result = asr(
    "audio.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe",
    },
)
print(result["text"])

CPU Inference

python

from transformers import pipeline

# Run on CPU with the default precision.
asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    device="cpu",
)

result = asr(
    "audio.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe",
    },
)
print(result["text"])

Long Audio

For long audio files such as meetings, podcasts, interviews, and lectures, use chunked inference:

python

from transformers import pipeline

# Split long audio into 30-second chunks and decode 8 chunks per batch.
asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    chunk_length_s=30,
    batch_size=8,
)

result = asr(
    "meeting.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe",
    },
)
print(result["text"])
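
To make the effect of chunk_length_s more concrete, here is a simplified illustration of how a long recording can be covered by overlapping fixed-length windows. It is not the pipeline's actual implementation: `chunk_windows` is a hypothetical helper, and the 5-second stride on each side assumes the pipeline's documented default of chunk_length_s / 6.

```python
def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times, in seconds, of overlapping windows.

    Consecutive windows advance by chunk_s - 2 * stride_s, so each window
    shares stride_s seconds of context with its neighbours on both sides.
    """
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than twice stride_s")
    if total_s <= 0:
        return []
    step = chunk_s - 2 * stride_s
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# A 70-second recording with 30-second chunks and a 5-second stride.
for start, end in chunk_windows(70.0):
    print(f"{start:5.1f} s to {end:5.1f} s")
```

The overlap gives the model context at chunk borders, which is why chunked inference tends to be safer than cutting a recording at arbitrary points.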

Batch Processing

python

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
)

files = [
    "audio1.wav",
    "audio2.wav",
    "audio3.wav",
]

# Transcribe the files one after another.
for file in files:
    result = asr(
        file,
        generate_kwargs={
            "language": "kk",
            "task": "transcribe",
        },
    )
    print(file, result["text"])
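
The loop above sends one file at a time. A Transformers pipeline can also be called with a list of inputs and a batch_size argument, in which case it returns one result per input. The sketch below wraps that call in a hypothetical helper, `transcribe_files`, which takes the pipeline as a parameter so it can be tried with any callable of the same shape; the batch size of 4 is an arbitrary choice, not a recommendation from this card.

```python
def transcribe_files(asr, files, batch_size=4):
    """Transcribe several files in a single pipeline call.

    `asr` is expected to behave like a Transformers ASR pipeline: given a
    list of inputs, it returns a list of dicts that each have a "text" key.
    Returns a dict mapping each file path to its transcript.
    """
    paths = list(files)
    if not paths:
        return {}
    results = asr(
        paths,
        batch_size=batch_size,
        generate_kwargs={"language": "kk", "task": "transcribe"},
    )
    return {path: result["text"] for path, result in zip(paths, results)}


# Usage with the pipeline created above (not executed here):
# transcripts = transcribe_files(asr, ["audio1.wav", "audio2.wav", "audio3.wav"])
# for path, text in transcripts.items():
#     print(path, text)
```

Batching mainly helps on a GPU; on CPU the sequential loop is often just as fast.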

Direct Transformers Usage

The model can also be used directly with WhisperProcessor and WhisperForConditionalGeneration:

python

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "shyngys879/kazakh-whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

audio_path = "audio.wav"

# Load the audio, downmix to mono, and resample to 16 kHz if needed.
speech_array, sampling_rate = torchaudio.load(audio_path)
speech_array = speech_array.mean(dim=0)
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    speech_array = resampler(speech_array)
audio_array = speech_array.numpy()

# Convert the waveform to log-mel input features.
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
)
input_features = inputs.input_features.to(
    model.device,
    dtype=torch.float16,
)

# Generate token IDs without tracking gradients.
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language="kk",
        task="transcribe",
        max_new_tokens=225,
    )

text = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True,
)[0]
print("Prediction:", text)

Dataset Annotation

python

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
)

audio_files = [
    "sample1.wav",
    "sample2.wav",
    "sample3.wav",
]

# Collect one {"audio", "transcript"} record per file.
annotations = []
for audio_file in audio_files:
    result = asr(
        audio_file,
        generate_kwargs={
            "language": "kk",
            "task": "transcribe",
        },
    )
    annotations.append({
        "audio": audio_file,
        "transcript": result["text"],
    })
print(annotations)
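
For anything beyond a quick test, the annotations usually need to be written to disk. The following is one possible sketch using JSON Lines; the helper names and the file name are illustrative, and the record layout simply mirrors the `annotations` list built above. Passing `ensure_ascii=False` keeps Kazakh Cyrillic text readable in the file instead of escaping it.

```python
import json


def save_annotations(annotations, path):
    """Write one JSON object per line, keeping non-ASCII text as-is."""
    with open(path, "w", encoding="utf-8") as f:
        for record in annotations:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_annotations(path):
    """Read the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage with the list built above (file name is a placeholder):
# save_annotations(annotations, "annotations.jsonl")
# annotations = load_annotations("annotations.jsonl")
```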

Best Practices

For best performance:

  • Use 16 kHz audio when possible.
  • Pass language="kk" and task="transcribe" in generate_kwargs so the model does not have to auto-detect the language.
  • Segment very long recordings before transcription.
  • Use chunked inference for meetings, podcasts, interviews, and lectures.
  • Use GPU inference for large-scale workloads.
  • Use clean speech recordings whenever possible.
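
As a small aid for the 16 kHz recommendation, the sketch below inspects a WAV header with Python's standard `wave` module and reports whether a file would need resampling or downmixing. The helper names are illustrative, and the approach only covers uncompressed PCM WAV files; other formats need a different reader.

```python
import wave


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def needs_preprocessing(path, target_rate=16000):
    """True if the file is not mono audio at the target sample rate."""
    rate, channels, _ = wav_info(path)
    return rate != target_rate or channels != 1


# Usage (file name is a placeholder):
# rate, channels, duration = wav_info("audio.wav")
# print(rate, channels, f"{duration:.1f} s", needs_preprocessing("audio.wav"))
```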

Training Data

The model was trained on a cleaned and deduplicated mixture of public Kazakh ASR datasets:

  • issai/Kazakh_Speech_Corpus_2
  • farabi-lab/kazakh-stt
  • voice-biomarkers/openslr-140-hq-Kazakh
  • Flamme-VRM/kazakh-speech-dataset
  • SRP-base-model-training/kazakh_speech_dataset_ksd
  • Shirali/ISSAI_KSC_335RS_v_1_1
  • SRP-base-model-training/kazakh_speech_corpus_2
  • sarulab-speech/yodas2_sidon

Dataset Statistics

| Split | Examples |
| --- | --- |
| Train | 841,113 |
| Validation | 26,863 |
| Test | 32,439 |
| Total | 900,415 |

Intended Applications

This model can be used for:

  • Speech transcription
  • Subtitle generation
  • Podcast transcription
  • Interview transcription
  • Meeting transcription
  • Voice assistants
  • Dataset annotation
  • Educational applications
  • Call-center analytics
  • Kazakh NLP pipelines

Model Details

| Field | Value |
| --- | --- |
| Model Name | Kazakh Whisper Large-v3 Turbo |
| Architecture | Whisper |
| Base Model | openai/whisper-large-v3-turbo |
| Language | Kazakh |
| Parameters | 0.8B |
| Precision | FP16 |
| Training Examples | 841,113 |
| Approximate Audio | 1,500+ hours |
| Training Steps | 26,286 |
| Effective Epochs | 2 |
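
The figures above allow a rough back-of-the-envelope check. This card does not state the batch size or the average clip length, so the numbers derived below are inferences under two assumptions: that "Effective Epochs" means full passes over the training split, and that the audio total is close to the 1,500-hour lower bound.

```python
import math

TRAIN_EXAMPLES = 841_113
TRAINING_STEPS = 26_286
EFFECTIVE_EPOCHS = 2
AUDIO_HOURS = 1_500  # lower bound; the card says "1,500+ hours"

# Examples consumed per optimizer step if every epoch is a full pass.
examples_per_step = TRAIN_EXAMPLES * EFFECTIVE_EPOCHS / TRAINING_STEPS

# Steps that an effective batch size of 64 would need for two epochs.
steps_at_batch_64 = math.ceil(TRAIN_EXAMPLES / 64) * EFFECTIVE_EPOCHS

# Average clip length implied by the lower-bound audio total.
avg_clip_seconds = AUDIO_HOURS * 3600 / TRAIN_EXAMPLES

print(f"examples per step: {examples_per_step:.2f}")
print(f"steps at batch size 64: {steps_at_batch_64}")
print(f"average clip length: {avg_clip_seconds:.2f} s")
```

Under these assumptions the step count is consistent with an effective batch size of 64, and clips average a little over six seconds.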

Limitations

Performance may degrade on:

  • Heavy background noise
  • Overlapping speakers
  • Strong accents or dialects
  • Code-switched Kazakh/Russian speech
  • Music-heavy recordings
  • Very long audio without segmentation

Citation

bibtex

@misc{sovetkhan2026kazakhwhisper,
  title={Kazakh Whisper Large-v3 Turbo},
  author={Shyngys Sovetkhan},
  year={2026},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/shyngys879/kazakh-whisper-large-v3-turbo}
}
