shyngys879/kazakh-whisper-large-v3-turbo API & Inference Endpoint

Key Features

Optimized specifically for Kazakh ASR
Trained on 841k+ speech-transcript pairs
1,500+ hours of speech
Based on Whisper Large-v3 Turbo
Ready-to-use Transformers checkpoint
Evaluated on external and internal benchmarks
Compatible with Hugging Face pipelines
Suitable for production and research use

Benchmark Summary

FLEURS Kazakh Test

Model	WER ↓	CER ↓
Kazakh Whisper Large-v3 Turbo	11.80%	4.98%
Whisper Large-v3 Turbo	19.75%	5.05%
Wav2Vec2 XLSR Kazakh	21.75%	6.24%
Whisper Large-v3	31.10%	6.57%
Whisper Medium	48.69%	10.92%
Whisper Small	70.45%	21.23%

On this benchmark, the model has the lowest WER and CER of the evaluated open-source Kazakh ASR systems.

Internal Kazakh ASR Test Set

This benchmark uses a fixed 5,000-example held-out test subset from the cleaned Kazakh ASR mixture.

This evaluation is an internal benchmark and should be interpreted as a secondary result. The FLEURS Kazakh test is used as the main external benchmark.

Model	Examples	WER ↓	CER ↓
Kazakh Whisper Large-v3 Turbo	5,000	8.73%	2.10%
Whisper Large-v3 Turbo	5,000	17.37%	3.73%
Wav2Vec2 XLSR Kazakh	5,000	21.07%	4.22%
Whisper Large-v3	5,000	40.80%	9.09%
Whisper Medium	5,000	59.05%	14.58%
Whisper Small	5,000	77.48%	23.52%

Inference Speed Benchmark

Inference speed was measured on 106 audio files (30.27 minutes total audio) using a Kaggle NVIDIA T4 GPU.

This benchmark is intended to provide a practical comparison of recognition speed across commonly used open-source ASR models.
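
A measurement of this kind can be sketched as a small timing harness. This is a minimal sketch, not the benchmark script used for the table below: `benchmark`, `transcribe`, and the `(path, duration)` input format are hypothetical names chosen for illustration. It accepts any callable that transcribes one file, so it is not tied to a particular model.

```python
import time
from typing import Callable, Iterable, Tuple


def benchmark(
    transcribe: Callable[[str], str],
    clips: Iterable[Tuple[str, float]],
) -> Tuple[float, float]:
    """Return (total_inference_seconds, total_audio_seconds).

    `clips` yields (path, duration_in_seconds) pairs. `transcribe` is any
    callable that takes a file path and returns text (hypothetical interface).
    """
    inference_s = 0.0
    audio_s = 0.0
    for path, duration_s in clips:
        start = time.perf_counter()
        transcribe(path)  # only the transcription call is timed
        inference_s += time.perf_counter() - start
        audio_s += duration_s
    return inference_s, audio_s
```

With a Hugging Face pipeline object named `asr`, the callable could be `lambda path: asr(path)["text"]`. Running one untimed warm-up call first is usually sensible, so that one-off setup cost is not counted.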

Model	WER ↓	CER ↓	Inference Time ↓	RTF ↓	Speed ↑
Kazakh Whisper Large-v3 Turbo	11.80%	4.98%	117.28 sec	0.0646	15.48×
Whisper Large-v3 Turbo	19.75%	5.05%	125.37 sec	0.0690	14.48×
Wav2Vec2 XLSR Kazakh	21.75%	6.24%	11.56 sec	0.0064	157.08×
Whisper Large-v3	31.10%	6.57%	572.48 sec	0.3153	3.17×
Whisper Medium	48.69%	10.92%	394.08 sec	0.2170	4.61×
Whisper Small	70.45%	21.23%	225.84 sec	0.1244	8.04×

Notes

WER (Word Error Rate): lower is better.
CER (Character Error Rate): lower is better.
RTF (Real-Time Factor): inference time divided by audio duration. Lower is faster.
Speed: audio duration divided by inference time. Higher is faster.
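
The four metrics can be written out in a few lines. This is a minimal sketch under stated assumptions: the function names are illustrative, WER and CER are computed as plain Levenshtein distance over whitespace-split words and raw characters, and no text normalization is applied. The normalization used for the reported numbers is not described here, so results from this sketch may differ from the tables.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character errors divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def rtf(inference_s: float, audio_s: float) -> float:
    """Real-time factor: inference time over audio duration."""
    return inference_s / audio_s


def speed(inference_s: float, audio_s: float) -> float:
    """Speed multiplier: audio duration over inference time."""
    return audio_s / inference_s
```

As a check against the table, `rtf(117.28, 30.27 * 60)` is about 0.0646 and `speed(117.28, 30.27 * 60)` is about 15.5, in line with the first row.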

While the Wav2Vec2 baseline is the fastest model by a wide margin, Kazakh Whisper Large-v3 Turbo achieves substantially lower error rates while still transcribing about 15× faster than real time on the T4 GPU used here.

Among the evaluated models, Kazakh Whisper Large-v3 Turbo has the lowest error rates and the shortest inference time of the Whisper variants, which gives it the best overall quality-speed tradeoff for Kazakh speech recognition workloads.

Usage

Installation

bash
pip install transformers accelerate torch torchaudio

Quick Start

The model can be used directly with the Hugging Face pipeline() API:

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo"
)

result = asr(
    "audio.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe"
    }
)

print(result["text"])

GPU Inference

For faster inference, run the model on a GPU in FP16:

python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device="cuda"
)

result = asr(
    "audio.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe"
    }
)

print(result["text"])

CPU Inference

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    device="cpu"
)

result = asr(
    "audio.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe"
    }
)

print(result["text"])
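
The GPU and CPU variants above differ only in device and precision, so a script that has to run in both environments can choose them at startup. The helper below is a minimal sketch with a hypothetical name, `pick_device_and_dtype`. It assumes FP16 should be paired with CUDA only, since half precision is often slower or less well supported on CPU.

```python
from typing import Tuple


def pick_device_and_dtype(cuda_available: bool) -> Tuple[str, str]:
    """Return (device, dtype_name) for the pipeline call.

    FP16 is used only together with CUDA. On CPU the default FP32 is kept.
    """
    if cuda_available:
        return "cuda", "float16"
    return "cpu", "float32"
```

In a script, pass `torch.cuda.is_available()` as the argument, then give `device=device` and `torch_dtype=getattr(torch, dtype_name)` to `pipeline()`.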

Long Audio

For long audio files such as meetings, podcasts, interviews, and lectures, use chunked inference:

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo",
    chunk_length_s=30,
    batch_size=8
)

result = asr(
    "meeting.wav",
    generate_kwargs={
        "language": "kk",
        "task": "transcribe"
    }
)

print(result["text"])
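
For subtitle work, the pipeline can be called with `return_timestamps=True`, in which case the result also carries a `chunks` list whose items have a `timestamp` pair (start, end, in seconds) and a `text` field. The sketch below turns such a list into SRT text. `format_srt_time` and `chunks_to_srt` are hypothetical helper names, and treating a missing end time as equal to the start time is an assumption made for illustration.

```python
from typing import Iterable, Mapping


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: Iterable[Mapping]) -> str:
    """Render timestamped chunks as numbered SRT entries."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: an end time may be missing, so fall back to start.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)
```

Usage would look like `chunks_to_srt(asr("meeting.wav", return_timestamps=True)["chunks"])`, with the result written to a `.srt` file in UTF-8.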

Batch Processing

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo"
)

files = [
    "audio1.wav",
    "audio2.wav",
    "audio3.wav"
]

for file in files:
    result = asr(
        file,
        generate_kwargs={
            "language": "kk",
            "task": "transcribe"
        }
    )

    print(file, result["text"])
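
The loop above transcribes one file per call. The pipeline also accepts a list of inputs together with a `batch_size` argument, which lets it group files on the device. The wrapper below is a minimal sketch: `transcribe_many` is a hypothetical name, and the `asr` argument is expected to be a pipeline object created as shown above.

```python
from typing import Callable, Dict, Sequence


def transcribe_many(
    asr: Callable,
    files: Sequence[str],
    batch_size: int = 4,
) -> Dict[str, str]:
    """Transcribe several files in one call and map each path to its text."""
    results = asr(
        list(files),
        batch_size=batch_size,
        generate_kwargs={"language": "kk", "task": "transcribe"},
    )
    # Results come back in the same order as the inputs.
    return {path: out["text"].strip() for path, out in zip(files, results)}
```

A suitable `batch_size` depends on available memory and clip length, so the default of 4 here is only a placeholder.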

Direct Transformers Usage

The model can also be used directly with WhisperProcessor and WhisperForConditionalGeneration:

python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "shyngys879/kazakh-whisper-large-v3-turbo"

processor = WhisperProcessor.from_pretrained(model_id)

model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

audio_path = "audio.wav"

speech_array, sampling_rate = torchaudio.load(audio_path)

if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    speech_array = resampler(speech_array)

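# Downmix to mono: torchaudio returns (channels, samples), and squeeze()
# alone would leave stereo audio as a 2-D array.
audio_array = speech_array.mean(dim=0).numpy()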

inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt"
)

input_features = inputs.input_features.to(
    model.device,
    dtype=torch.float16
)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language="kk",
        task="transcribe",
        max_new_tokens=225
    )

text = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print("Prediction:", text)

Dataset Annotation

python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="shyngys879/kazakh-whisper-large-v3-turbo"
)

audio_files = [
    "sample1.wav",
    "sample2.wav",
    "sample3.wav"
]

annotations = []

for audio_file in audio_files:
    result = asr(
        audio_file,
        generate_kwargs={
            "language": "kk",
            "task": "transcribe"
        }
    )

    annotations.append({
        "audio": audio_file,
        "transcript": result["text"]
    })

print(annotations)
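
For annotation runs it is usually more practical to persist the results than to print them. The sketch below writes the `annotations` list to a JSON Lines file and reads it back. `write_manifest`, `read_manifest`, and the one-object-per-line layout are illustrative choices, not a format prescribed by the model. `ensure_ascii=False` keeps Kazakh Cyrillic text readable in the file.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def write_manifest(annotations: Iterable[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII text as-is."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in annotations:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_manifest(path: str) -> List[Dict[str, str]]:
    """Read a JSON Lines manifest back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_manifest(annotations, "annotations.jsonl")` stores the list built in the loop above.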

Best Practices

For best performance:

Use 16 kHz audio when possible.
Use language="kk" and task="transcribe".
Segment very long recordings before transcription.
Use chunked inference for meetings, podcasts, interviews, and lectures.
Use GPU inference for large-scale workloads.
Use clean speech recordings whenever possible.
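
The first recommendation can be checked before transcription. The sketch below reads a file's header with Python's standard `wave` module and reports whether it is already 16 kHz mono. `inspect_wav` is a hypothetical helper name, and the `wave` module only handles uncompressed PCM WAV files, so other formats need a different reader.

```python
import wave
from typing import Dict, Union


def inspect_wav(path: str) -> Dict[str, Union[int, float, bool]]:
    """Report sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / sample_rate
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": duration_s,
        "is_16k_mono": sample_rate == 16000 and channels == 1,
    }
```

Files for which `is_16k_mono` is false are candidates for resampling or downmixing, and `duration_s` helps spot recordings long enough to need chunked inference.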

Training Data

The model was trained on a cleaned and deduplicated mixture of public Kazakh ASR datasets:

issai/Kazakh_Speech_Corpus_2
farabi-lab/kazakh-stt
voice-biomarkers/openslr-140-hq-Kazakh
Flamme-VRM/kazakh-speech-dataset
SRP-base-model-training/kazakh_speech_dataset_ksd
Shirali/ISSAI_KSC_335RS_v_1_1
SRP-base-model-training/kazakh_speech_corpus_2
sarulab-speech/yodas2_sidon

Dataset Statistics

Split	Examples
Train	841,113
Validation	26,863
Test	32,439
Total	900,415

Intended Applications

This model can be used for:

Speech transcription
Subtitle generation
Podcast transcription
Interview transcription
Meeting transcription
Voice assistants
Dataset annotation
Educational applications
Call-center analytics
Kazakh NLP pipelines

Model Details

Field	Value
Model Name	Kazakh Whisper Large-v3 Turbo
Architecture	Whisper
Base Model	openai/whisper-large-v3-turbo
Language	Kazakh
Parameters	0.8B
Precision	FP16
Training Examples	841,113
Approximate Audio	1,500+ hours
Training Steps	26,286
Effective Epochs	2
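
Two quick pieces of arithmetic follow from the figures above. The effective batch size is not stated in this card. The value of 64 used below is an assumption, chosen only because it reproduces the listed step count over 2 epochs. The weight-size figure counts parameters at 2 bytes each and ignores activations and other runtime memory.

```python
import math

# Values taken from the Model Details table.
TRAIN_EXAMPLES = 841_113
TRAINING_STEPS = 26_286
EFFECTIVE_EPOCHS = 2
PARAMETERS = 0.8e9  # approximate, as listed

# Assumption: not stated in the source.
ASSUMED_EFFECTIVE_BATCH = 64

# Optimizer steps needed to see every example once, then for all epochs.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / ASSUMED_EFFECTIVE_BATCH)
implied_steps = steps_per_epoch * EFFECTIVE_EPOCHS

# FP16 stores each parameter in 2 bytes.
fp16_weight_gb = PARAMETERS * 2 / 1e9

print(steps_per_epoch, implied_steps, fp16_weight_gb)
```

Under that assumption the implied step count equals the 26,286 steps listed, and the FP16 weights alone come to roughly 1.6 GB.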

Limitations

Performance may degrade on:

Heavy background noise
Overlapping speakers
Strong accents or dialects
Code-switched Kazakh/Russian speech
Music-heavy recordings
Very long audio without segmentation
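
For the last item, one simple way to segment a long recording is to cut it into fixed windows with a small overlap, so that words at a boundary appear whole in at least one window. The sketch below only computes the window boundaries in seconds. `segment_bounds` and its default window and overlap lengths are hypothetical choices, and merging the overlapping transcripts afterwards is left out.

```python
from typing import List, Tuple


def segment_bounds(
    total_s: float,
    window_s: float = 30.0,
    overlap_s: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times covering [0, total_s] with overlapping windows."""
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    total_s = float(total_s)
    step = window_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window reached the end of the recording
        start += step
    return bounds
```

Each pair can then be used to slice the waveform, multiplying the times by the sample rate, before passing the segment to the model.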

Citation

bibtex
@misc{sovetkhan2026kazakhwhisper,
  title={Kazakh Whisper Large-v3 Turbo},
  author={Shyngys Sovetkhan},
  year={2026},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/shyngys879/kazakh-whisper-large-v3-turbo}
}
