License: other
Model Details
| Field | Value |
|---|---|
| Base model | openai/whisper-small |
| Adaptation method | LoRA / PEFT |
| Model type | Whisper encoder-decoder ASR model |
| Task | Automatic Speech Recognition / Speech-to-Text |
| Languages | Kazakh, Russian, Kazakh-Russian mixed speech |
| Repository type | LoRA adapter |
| Project context | Academic thesis research |
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank r | 64 |
| Alpha α | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, v_proj, k_proj, out_proj, fc1, fc2 |
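To give a sense of what these settings imply, the sketch below computes the standard LoRA scaling factor (alpha divided by rank) and a rough count of the parameters the adapter adds. The layer widths and layer counts are assumptions taken from the public openai/whisper-small configuration, and it is also assumed that the target names match both the self-attention and cross-attention projections in the decoder. The helper name `lora_param_count` is illustrative and is not part of PEFT.

```python
# Rough LoRA size estimate for the configuration in the table above.
# Assumed from the public openai/whisper-small config:
# d_model=768, ffn_dim=3072, 12 encoder layers, 12 decoder layers.
RANK = 64
ALPHA = 128
D_MODEL = 768
FFN_DIM = 3072
ENCODER_LAYERS = 12
DECODER_LAYERS = 12


def lora_param_count(in_features: int, out_features: int, rank: int = RANK) -> int:
    """Parameters added by one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


# Standard LoRA scales the low-rank update by alpha / r.
scaling = ALPHA / RANK

# q_proj, k_proj, v_proj, and out_proj are all d_model x d_model.
attn_proj = lora_param_count(D_MODEL, D_MODEL)
# fc1 and fc2 have swapped shapes, so they add the same number of parameters.
ffn_proj = lora_param_count(D_MODEL, FFN_DIM)

# Encoder layer: one attention block. Decoder layer: self- and cross-attention.
encoder_layer = 4 * attn_proj + 2 * ffn_proj
decoder_layer = 8 * attn_proj + 2 * ffn_proj
total = ENCODER_LAYERS * encoder_layer + DECODER_LAYERS * decoder_layer

print(f"LoRA scaling (alpha / r): {scaling}")
print(f"Estimated adapter parameters: {total:,}")
```

The estimate only covers the low-rank matrices; the exact number reported by PEFT for the published adapter may differ if the assumptions above do not hold.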
Recommended Use
This model is mainly intended for:
- academic ASR research;
- Kazakh speech recognition experiments;
- Kazakh-Russian mixed-speech transcription;
- code-switching ASR evaluation;
- comparison with other KRASR ASR models;
- reproducibility of thesis experiments;
- demonstration in speech-to-text applications.
For best results, use it on speech where Kazakh is dominant and Russian appears as inserted words or phrases.
Quick Start
Install the required libraries:
```bash
pip install -U transformers peft accelerate torch librosa soundfile evaluate tqdm
```
Run inference on one audio file
```python
import torch
import librosa
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

base_model_id = "openai/whisper-small"
adapter_id = "KRASR/kazakh-russian-asr-whisper-small-lora"
audio_path = "audio.wav"

# Use the GPU with half precision when available, otherwise CPU with float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline_device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

processor = WhisperProcessor.from_pretrained(base_model_id)
base_model = WhisperForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)

# Load the LoRA adapter on top of the base model and merge its weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()
model.to(device)
model.eval()

asr = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=pipeline_device,
)

# Whisper expects 16 kHz mono audio.
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

result = asr(
    {"array": audio, "sampling_rate": sr},
    generate_kwargs={
        "task": "transcribe",
        "language": "kazakh",
        "num_beams": 3,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.12,
    },
)

print(result["text"])
```
Decoding Notes
For normal testing, avoid using a fixed `max_new_tokens` value such as 96.
A fixed limit can accidentally cut off longer transcriptions.
A good starting point for Kazakh-dominant mixed speech is:
```python
generate_kwargs = {
    "task": "transcribe",
    "language": "kazakh",
    "num_beams": 3,
    "no_repeat_ngram_size": 4,
    "repetition_penalty": 1.12,
}
```
Why forced Kazakh decoding?
In mixed Kazakh-Russian speech, automatic language detection can be unstable, especially on short utterances.
If the audio is mostly Kazakh with Russian insertions, forcing Kazakh decoding usually keeps the transcription closer to the target speech domain.
Russian words may still appear in the output when the model recognizes them from the audio.
Optional dynamic token limit for apps or batch evaluation
For applications or controlled batch evaluation, a dynamic limit is safer than one fixed value.
The following helper follows the same idea as the KRASR demo module: short clips receive a smaller output limit, while longer clips receive a larger limit.
```python
def build_generate_kwargs(
    audio_duration_sec: float | None,
    language: str | None = "kazakh",
    num_beams: int = 3,
) -> dict:
    # Longer clips get a larger output-token budget.
    if audio_duration_sec is None:
        max_new_tokens = 96
    elif audio_duration_sec <= 5:
        max_new_tokens = 40
    elif audio_duration_sec <= 10:
        max_new_tokens = 64
    elif audio_duration_sec <= 15:
        max_new_tokens = 80
    elif audio_duration_sec <= 20:
        max_new_tokens = 96
    elif audio_duration_sec <= 25:
        max_new_tokens = 112
    else:
        max_new_tokens = 128

    generate_kwargs = {
        "task": "transcribe",
        "num_beams": int(num_beams),
        "max_new_tokens": max_new_tokens,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.12,
    }

    # Omit the language key to fall back to automatic language detection.
    if language is not None:
        generate_kwargs["language"] = language

    return generate_kwargs
```
Use this only when you specifically need output-length control. For ordinary one-file testing, starting without `max_new_tokens` is usually simpler.
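If the duration thresholds need to be tuned, the same mapping can be written as a lookup table instead of an if/elif chain. The following is a minimal sketch of that idea using the thresholds from the helper above; the names `DURATION_BOUNDS_SEC`, `TOKEN_LIMITS`, and `pick_max_new_tokens` are illustrative and do not come from the KRASR demo module.

```python
from bisect import bisect_left
from typing import Optional

# Upper duration bounds in seconds and the token limit for each bucket.
# The last limit applies to clips longer than the final bound.
DURATION_BOUNDS_SEC = [5, 10, 15, 20, 25]
TOKEN_LIMITS = [40, 64, 80, 96, 112, 128]
DEFAULT_LIMIT = 96  # used when the clip duration is unknown


def pick_max_new_tokens(audio_duration_sec: Optional[float]) -> int:
    """Return the output-token limit for a clip of the given duration."""
    if audio_duration_sec is None:
        return DEFAULT_LIMIT
    # bisect_left keeps each bound inclusive, matching the "<=" comparisons.
    bucket = bisect_left(DURATION_BOUNDS_SEC, audio_duration_sec)
    return TOKEN_LIMITS[bucket]


for seconds in (3.0, 12.5, 40.0):
    print(seconds, pick_max_new_tokens(seconds))
```

Keeping the bounds and limits in two lists makes it easier to adjust a single bucket without touching the control flow.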
Evaluation Results
| Evaluation set | WER | CER | Notes |
|---|---|---|---|
| Test-MIXED | 0.5626 | 0.3308 | Main Kazakh-Russian mixed-speech test set |
| Test-KK | 0.5385 | - | Internal pure Kazakh test set |
| Test-RU | 0.7879 | - | Internal pure Russian test set |
| FLEURS-KK | 0.7603 | - | External Kazakh benchmark |
| FLEURS-RU | 0.7740 | - | External Russian benchmark |
The model significantly improved over the original Whisper-Small baseline on the main mixed-speech test set, but it is not the strongest model in the KRASR collection. Russian-only recognition became less stable after adaptation to Kazakh-dominant mixed speech.
The thesis evaluation used beam-search-based decoding and output-length control for the main fine-tuned Whisper-Small comparison.
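The WER and CER values in the table are fractions, not percentages, so 0.5626 corresponds to 56.26%. As a reminder of how these metrics are defined, the sketch below computes both from a Levenshtein edit distance in plain Python. It is an illustration only: the function names are hypothetical, the sample sentences are made up, and the `evaluate` metrics may differ in details such as text normalization and corpus-level aggregation.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance: substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for one utterance; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one utterance; spaces count as characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example: one of four reference words is misrecognized.
reference = "мен бүгін университетке бардым"
hypothesis = "мен бүгін университетке барды"
print(f"WER: {wer(reference, hypothesis):.4f}")
print(f"CER: {cer(reference, hypothesis):.4f}")
```

Because a single wrong character makes the whole word count as an error, WER is usually noticeably higher than CER on the same data, which is consistent with the Test-MIXED row above.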
Batch Evaluation Example
The following example shows how to test the model on a simple JSONL manifest.
Expected manifest format:
```json
{"audio": "path/to/audio.wav", "text": "reference transcription"}
```
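A manifest in this format can be written and checked with the standard library. The sketch below is one possible way to do it; the helper names `write_manifest` and `read_manifest` are hypothetical, and the demo row uses a placeholder path. Passing `ensure_ascii=False` keeps Kazakh and Russian text readable in the file instead of escaping it.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"audio", "text"}


def write_manifest(rows, manifest_path):
    """Write one JSON object per line with the keys the evaluation expects."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for row in rows:
            record = {"audio": row["audio"], "text": row["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Load the manifest and fail early if a row lacks a required key."""
    rows = []
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = REQUIRED_KEYS - set(row)
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            rows.append(row)
    return rows


# Round-trip demo in a temporary directory.
demo_rows = [{"audio": "path/to/audio.wav", "text": "сәлем әлем"}]
demo_path = os.path.join(tempfile.mkdtemp(), "test_manifest.jsonl")
write_manifest(demo_rows, demo_path)
loaded = read_manifest(demo_path)
print(loaded)
```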
Evaluation script:
```python
import json
import torch
import librosa
import evaluate
from tqdm import tqdm
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

base_model_id = "openai/whisper-small"
adapter_id = "KRASR/kazakh-russian-asr-whisper-small-lora"
manifest_path = "test_manifest.jsonl"

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline_device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")


def build_generate_kwargs(audio_duration_sec, language="kazakh", num_beams=3):
    # Longer clips get a larger output-token budget.
    if audio_duration_sec is None:
        max_new_tokens = 96
    elif audio_duration_sec <= 5:
        max_new_tokens = 40
    elif audio_duration_sec <= 10:
        max_new_tokens = 64
    elif audio_duration_sec <= 15:
        max_new_tokens = 80
    elif audio_duration_sec <= 20:
        max_new_tokens = 96
    elif audio_duration_sec <= 25:
        max_new_tokens = 112
    else:
        max_new_tokens = 128

    generate_kwargs = {
        "task": "transcribe",
        "num_beams": int(num_beams),
        "max_new_tokens": max_new_tokens,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.12,
    }

    if language is not None:
        generate_kwargs["language"] = language

    return generate_kwargs


processor = WhisperProcessor.from_pretrained(base_model_id)
base_model = WhisperForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)

# Load the LoRA adapter on top of the base model and merge its weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()
model.to(device)
model.eval()

asr = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=pipeline_device,
)

predictions = []
references = []

# Each manifest line is one JSON object with "audio" and "text" keys.
with open(manifest_path, "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

for row in tqdm(rows):
    audio, sr = librosa.load(row["audio"], sr=16000, mono=True)
    duration_sec = len(audio) / sr

    result = asr(
        {"array": audio, "sampling_rate": sr},
        generate_kwargs=build_generate_kwargs(
            audio_duration_sec=duration_sec,
            language="kazakh",
            num_beams=3,
        ),
    )

    predictions.append(result["text"])
    references.append(row["text"])

wer = wer_metric.compute(predictions=predictions, references=references)
cer = cer_metric.compute(predictions=predictions, references=references)

print(f"WER: {wer:.4f}")
print(f"CER: {cer:.4f}")
```
Training Data
The model was fine-tuned using the KRASR/kazakh-russian-asr-dataset, prepared for Kazakh and Kazakh-Russian mixed-speech ASR experiments.
The dataset preparation workflow included:
- source selection;
- audio segmentation;
- transcription review;
- text normalization;
- train/validation/test split preparation;
- evaluation setup for mixed-language ASR.
The dataset was prepared for speech recognition only. It was not designed for speaker identification, biometric analysis, or demographic classification.
Preprocessing
Audio and text were prepared using a consistent ASR preprocessing pipeline.
Audio preprocessing:
- mono audio;
- 16 kHz sampling rate;
- short-segment ASR setting.
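Before running inference or evaluation, it can be useful to confirm that a file already matches the mono, 16 kHz setting listed above. The sketch below does this for uncompressed PCM WAV files using only the standard library. The helper names are hypothetical, and the demo file is generated silence, not project data. Other formats would need a library such as librosa or soundfile instead.

```python
import os
import tempfile
import wave


def inspect_wav(path):
    """Return (channels, sample_rate, duration_sec) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration_sec = wav.getnframes() / sample_rate
    return channels, sample_rate, duration_sec


def matches_expected_format(path, expected_rate=16000, expected_channels=1):
    """True if the file is already mono at the expected sampling rate."""
    channels, sample_rate, _ = inspect_wav(path)
    return channels == expected_channels and sample_rate == expected_rate


# Demo: write one second of 16-bit silence at 16 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(inspect_wav(demo_path), matches_expected_format(demo_path))
```

Note that `librosa.load(path, sr=16000, mono=True)`, as used in the Quick Start, resamples and downmixes on load, so this check is only a sanity test of the source files.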
Text normalization included:
- lowercasing;
- whitespace normalization;
- punctuation cleanup;
- preservation of Kazakh-specific letters;
- preservation of Russian words in mixed utterances;
- removal of formatting noise that does not affect transcription meaning.
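The normalization steps above can be approximated with a few lines of Python. This is a minimal sketch under assumptions, not the thesis pipeline itself: the function name `normalize_text` is hypothetical, all punctuation is replaced by spaces, and the exact rules used for the dataset (for example for hyphens or digits) are not specified here. Python's Unicode-aware `\w` keeps Kazakh-specific letters such as ә, ғ, қ, ң, ө, ұ, ү, һ, and і alongside Russian ones.

```python
import re
import unicodedata

_PUNCTUATION = re.compile(r"[^\w\s]")  # anything that is not a letter, digit, "_", or space
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    # Compose characters first so letters like "й" are single code points.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    text = _PUNCTUATION.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


print(normalize_text("  Сәлем,   ӘЛЕМ! "))
print(normalize_text("Мен бүгін, короче, ҰЙЫҚТАДЫМ…"))
```

Applying the same function to both references and model outputs before scoring keeps WER and CER from being inflated by casing or punctuation differences.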
Known Limitations
The model may make errors on:
- very short audio clips;
- noisy recordings;
- overlapping speech;
- informal conversational speech;
- rare names, places, and domain-specific terms;
- silent or low-quality audio.
Like other Whisper-based models, it may sometimes produce extra words or hallucinated text, especially when the input audio is too short, unclear, or contains long silence.
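One simple way to reduce this risk in an application is to skip clips that are too short or nearly silent before they reach the model. The sketch below illustrates the idea on a plain list of samples in the range [-1, 1]. The function name and both thresholds are hypothetical values chosen for illustration and would need tuning on real recordings; this check is not part of the published model or its evaluation.

```python
import math

MIN_DURATION_SEC = 0.5  # hypothetical: skip clips shorter than this
MIN_RMS = 0.005         # hypothetical: skip clips quieter than this


def should_transcribe(samples, sampling_rate=16000):
    """Return False for empty, very short, or near-silent audio."""
    if len(samples) == 0:
        return False
    if len(samples) / sampling_rate < MIN_DURATION_SEC:
        return False
    # Root-mean-square energy as a crude loudness measure.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= MIN_RMS


# Demo inputs: one second of silence and one second of a 440 Hz tone.
silence = [0.0] * 16000
tone = [0.1 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]

print(should_transcribe(silence), should_transcribe(tone))
```

An energy check like this does not detect long silent stretches inside an otherwise loud clip; a voice activity detector would be needed for that.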
Out-of-Scope Use
This model is not intended for:
- speaker identification;
- biometric profiling;
- demographic classification;
- surveillance or tracking of individuals;
- high-stakes decision-making systems;
- production deployment without additional validation.
Project Context
KRASR was created as part of an academic thesis project on automatic Kazakh speech-to-text conversion using fine-tuned multilingual ASR models.
The project compares five systems on Kazakh, Russian, and Kazakh-Russian mixed speech: the Whisper-Small baseline, Whisper-Small LoRA, Whisper-Small full fine-tuning, XLS-R 1B CTC, and Whisper Large-v3 LoRA.
Related Repositories
- KRASR/kazakh-russian-asr-dataset
- KRASR/kazakh-russian-asr-whisper-small-lora
- KRASR/kazakh-russian-asr-whisper-small-full-ft
- KRASR/kazakh-russian-asr-whisper-large-v3-lora
- KRASR/kazakh-russian-asr-xls-r-1b-ctc
- KRASR/kazakh-russian-speech-to-text-module
Citation
There is no formal publication for this model yet.
If you use this model or dataset in academic work, please cite or mention the KRASR Hugging Face repository and the related thesis project:
```bibtex
@misc{krasr_whisper_small_lora_2026,
  title        = {Kazakh-Russian ASR Whisper Small LoRA},
  author       = {Mukhambet, Madiyar and Makhmud, Danial},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/KRASR/kazakh-russian-asr-whisper-small-lora}},
  note         = {LoRA adapter for Kazakh and Kazakh-Russian mixed-speech ASR}
}
```
Related thesis project:
Madiyar Mukhambet and Danial Makhmud.
Development of a Software Module for Automatic Kazakh Speech-to-Text Conversion Based on Fine-Tuned Whisper-Small Model.
Astana IT University, 2026.