KRASR

kazakh-russian-asr-whisper-small-full-ft

Model Details

| Field | Value |
| --- | --- |
| Base model | `openai/whisper-small` |
| Adaptation method | Full fine-tuning |
| Model type | Whisper encoder-decoder ASR model |
| Task | Automatic Speech Recognition / Speech-to-Text |
| Languages | Kazakh, Russian, Kazakh-Russian mixed speech |
| Repository type | Full fine-tuned model |
| Project context | Academic thesis research |

Recommended Use

This model is mainly intended for:

academic ASR research;
Kazakh speech recognition experiments;
Kazakh-Russian mixed-speech transcription;
code-switching ASR evaluation;
comparison with Whisper-Small LoRA;
reproducibility of thesis experiments;
demonstration in speech-to-text applications.

This model is useful when you want to test a fully updated Whisper-Small checkpoint rather than a parameter-efficient LoRA adapter.

For stronger recognition quality, see the larger comparative models in the KRASR collection, especially KRASR/kazakh-russian-asr-whisper-large-v3-lora.

Quick Start

Install the required libraries:

```bash
pip install -U transformers accelerate torch librosa soundfile evaluate tqdm
```

Simple pipeline inference

```python
import torch
from transformers import pipeline

model_id = "KRASR/kazakh-russian-asr-whisper-small-full-ft"

# Use the first GPU with half precision when available, otherwise CPU with float32.
device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The pipeline downloads the model, tokenizer, and feature extractor by ID.
asr = pipeline(
    task="automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
)

# Force Kazakh transcription with a small beam and mild repetition controls.
result = asr(
    "audio.wav",
    generate_kwargs={
        "task": "transcribe",
        "language": "kazakh",
        "num_beams": 3,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.12,
    },
)

print(result["text"])
```

Manual Loading Example

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

model_id = "KRASR/kazakh-russian-asr-whisper-small-full-ft"
audio_path = "audio.wav"

# Device string for the model, device index for the pipeline.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline_device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The processor bundles the tokenizer and the feature extractor.
processor = WhisperProcessor.from_pretrained(model_id)

model = WhisperForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)

model.to(device)
model.eval()  # inference mode: disables dropout

# Build the pipeline from the already loaded components.
asr = pipeline(
    task="automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=pipeline_device,
)

# Load the audio as a mono float array resampled to 16 kHz.
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

result = asr(
    {"array": audio, "sampling_rate": sr},
    generate_kwargs={
        "task": "transcribe",
        "language": "kazakh",
        "num_beams": 3,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.12,
    },
)

print(result["text"])
```

Decoding Notes

For normal testing, avoid using a fixed max_new_tokens value such as 96.
A fixed limit can accidentally cut off longer transcriptions.

A good starting point for Kazakh-dominant mixed speech is:

```python
generate_kwargs = {
    "task": "transcribe",
    "language": "kazakh",
    "num_beams": 3,
    "no_repeat_ngram_size": 4,
    "repetition_penalty": 1.12,
}
```

Why forced Kazakh decoding?

In mixed Kazakh-Russian speech, automatic language detection can be unstable, especially on short utterances.
If the audio is mostly Kazakh with Russian insertions, forcing Kazakh decoding usually keeps the transcription closer to the target speech domain.

Russian words may still appear in the output when the model recognizes them from the audio.

Optional dynamic token limit for apps or batch evaluation

For applications or controlled batch evaluation, a dynamic output limit can be useful.
This follows the same idea as the KRASR demo module: short clips receive a smaller output limit, while longer clips receive a larger one.

```python
def build_generate_kwargs(audio_duration_sec, language="kazakh", num_beams=3):
    """Return generation settings with an output limit scaled to clip length."""
    # Pick max_new_tokens from the clip duration in seconds.
    if audio_duration_sec is None:
        max_new_tokens = 96  # fallback when the duration is unknown
    elif audio_duration_sec <= 5:
        max_new_tokens = 40
    elif audio_duration_sec <= 10:
        max_new_tokens = 64
    elif audio_duration_sec <= 15:
        max_new_tokens = 80
    elif audio_duration_sec <= 20:
        max_new_tokens = 96
    elif audio_duration_sec <= 25:
        max_new_tokens = 112
    else:
        max_new_tokens = 128

    generate_kwargs = {
        "task": "transcribe",
        "num_beams": int(num_beams),
        "max_new_tokens": max_new_tokens,
        "no_repeat_ngram_size": 4,
        "repetition_penalty": 1.12,
    }

    # Pass language=None to leave language detection to the model.
    if language is not None:
        generate_kwargs["language"] = language

    return generate_kwargs
```

Use this only when you specifically need output-length control.
For ordinary one-file testing, starting without max_new_tokens is usually simpler.
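
As a minimal sketch, the same duration thresholds can also be written as a lookup table, which makes them easier to adjust in batch evaluation. The helper names `clip_duration_sec` and `token_limit_for` are illustrative and are not part of the KRASR demo module; the threshold values are copied from `build_generate_kwargs` above.

```python
from bisect import bisect_left

# Upper duration bounds (seconds) and the matching max_new_tokens values,
# copied from build_generate_kwargs above.
DURATION_BOUNDS = [5, 10, 15, 20, 25]
TOKEN_LIMITS = [40, 64, 80, 96, 112, 128]  # last entry covers clips over 25 s
DEFAULT_LIMIT = 96  # used when the duration is unknown


def clip_duration_sec(num_samples, sampling_rate=16000):
    """Duration of a clip in seconds, from its sample count."""
    return num_samples / float(sampling_rate)


def token_limit_for(duration_sec):
    """Hypothetical helper: map a clip duration to an output token limit."""
    if duration_sec is None:
        return DEFAULT_LIMIT
    # bisect_left returns the index of the first bound >= duration_sec,
    # which reproduces the "<=" comparisons of the if/elif chain.
    return TOKEN_LIMITS[bisect_left(DURATION_BOUNDS, duration_sec)]
```

With audio loaded at 16 kHz, `token_limit_for(clip_duration_sec(len(audio)))` would give the limit for that clip.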

Training Configuration

| Parameter | Value |
| --- | --- |
| Base model | `openai/whisper-small` |
| Training manifest | `train_all.jsonl` |
| Validation manifest | `val_all.jsonl` |
| Sampling rate | 16 kHz |
| Maximum audio duration | 30 seconds |
| Number of epochs | 8 |
| Learning rate | `1e-5` |

Full fine-tuning updates all Whisper-Small parameters. This gives the model more freedom to adapt to the target dataset, but it requires more compute than LoRA adaptation.
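
The manifests listed above are JSON Lines files, and clips are capped at 30 seconds. The sketch below shows one way such a manifest could be read and filtered. The record keys `audio_path`, `text`, and `duration` are assumptions made for illustration; the actual manifest schema is not documented here.

```python
import json

MAX_DURATION_SEC = 30.0  # "Maximum audio duration" from the table above


def load_manifest(lines, max_duration_sec=MAX_DURATION_SEC):
    """Parse JSON Lines records and keep clips within the duration limit.

    `lines` is any iterable of strings, for example an open file handle.
    The keys "audio_path", "text", and "duration" are hypothetical.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record["duration"] <= max_duration_sec:
            kept.append(record)
    return kept


# Two in-memory records stand in for the contents of train_all.jsonl.
sample = [
    '{"audio_path": "a.wav", "text": "сәлем", "duration": 2.5}',
    '{"audio_path": "b.wav", "text": "привет", "duration": 31.0}',
]
records = load_manifest(sample)
print(len(records))
```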

Evaluation Results

Main mixed-speech evaluation

| Model | Decoding | WER | CER | MER | WIL | Hyp/ref | Possible hallucination-like cases |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Whisper-Small baseline | Greedy | 1.0852 | 0.8256 | 0.9254 | - | 0.8780 | - |
| Whisper-Small LoRA | Duration-based token limit + Beam3 | 0.5626 | 0.3308 | | | | |

The full fine-tuned model achieved slightly better WER and CER than Whisper-Small LoRA on the main Test-MIXED set, but the difference was small.
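
For readers less familiar with these metrics, the sketch below computes WER and CER from a plain Levenshtein edit distance. It is not the evaluation script used for the results above. The `hyp_ref_ratio` function assumes that Hyp/ref means hypothesis word count divided by reference word count, which the source does not state explicitly. References are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def hyp_ref_ratio(reference, hypothesis):
    """Assumed definition: hypothesis word count over reference word count."""
    return len(hypothesis.split()) / len(reference.split())
```

Because insertions count as errors, WER for a single utterance can exceed 1.0 when the hypothesis contains many extra words.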

Effect of beam search on Whisper-Small Full FT

| Decoding | WER | CER | Insertions | Hyp/ref | Possible hallucination-like cases |
| --- | --- | --- | --- | --- | --- |
| Greedy | 0.5828 | 0.3311 | 1493 | 0.9073 | 11 |
| Beam3 | 0.5581 | 0.3219 | 1204 | 0.8882 | 5 |
| Beam5 | 0.5551 | | | | |

Beam search improved the full fine-tuned model compared with greedy decoding. Beam5 gave the best observed Whisper-Small Full FT result on Test-MIXED, while Beam3 was used in the main comparison for consistency.
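
To put the greedy and Beam3 rows in relative terms, the short calculation below uses only the values reported in the table above. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before, after):
    """Fractional reduction when a value drops from `before` to `after`."""
    return (before - after) / before


# Values copied from the Greedy and Beam3 rows of the table above.
greedy_wer, beam3_wer = 0.5828, 0.5581
greedy_ins, beam3_ins = 1493, 1204

wer_drop = relative_reduction(greedy_wer, beam3_wer)
ins_drop = relative_reduction(greedy_ins, beam3_ins)
print(f"WER: {wer_drop:.1%}, insertions: {ins_drop:.1%}")
```

The relative drop in insertions is several times larger than the relative drop in WER, which fits the lower count of possible hallucination-like cases under Beam3.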

Pure-language and external benchmark behavior

| Evaluation set | WER |
| --- | --- |
| Test-KK | 0.5393 |
| Test-RU | 0.7192 |
| FLEURS-KK | 0.7609 |
| FLEURS-RU | 0.7419 |

The model improved internal Kazakh recognition compared with the original Whisper-Small baseline, but Russian-only recognition became less stable after adaptation to Kazakh-dominant mixed speech.

Training Data

The model was fine-tuned using the KRASR/kazakh-russian-asr-dataset, prepared for Kazakh and Kazakh-Russian mixed-speech ASR experiments.

The dataset preparation workflow included:

source selection;
audio segmentation;
transcription review;
text normalization;
train/validation/test split preparation;
evaluation setup for mixed-language ASR.

The dataset was prepared for speech recognition only. It was not designed for speaker identification, biometric analysis, or demographic classification.

Preprocessing

Audio and text were prepared using a consistent ASR preprocessing pipeline.

Audio preprocessing:

mono audio;
16 kHz sampling rate;
short-segment ASR setting.

Text normalization included:

lowercasing;
whitespace normalization;
punctuation cleanup;
preservation of Kazakh-specific letters;
preservation of Russian words in mixed utterances;
removal of formatting noise that does not affect transcription meaning.
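
The normalization steps listed above could look roughly like the following sketch. It is an approximation written for illustration, not the project's actual normalization code; in particular, treating every non-word, non-space character as punctuation is an assumption.

```python
import re
import unicodedata

# Anything that is not a letter, digit, underscore, or whitespace.
# In Python 3, \w is Unicode-aware, so Kazakh and Russian letters are kept.
_PUNCT = re.compile(r"[^\w\s]")


def normalize_text(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)  # compose letters such as "й"
    text = text.lower()                        # also lowercases Ә, Ғ, Қ, Ң, Ө, Ұ, Ү, Һ, І
    text = _PUNCT.sub(" ", text)               # punctuation cleanup
    return " ".join(text.split())              # whitespace normalization
```

Applying the same function to both references and hypotheses before scoring keeps WER and CER comparable across models.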

Known Limitations

The model may make errors on:

very short audio clips;
noisy recordings;
overlapping speech;
informal conversational speech;
rare names, places, and domain-specific terms;
long Russian segments inside Kazakh-dominant speech;
silent or low-quality audio.

Like other Whisper-based models, it may sometimes produce extra words or hallucinated text, especially when the input audio is too short, unclear, or contains long silence.
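
One possible mitigation is to skip clips that are very short or nearly silent before sending them to the model. The sketch below is a hypothetical guard, not part of the KRASR code; the function names and both thresholds are assumptions that would need tuning on real recordings.

```python
import math


def rms_level(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    total = 0.0
    count = 0
    for s in samples:
        total += float(s) * float(s)
        count += 1
    return math.sqrt(total / count) if count else 0.0


def should_skip(samples, sampling_rate=16000, min_duration_sec=0.5, rms_threshold=0.005):
    """Return True for clips that are very short or nearly silent.

    Both thresholds are illustrative assumptions, not values from the project.
    """
    duration = len(samples) / float(sampling_rate)
    return duration < min_duration_sec or rms_level(samples) < rms_threshold
```

The function accepts any sequence of floats in the range -1.0 to 1.0, including the array returned by `librosa.load`.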

The model is stronger than the original Whisper-Small baseline on the main mixed-speech evaluation set, but it remains weaker than other models in the KRASR comparison, such as Whisper Large-v3 LoRA and XLS-R 1B CTC.

Out-of-Scope Use

This model is not intended for:

speaker identification;
biometric profiling;
demographic classification;
surveillance or tracking of individuals;
high-stakes decision-making systems;
production deployment without additional validation.

Project Context

KRASR was created as part of an academic thesis project on automatic Kazakh speech-to-text conversion using fine-tuned multilingual ASR models.

The project compares Whisper-Small baseline, Whisper-Small LoRA, Whisper-Small full fine-tuning, XLS-R 1B CTC, and Whisper Large-v3 LoRA on Kazakh, Russian, and Kazakh-Russian mixed speech.

Related Repositories

KRASR/kazakh-russian-asr-dataset
KRASR/kazakh-russian-asr-whisper-small-lora
KRASR/kazakh-russian-asr-whisper-large-v3-lora
KRASR/kazakh-russian-asr-xls-r-1b-ctc
KRASR/kazakh-russian-speech-to-text-module

Citation

There is no formal publication for this model yet.

If you use this model or dataset in academic work, please cite or mention the KRASR Hugging Face repository and the related thesis project:

```bibtex
@misc{krasr_whisper_small_full_ft_2026,
  title        = {Kazakh-Russian ASR Whisper Small Full Fine-Tuning},
  author       = {Mukhambet, Madiyar and Makhmud, Danial},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/KRASR/kazakh-russian-asr-whisper-small-full-ft}},
  note         = {Fully fine-tuned Whisper-Small model for Kazakh and Kazakh-Russian mixed-speech ASR}
}
```

Related thesis project:

Madiyar Mukhambet and Danial Makhmud.
Development of a Software Module for Automatic Kazakh Speech-to-Text Conversion Based on Fine-Tuned Whisper-Small Model.
Astana IT University, 2026.
