
License: apache-2.0

Model Description

  • Model name: BLI ASR 0
  • Task: Automatic Speech Recognition
  • Language: Lingala
  • Base model: openai/whisper-large-v3
  • Adaptation method: LoRA / PEFT
  • Training dataset: Waxal Lingala ASR
  • Output: Lingala transcription from speech audio

This model transcribes Lingala speech into text. It is not a translation model.

Dataset

The model was trained on the Waxal Lingala ASR dataset.

The dataset was split into:

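| Split      | Approx. number of samples | Usage                         |
|------------|---------------------------|-------------------------------|
| Train      | 14,400                    | Model training                |
| Validation | 1,844                     | Validation during development |
| Test       | 1,866                     | Final held-out evaluation     |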

Text Post-processing

We applied a light normalization pipeline to the training and evaluation transcriptions.

The goal was not to impose a strict Lingala orthography, but to reduce noise and improve consistency. The post-processing included:

  • Unicode normalization
  • lowercasing
  • whitespace normalization
  • punctuation and symbol cleanup
  • preservation of the original raw transcription when available
  • creation of a normalized transcription field used for training/evaluation

We intentionally avoided aggressive spelling correction because Lingala has substantial orthographic variation across speakers, regions, and data sources.
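The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the exact pipeline used for this release: the choice of NFC as the Unicode form, the rule of replacing every punctuation or symbol character with a space, and the field names `transcription` and `transcription_normalized` are all assumptions.

```python
import re
import unicodedata


def normalize_transcription(raw: str) -> str:
    """Lightly normalize a transcription without correcting spelling."""
    # Unicode normalization (NFC is an assumption; the card does not name the form).
    text = unicodedata.normalize("NFC", raw)
    # Lowercasing.
    text = text.lower()
    # Punctuation and symbol cleanup: replace characters in Unicode categories
    # P* (punctuation) and S* (symbols) with a space. Letters, digits, and
    # combining tone marks (category M*) are left untouched.
    text = "".join(
        " " if unicodedata.category(ch)[0] in {"P", "S"} else ch
        for ch in text
    )
    # Whitespace normalization: collapse runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def add_normalized_field(example: dict) -> dict:
    """Keep the raw transcription and add a normalized field next to it."""
    # Both field names are hypothetical, chosen only for this example.
    return {
        **example,
        "transcription_normalized": normalize_transcription(example["transcription"]),
    }
```

Because the function never rewrites letters, spelling variants of the same word survive normalization unchanged, which matches the decision to avoid aggressive spelling correction.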

Training Details

The model was fine-tuned from openai/whisper-large-v3 using LoRA.

Main training choices:

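| Parameter           | Value                                           |
|---------------------|-------------------------------------------------|
| Base model          | openai/whisper-large-v3                         |
| Fine-tuning method  | LoRA                                            |
| Task token          | transcribe                                      |
| Language token      | Lingala                                         |
| Precision           | bf16                                            |
| Optimizer           | AdamW                                           |
| Evaluation strategy | small random validation subsets during training |
| Final evaluation    | full validation/test split                      |
| Dataset             | Waxal Lingala ASR                               |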
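The "small random validation subsets during training" strategy could look roughly like the sketch below. It only shows how a subset of validation indices might be drawn for one quick evaluation pass. The helper name `sample_eval_indices`, the subset size of 200, and the idea of using the training step as the seed are illustrative assumptions, not details from the actual training run.

```python
import random


def sample_eval_indices(num_examples: int, subset_size: int, seed: int) -> list[int]:
    """Pick a random subset of validation indices for one quick evaluation pass."""
    rng = random.Random(seed)  # seeded so a given evaluation pass is reproducible
    size = min(subset_size, num_examples)  # never ask for more than is available
    return sorted(rng.sample(range(num_examples), size))


# 1,844 is the approximate validation size given in the Dataset section.
# The subset size and the seed value are assumptions made for this example.
indices = sample_eval_indices(num_examples=1844, subset_size=200, seed=1000)
```

Scoring a small subset keeps evaluation cheap during training, at the cost of noisier estimates. That is presumably why the final numbers are computed on the full validation and test splits.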

Performance

We report CER rather than WER for this release.

| Metric         | Value  |
|----------------|--------|
| CER normalized | 0.1703 |

We do not report WER in this first release because WER is not fully fair for the current Lingala ASR setting. Lingala does not yet have a single widely enforced normalized orthography in our data, and WER strongly penalizes spelling variants, segmentation differences, and silence-related insertions/deletions. We plan to release a corrected WER metric that better accounts for linguistic and contextual variation.
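To make the difference between the two metrics concrete, here is a minimal sketch of per-utterance CER and WER built on Levenshtein edit distance. It is not the evaluation code used for the reported score, which may aggregate over the whole corpus or use a library. The reference and hypothesis strings are an invented example of a segmentation difference.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: the hypothesis differs only by one missing space.
reference = "na ngai"
hypothesis = "nangai"
print(cer(reference, hypothesis))  # 1 character edit over 7 reference characters
print(wer(reference, hypothesis))  # 2 word edits over 2 reference words
```

A single missing space costs one character edit out of seven under CER, but it counts as two word errors out of two under WER. This is the kind of penalty on segmentation differences described above.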

Intended Use

This model can be used for:

  • Lingala speech transcription
  • research on low-resource ASR
  • dataset bootstrapping
  • assisted transcription before human correction
  • evaluation of ASR pipelines for Bantu languages

The model is especially useful as a first-pass transcription model before review by human annotators.

Limitations

This is an early release and still has important limitations:

  • silence handling still needs improvement
  • the model may hallucinate text during long silent regions
  • performance can degrade with music, jingles, intros, outros, and strong background noise
  • performance in real-world media with overlapping speech is still limited
  • the training data is not general enough to cover all common Lingala varieties
  • the model may struggle with recent slang, popular urban expressions, and code-switching
  • the model is not yet robust across all domains such as news, sermons, informal conversation, street interviews, and music-heavy content

Example Inference in a Notebook

python

# Notebook cell: install dependencies (the leading "!" is notebook shell syntax).
!pip install -U transformers peft accelerate soundfile librosa

import librosa
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

base_model = "openai/whisper-large-v3"
adapter_model = "BantuLanguagesInitiative/bli-asr-0"

# Use bf16 on GPU and fall back to fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

processor = WhisperProcessor.from_pretrained(
    base_model,
    language="lingala",
    task="transcribe",
)

# Load the base model, attach the LoRA adapter, then merge it into the base weights.
model = WhisperForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=dtype,
)
model = PeftModel.from_pretrained(model, adapter_model)
model = model.merge_and_unload()
model.to(device)
model.eval()

# Load the audio as mono, resampled to 16 kHz, the rate Whisper expects.
audio_path = "example.mp3"
audio, sr = librosa.load(audio_path, sr=16000)

inputs = processor.feature_extractor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
)
input_features = inputs.input_features.to(device=device, dtype=dtype)

# Force the decoder to transcribe in Lingala.
# Recent transformers releases also accept language= and task= directly in generate().
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="lingala",
    task="transcribe",
)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids,
        max_new_tokens=225,
    )

text = processor.tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(text)
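By default, Whisper's feature extractor pads or truncates its input to a 30-second window, so the example above transcribes at most the first 30 seconds of a file. One simple way to handle longer recordings is to split the waveform into fixed windows and run the same steps on each one. The helper below is a sketch under that assumption. The name `chunk_bounds` is hypothetical, and fixed windows can cut words at chunk boundaries.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000, chunk_seconds: float = 30.0):
    """Return (start, end) sample indices that cover the audio in fixed-size windows."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]
```

With the variables from the example, each `(start, end)` pair from `chunk_bounds(len(audio))` selects a slice `audio[start:end]` that can be passed through the feature extractor and `model.generate` in turn, and the decoded pieces can then be joined with spaces.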

Debug

If you get a PEFT/torchao version error in Colab, run:

python

!pip install -U torchao

Model provider

BantuLanguagesInitiative
