License: apache-2.0
Model Description
- Model name: BLI ASR 0
- Task: Automatic Speech Recognition
- Language: Lingala
- Base model: openai/whisper-large-v3
- Adaptation method: LoRA / PEFT
- Training dataset: Waxal Lingala ASR
- Output: Lingala transcription from speech audio
This model transcribes Lingala speech into text. It is not a translation model.
Dataset
The model was trained on the Waxal Lingala ASR dataset.
The dataset was split into:
| Split | Approx. number of samples | Usage |
|---|---|---|
| Train | 14,400 | Model training |
| Validation | 1,844 | Validation during development |
| Test | 1,866 | Final held-out evaluation |
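As a quick check on the table above, the split proportions follow directly from the reported counts. This is only arithmetic on the approximate numbers in the table; the variable names are illustrative.

```python
# Approximate sample counts copied from the split table above.
split_sizes = {"train": 14_400, "validation": 1_844, "test": 1_866}

total = sum(split_sizes.values())
proportions = {name: count / total for name, count in split_sizes.items()}

for name, share in proportions.items():
    print(f"{name}: {split_sizes[name]} samples ({share:.1%})")
```

The counts work out to roughly an 80/10/10 split.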
Text Post-processing
We applied a light normalization pipeline to the training and evaluation transcriptions.
The goal was not to impose a strict Lingala orthography, but to reduce noise and improve consistency. The post-processing included:
- Unicode normalization
- lowercasing
- whitespace normalization
- punctuation and symbol cleanup
- preservation of the original raw transcription when available
- creation of a normalized transcription field used for training/evaluation
We intentionally avoided aggressive spelling correction because Lingala has substantial orthographic variation across speakers, regions, and data sources.
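The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for this release: the function name `normalize_transcript`, the choice of NFC as the Unicode form, the decision to keep apostrophes, and the field names `raw` and `normalized` are all assumptions.

```python
import re
import unicodedata


def normalize_transcript(raw: str) -> str:
    """Light normalization: Unicode form, case, symbols, whitespace."""
    # Unicode normalization (NFC is an assumption; the card does not name the form).
    text = unicodedata.normalize("NFC", raw)
    text = text.lower()
    # Replace punctuation and symbols with spaces. Letters, digits, combining
    # marks (tone diacritics), apostrophes, and whitespace are kept.
    kept = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalnum() or is_mark or ch == "'" or ch.isspace():
            kept.append(ch)
        else:
            kept.append(" ")
    text = "".join(kept)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# Keep the raw transcription next to the normalized one (hypothetical field names).
raw = "  Mbote,  NA Bino! "
record = {"raw": raw, "normalized": normalize_transcript(raw)}
print(record)
```

Note that nothing here touches spelling: words are never rewritten, only cased and stripped of surrounding symbols, which matches the intent of avoiding aggressive correction.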
Training Details
The model was fine-tuned from openai/whisper-large-v3 using LoRA.
Main training choices:
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA |
| Task token | transcribe |
| Language token | Lingala |
| Precision | bf16 |
| Optimizer | AdamW |
| Evaluation strategy | small random validation subsets during training |
| Final evaluation | full validation/test split |
| Dataset | Waxal Lingala ASR |
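The "small random validation subsets" strategy in the table could look something like the sketch below. The helper name `sample_eval_subset`, the subset size of 200, and the use of the evaluation step as a seed are assumptions made for illustration; the card does not state how the subsets were drawn.

```python
import random


def sample_eval_subset(num_examples, subset_size, seed):
    """Return sorted indices of a random subset drawn without replacement."""
    rng = random.Random(seed)
    subset_size = min(subset_size, num_examples)
    return sorted(rng.sample(range(num_examples), subset_size))


VALIDATION_SIZE = 1_844  # approximate count from the split table
SUBSET_SIZE = 200        # hypothetical; the size actually used is not stated

# Draw a different, reproducible subset at each evaluation step.
for eval_step in range(3):
    indices = sample_eval_subset(VALIDATION_SIZE, SUBSET_SIZE, seed=eval_step)
    print(eval_step, len(indices), indices[:5])
```

Evaluating on a subset keeps intermediate evaluation cheap, at the cost of noisier scores, which is presumably why the final numbers are computed on the full splits.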
Performance
We report CER rather than WER for this release.
| Metric | Value |
|---|---|
| CER normalized | 0.1703 |
We do not report WER in this first release because it is not a fair measure in the current Lingala ASR setting. Our data does not follow a single, widely enforced normalized orthography, and WER heavily penalizes spelling variants, segmentation differences, and silence-related insertions and deletions. We plan to release a corrected WER metric that better accounts for linguistic and contextual variation.
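To make the difference between the two metrics concrete, here is a small self-contained sketch of CER and WER built on Levenshtein edit distance. It is not the evaluation code used for the reported score, and the card does not say whether the 0.1703 figure is aggregated per utterance or over the whole corpus. The example strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# A single segmentation difference: one extra space in the hypothesis.
reference = "nazali malamu"
hypothesis = "na zali malamu"
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"WER = {wer(reference, hypothesis):.3f}")
```

In this example a single inserted space costs one character edit out of 13 under CER, but two word edits out of two reference words under WER, which is the kind of penalty for segmentation variants described above.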
Intended Use
This model can be used for:
- Lingala speech transcription
- research on low-resource ASR
- dataset bootstrapping
- assisted transcription before human correction
- evaluation of ASR pipelines for Bantu languages
The model is especially useful as a first-pass transcription model before review by human annotators.
Limitations
This is an early release and still has important limitations:
- silence handling still needs improvement
- the model may hallucinate text during long silent regions
- performance can degrade with music, jingles, intros, outros, and strong background noise
- performance in real-world media with overlapping speech is still limited
- the training data is not general enough to cover all common Lingala varieties
- the model may struggle with recent slang, popular urban expressions, and code-switching
- the model is not yet robust across all domains such as news, sermons, informal conversation, street interviews, and music-heavy content
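Since hallucination during long silences is one of the limitations listed above, one possible workaround is to locate long low-energy regions before transcription and skip them. The sketch below is an assumption-laden illustration, not part of this model or its training pipeline: the function name `find_silent_regions`, the 30 ms frame size, the RMS threshold of 0.01, and the 2-second minimum duration are all hypothetical values that would need tuning on real recordings.

```python
import math


def find_silent_regions(samples, sample_rate=16000, frame_ms=30,
                        rms_threshold=0.01, min_silence_s=2.0):
    """Return (start_s, end_s) pairs for low-energy runs of at least min_silence_s."""
    frame_len = int(sample_rate * frame_ms / 1000)
    num_frames = len(samples) // frame_len
    regions = []
    run_start = None  # index of the first frame in the current silent run

    # The extra iteration (k == num_frames) flushes a silent run at the end.
    for k in range(num_frames + 1):
        if k < num_frames:
            frame = samples[k * frame_len:(k + 1) * frame_len]
            rms = math.sqrt(sum(x * x for x in frame) / frame_len)
            silent = rms < rms_threshold
        else:
            silent = False
        if silent and run_start is None:
            run_start = k
        elif not silent and run_start is not None:
            start_s = run_start * frame_len / sample_rate
            end_s = k * frame_len / sample_rate
            if end_s - start_s >= min_silence_s:
                regions.append((start_s, end_s))
            run_start = None
    return regions


# Synthetic example: 3 seconds of digital silence followed by 1 second of signal.
samples = [0.0] * 48000 + [0.5] * 16000
print(find_silent_regions(samples))
```

A fixed energy threshold will not separate speech from music or background noise, so this only addresses the silence case, not the other limitations in the list.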
Example Inference in a Notebook
```python
!pip install -U transformers peft accelerate soundfile librosa

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import librosa

base_model = "openai/whisper-large-v3"
adapter_model = "BantuLanguagesInitiative/bli-asr-0"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

processor = WhisperProcessor.from_pretrained(
    base_model,
    language="lingala",
    task="transcribe",
)

model = WhisperForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=dtype,
)
model = PeftModel.from_pretrained(model, adapter_model)
model = model.merge_and_unload()
model.to(device)
model.eval()

audio_path = "example.mp3"
audio, sr = librosa.load(audio_path, sr=16000)

inputs = processor.feature_extractor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
)
input_features = inputs.input_features.to(device=device, dtype=dtype)

forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="lingala",
    task="transcribe",
)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids,
        max_new_tokens=225,
    )

text = processor.tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(text)
```
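The Whisper feature extractor pads or truncates its input to a 30-second window by default, so the example above transcribes only the first 30 seconds of a longer recording. One simple way to handle longer audio is to cut it into fixed windows and transcribe each in turn. The helper below, `chunk_bounds`, is a hypothetical name and a naive sketch: fixed cuts can split words at chunk boundaries, and it makes no attempt to overlap or align chunks.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0):
    """Return (start, end) sample indices of consecutive fixed-length chunks."""
    chunk_len = int(sample_rate * chunk_s)
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 70-second recording at 16 kHz yields two full chunks and one partial chunk.
for start, end in chunk_bounds(16000 * 70):
    print(start, end, (end - start) / 16000)
```

Each `(start, end)` pair can then be used to slice the loaded audio array (for example `audio[start:end]`) before passing it through the feature extractor and `generate` as in the example above, with the per-chunk texts joined afterwards.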
Debug
If you get a PEFT/torchao version error in Colab, run:
```python
!pip install -U torchao
```
Model provider
BantuLanguagesInitiative