Usage
```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Zilai2999/whisper-small.en-gumbel-beard",
)
print(asr("audio.wav")["text"])
```
Or with the model/processor directly:
```python
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Zilai2999/whisper-small.en-gumbel-beard"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# librosa resamples to 16 kHz and downmixes to mono on load.
audio, _ = librosa.load("audio.wav", sr=16000)
# Convert the waveform to the log-mel features Whisper consumes.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

ids = model.generate(inputs.input_features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
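The model expects 16 kHz mono audio. Resample anything recorded at another rate before passing it to the processor, as `librosa.load(..., sr=16000)` does in the example above.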
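If you want to know whether a file needs resampling before loading it, a minimal sketch using only the standard library could look like the following. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this repository, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave

TARGET_SR = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_sr: int = TARGET_SR) -> bool:
    """Return True if the file's native rate differs from the expected rate."""
    return wav_sample_rate(path) != target_sr
```

When `needs_resampling("audio.wav")` returns `True`, loading the file with `librosa.load("audio.wav", sr=16000)` resamples it to the expected rate.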
Training
- Self-supervised Gumbel-BEARD adaptation of the Whisper encoder (BEST-RQ +
self-distillation with automatic Gumbel-softmax layer selection).
- Supervised ASR fine-tuning on the MyST children's speech corpus.
See the code repository for the full recipe and hyper-parameters.
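For readers unfamiliar with Gumbel-softmax, the sketch below shows the generic trick for drawing a differentiable, near-one-hot sample over a set of candidates such as encoder layers. It is not the recipe from the code repository: the function name, the one-logit-per-layer setup, and the temperature value are assumptions made for illustration only.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed sample from a categorical distribution over `logits`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical usage: one learnable logit per encoder layer (all equal here).
layer_logits = [0.0] * 12
weights = gumbel_softmax(layer_logits, temperature=0.5, rng=random.Random(0))
selected_layer = max(range(len(weights)), key=weights.__getitem__)
```

Lower temperatures push the weights toward a one-hot choice, while higher temperatures spread them more evenly across candidates.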
Results
| Test set | WER |
|---|---|
| MyST | 8.5% |
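WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that standard definition for a single utterance. It is not the evaluation script used for the number above, and it applies no text normalization (casing, punctuation), since the scoring setup is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, rather than by averaging per-utterance rates.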
Citation
```bibtex
@inproceedings{gumbelbeard2026,
  title     = {Gumbel-BEARD: Automatic Layer Selection for Self-Supervised
               Adaptation of Whisper in Low-Resource Domains},
  booktitle = {Proc. Interspeech},
  year      = {2026},
}
```