Zilai2999/whisper-small.en-gumbel-beard API & Inference Endpoint

Usage

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Zilai2999/whisper-small.en-gumbel-beard",
)
print(asr("audio.wav")["text"])
```

Or with the model/processor directly:

```python
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Zilai2999/whisper-small.en-gumbel-beard"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load the file as mono float audio, resampled to 16 kHz.
audio, _ = librosa.load("audio.wav", sr=16000)
# Convert the waveform to the input features the encoder expects.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# Generate token IDs, then decode them to text.
ids = model.generate(inputs.input_features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```

Audio is expected at 16 kHz.
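When audio is loaded by some other route than `librosa.load(..., sr=16000)`, it can help to confirm the file's sample rate first. The following is a minimal sketch using only the standard library; it assumes an uncompressed PCM WAV file, and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of this model or of `transformers`.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, the rate this model card states

def wav_sample_rate(path):
    """Return the sample rate (Hz) recorded in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate

# Usage (assuming a local file named audio.wav):
#   if needs_resampling("audio.wav"):
#       ...resample before calling the processor...
```

If `needs_resampling` returns `True`, loading the file with `librosa.load(path, sr=16000)` as in the example above resamples it on the fly.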

Training

1. Self-supervised Gumbel-BEARD adaptation of the Whisper encoder (BEST-RQ + self-distillation with automatic Gumbel-softmax layer selection).
2. Supervised ASR fine-tuning on the MyST children's speech corpus.

See the code repository for the full recipe and hyper-parameters.
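This card does not describe how the Gumbel-softmax layer selection is implemented, so the following is only a generic illustration of the Gumbel-softmax trick applied to a set of per-layer scores, not the authors' method. The function name `gumbel_softmax`, the `layer_logits` values, the temperature, and the choice of 12 candidate layers are all assumptions made for the example.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample over the entries of `logits`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical setup: one score per candidate encoder layer (12 is an assumption).
layer_logits = [0.0] * 12
weights = gumbel_softmax(layer_logits, temperature=0.5, rng=random.Random(0))
selected_layer = max(range(len(weights)), key=weights.__getitem__)
```

Lower temperatures push `weights` toward a one-hot vector, so the soft weighting approaches a hard choice of a single layer. In a real training setup the logits would be learnable parameters updated by gradient descent, which this plain-Python sketch does not attempt.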

Results

| Test set | WER  |
|----------|------|
| MyST     | 8.5% |
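For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below implements that standard definition with a word-level edit distance. It is not the evaluation script behind the figure above: the card does not say what text normalization was applied, and this hypothetical `word_error_rate` helper applies none beyond whitespace splitting.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.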

Citation

```bibtex
@inproceedings{gumbelbeard2026,
  title     = {Gumbel-BEARD: Automatic Layer Selection for Self-Supervised
               Adaptation of Whisper in Low-Resource Domains},
  booktitle = {Proc. Interspeech},
  year      = {2026},
}
```

