pradachan/whisper-large-v3-turbo-disfluency-lora

Results (DisfluencySpeech test split)

Unless a column header says otherwise, every WER figure below is whisper_norm WER-C, scored against transcript_c (the fluent reference) on the DisfluencySpeech test split. The benchmark is single-speaker and acted, with N=250 and a 95% CI of roughly ±1pp.

model	WER-C (whisper_norm)	WER-C (punct_strip)
whisper-large-v3-turbo (vanilla)	9.4%	10.4%
whisper + v1 disfluency LoRA	8.9%	9.5%
this adapter	3.4%	3.3%

Against vanilla, the adapter improves whisper_norm WER-C by 6.0pp (95% bootstrap CI [+5.0, +7.0]); against the earlier v1 adapter, by 5.5pp (95% bootstrap CI [+4.4, +6.6]). The checkpoint was selected on the validation split. The test split was scored exactly once, after selection.
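
The intervals above are bootstrap CIs on a WER difference. As a minimal sketch of how such an interval can be computed, the block below runs a paired percentile bootstrap over utterances. It assumes per-utterance error counts and reference lengths are already available; the function name, the resample count, and the percentile method are illustrative assumptions, not necessarily the procedure used for the numbers in this card.

```python
import random

def paired_bootstrap_wer_diff(errs_a, errs_b, ref_lens, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for WER(a) - WER(b), resampling utterances.

    errs_a, errs_b: per-utterance word-error counts for systems a and b
    ref_lens: per-utterance reference word counts (shared by both systems)
    Returns (point_estimate, lower, upper) in percentage points.
    """
    n = len(ref_lens)
    assert len(errs_a) == n and len(errs_b) == n and n > 0
    rng = random.Random(seed)

    def diff(indices):
        # Corpus-level WER on the selected utterances, for both systems.
        words = sum(ref_lens[i] for i in indices)
        wer_a = sum(errs_a[i] for i in indices) / words
        wer_b = sum(errs_b[i] for i in indices) / words
        return 100.0 * (wer_a - wer_b)

    point = diff(range(n))
    # Resample utterance indices with replacement; the same indices are used
    # for both systems, which is what makes the bootstrap "paired".
    samples = sorted(
        diff([rng.randrange(n) for _ in range(n)]) for _ in range(n_resamples)
    )
    lower = samples[int(0.025 * (n_resamples - 1))]
    upper = samples[int(0.975 * (n_resamples - 1))]
    return point, lower, upper
```

Pairing matters here because both systems are scored on the same 250 utterances, so per-utterance difficulty cancels out in the difference.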

This appears to be the first published ASR WER on this benchmark's fluent tier. The dataset paper, arXiv:2406.08820, reports no ASR fine-tuning baselines.

One caveat about the metric: whisper_norm applies the OpenAI English text normalizer to both the hypothesis and the reference, and that normalizer deletes "um/uh/hmm/mm". Filled-pause removal is therefore invisible to this number. The measured gains come from discourse markers, repetitions, and self-repairs, not from deleting fillers.
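
To make the caveat concrete, here is a toy illustration. `toy_normalize` is a deliberately simplified stand-in written for this example, not the OpenAI normalizer (which does much more); it only reproduces the one behavior that matters here, dropping the fillers named above from both sides before scoring.

```python
import re

FILLERS = {"um", "uh", "hmm", "mm"}  # the fillers named in the caveat

def toy_normalize(text):
    """Simplified stand-in for whisper_norm: lowercase, drop punctuation, drop fillers."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in FILLERS]

def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def wer(ref_text, hyp_text):
    ref, hyp = toy_normalize(ref_text), toy_normalize(hyp_text)
    return word_errors(ref, hyp) / len(ref)

reference = "I think we should leave on Wednesday"
with_filler = "Um, I think we should, uh, leave on Wednesday."
with_marker = "Well, I think we should leave on on Wednesday."

print(wer(reference, with_filler))  # fillers are normalized away before scoring
print(wer(reference, with_marker))  # "well" and the repeated "on" still count
```

A hypothesis that keeps every filler scores the same as one that removes them, while a leftover discourse marker or repetition is charged as an insertion.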

False-positive behavior (probe suites)

The benchmark cannot measure several failure modes that matter in practice: it has no digit strings and only four immediate repeats. To get at those, we score ~100 probes (25 per category) rendered with Kokoro TTS. These probes are synthetic audio, not human speech, so read them as a rough signal rather than a field measurement.

category	vanilla	v1 LoRA	this adapter
intentional repetition preserved (digits, emphasis)	84%	24%	20%
hedge-as-content kept (precision)	100%	92%	92%
filler hedge removed (recall)	0%	67%	75%
self-repair resolved correctly	4%	60%	40%
proper-noun recall near disfluencies	96%	80%	80%

Known limitations

The low WER comes from aggressive deletion, and that has costs you should account for before deploying:

Intentional repetition is usually lost. The adapter preserves a repeated span only about 20% of the time, so repeated digits, spelled IDs, and phone numbers get corrupted ("zero zero one two" can come back as "zero one two"). Do not run it on numeric dictation without a validation step downstream.
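
One possible shape for that downstream validation step is a digit-count check, sketched below. It assumes the application knows how many digits a field should contain; `spoken_digits`, `check_digit_count`, and the `expected_digits` parameter are hypothetical names for this example. The check only catches a dropped or added digit, not a wrong one.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def spoken_digits(transcript):
    """Collect digits from a transcript, accepting digit words and numerals."""
    digits = []
    for token in transcript.lower().split():
        token = token.strip(".,;:!?")
        if token in DIGIT_WORDS:
            digits.append(DIGIT_WORDS[token])
        elif token.isdigit():
            digits.extend(token)
    return "".join(digits)

def check_digit_count(transcript, expected_digits):
    """Return (ok, digits); ok is False when the digit count is off."""
    digits = spoken_digits(transcript)
    return len(digits) == expected_digits, digits

# A 4-digit field that came back one digit short is flagged for review.
print(check_digit_count("zero one two", expected_digits=4))
```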

Self-repairs are resolved correctly only about 40% of the time. In roughly half of the remaining cases the adapter keeps both the false start and the correction ("Tuesday no wait Wednesday" yields both). Treat any corrected fact in the output as unconfirmed.

Proper nouns next to disfluencies survive about 80% of the time, so occasional name loss is expected.

Usage

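```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(base)
model = WhisperForConditionalGeneration.from_pretrained(base)
# Attach the LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, "pradachan/whisper-large-v3-turbo-disfluency-lora")
model.eval()

def transcribe(audio):
    # audio: 1-D float array, mono, sampled at 16 kHz
    feats = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    # decode with language="en", task="transcribe"; greedy decoding (num_beams=1)
    ids = model.generate(input_features=feats, language="en", task="transcribe", num_beams=1)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```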

Training

The adapter is LoRA with r=16, alpha=32, and rsLoRA enabled, trained on openai/whisper-large-v3-turbo. The target modules (full list in adapter_config.json) are the attention q/v projections, the decoder layers' attention (k/out) and feed-forward (fc1/fc2), and the cross-attention over encoder output (k/out), which in Whisper sits in the decoder.
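
As a quick worked example of what rsLoRA changes: classic LoRA scales the low-rank update by alpha / r, while rank-stabilized LoRA scales it by alpha / sqrt(r). The helper below is a hypothetical name for this illustration; it only shows the arithmetic for the r and alpha values stated above.

```python
import math

def lora_scaling(alpha, r, rslora):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if rslora else alpha / r

r, alpha = 16, 32
print(lora_scaling(alpha, r, rslora=False))  # classic LoRA: alpha / r
print(lora_scaling(alpha, r, rslora=True))   # rsLoRA: alpha / sqrt(r)
```

For r=16 and alpha=32 the multiplier is 8.0 with rsLoRA, against 2.0 under the classic rule, so the same alpha yields a four times larger effective update.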

Training data was about 18,600 synthetic disfluent utterances that we built ourselves, plus the DisfluencySpeech train split (4,500 real human utterances). The synthetic portion starts from clean LibriSpeech transcripts (text only; the original audio is discarded), injects fillers, repetitions, and self-repairs into that text, then voices the result with Kokoro TTS rotated across 54 voices. So the model learns mostly from LibriSpeech-derived synthetic speech. DisfluencySpeech contributes the smaller real-human portion through its train split, and its test split serves as the held-out benchmark. Real-speech labels use transcript_c normalized to lowercase with apostrophes kept and other punctuation stripped, which matches the synthetic label convention.
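
The label convention and the text-side injection step can be sketched roughly as follows. This is an illustrative reconstruction, not the actual pipeline: `normalize_label` and `inject_disfluencies` are hypothetical names, the probabilities are placeholders, hyphens are treated as ordinary punctuation, and self-repairs are left out for brevity.

```python
import random
import re

def normalize_label(text):
    """Lowercase, keep apostrophes, strip other punctuation, collapse spaces."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe to straight
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

FILLERS = ["um", "uh"]

def inject_disfluencies(words, rng, p_filler=0.1, p_repeat=0.1):
    """Toy injector: maybe add a filler before each word, maybe repeat the word."""
    out = []
    for w in words:
        if rng.random() < p_filler:
            out.append(rng.choice(FILLERS))
        out.append(w)
        if rng.random() < p_repeat:
            out.append(w)
    return out

fluent = normalize_label("Well, I don't know... It's fine!")
disfluent = " ".join(inject_disfluencies(fluent.split(), random.Random(0)))
print(fluent)     # training target (fluent label)
print(disfluent)  # text that would be sent to TTS
```

In this setup the fluent text is the label and the disfluent text is only ever voiced, which is what teaches the model to map disfluent audio to a clean transcript.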

The train and test splits of DisfluencySpeech share one speaker, but their utterances are disjoint: we found 0 exact matches and 0 near-duplicates between train and test utterance texts (maximum token-level Jaccard of any test utterance against any train utterance was 0.40).
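
A check of this kind can be sketched as below. It assumes Jaccard is computed over whitespace-token sets, and the near-duplicate threshold of 0.8 is an assumption chosen for illustration; the card does not state the threshold that was used. Exact matches have a Jaccard of 1.0, so in this sketch they are also counted as near-duplicates.

```python
def token_jaccard(a, b):
    """Jaccard similarity between the token sets of two utterances."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def max_train_overlap(test_texts, train_texts):
    """For each test utterance, the highest Jaccard against any train utterance."""
    return [max(token_jaccard(t, tr) for tr in train_texts) for t in test_texts]

def leakage_report(test_texts, train_texts, near_dup_threshold=0.8):
    """Count exact matches and near-duplicates, and report the worst-case overlap."""
    train_set = set(train_texts)
    exact = sum(t in train_set for t in test_texts)
    best = max_train_overlap(test_texts, train_texts)
    near = sum(score >= near_dup_threshold for score in best)
    return {"exact": exact, "near_duplicates": near, "max_jaccard": max(best)}
```

The comparison is quadratic in the number of utterances, which is unproblematic at this scale (250 test utterances against 4,500 train utterances).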

Training ran for 3,500 steps at a peak LR of 1e-4 with warmup and decay, saving a checkpoint every 250 steps. The shipped checkpoint is step 2000, chosen for the lowest whisper_norm WER-C on the validation split. Checkpoint selection was done offline on validation; the in-trainer metrics were for monitoring only.
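
The offline selection step amounts to an argmin over saved checkpoints. A minimal sketch, assuming validation WER has already been computed per checkpoint and stored in a dict keyed by step; the function names and the tie-breaking rule (earlier step wins) are assumptions for this example.

```python
def checkpoint_steps(total_steps=3500, save_every=250):
    """Steps at which a checkpoint is saved under the schedule above."""
    return list(range(save_every, total_steps + 1, save_every))

def select_checkpoint(val_wer_by_step):
    """Return the step with the lowest validation WER; ties go to the earlier step."""
    return min(sorted(val_wer_by_step), key=lambda step: val_wer_by_step[step])

print(len(checkpoint_steps()))  # number of candidate checkpoints
```

Keeping the test split out of this loop is what allows the test number to be reported as a single, unselected measurement.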

Intended use and caveats

Use it for dictation and transcript cleanup, where fluent and readable output is the goal. It deletes speech by design, so do not use it for verbatim, legal, or evidentiary transcription.

The evaluation benchmark is one speaker reading acted scripts. Transfer to other voices, accents, L2 English, and spontaneous conversational audio is not established. That is a separate question from the utterance-overlap check above: disjoint utterances do not imply speaker or domain generalization.

License

Apache-2.0. The adapter weights are released under this license, and every training-data source is openly and commercially licensed: LibriSpeech transcripts are CC BY 4.0, DisfluencySpeech is Apache-2.0, and Kokoro TTS is Apache-2.0. PodcastFillers was deliberately left out, since its annotations are non-commercial.
