wrice/whisper-tiny-grpo-72a3573-fixed-en-t0.7-lr3e-6 API & Inference Endpoint

Training data

Fine-tuned on fixie-ai/common_voice_17_0 (en), streamed and decoded on the fly from a 1024-clip slice. Each clip's language is pinned from its Common Voice locale during training.

Performance per language

Validation WER and CER at the best checkpoint, per Common Voice locale (also in the Evaluation Results metadata above):

Language	WER	CER
`en`	0.228	—
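For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are commonly computed from edit distance. It is an illustration, not the evaluation code behind the table: the helper names are hypothetical, and the text normalization applied before scoring in this run is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a WER of 0.228 corresponds to roughly 23 word-level errors per 100 reference words.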

How it was trained

Instead of minimizing cross-entropy against a single reference, training uses a group-relative policy-gradient recipe (GRPO). For each audio clip, the policy samples a group of num_generations transcriptions. Each transcription is scored by a negated blend of word error rate, character error rate, and length and repetition penalties, so a higher reward means a better transcription. The policy is then nudged toward the better candidates with a clipped policy-gradient objective, regularized by a per-token KL penalty to the frozen base model. Advantages are the group-relative, standardized rewards, A = (r - mean) / (std + eps), where mean and std are taken over the rewards in the group, so no value network is needed. The clip's language is pinned from its Common Voice locale, and the policy's own greedy transcriptions are scored to produce validation WER and CER.

Hyperparameters

Field	Value
Base model	`openai/whisper-tiny`
Dataset	`fixie-ai/common_voice_17_0`
Learning rate	`3e-06`
Sampling temperature	`0.7`
Group size (generations/clip)	`8`
KL penalty (β)	`0.04`
Batch size (clips/step)	`4`
Max optimizer steps	`500`
Warmup steps	`20`
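To make the table easier to reuse, the values can be collected in a small config object. The sketch below is illustrative only: RunConfig and lr_at are hypothetical names, and the linear warmup followed by a constant rate is an assumption, since the card gives the number of warmup steps but not the schedule shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hyperparameters copied from the table above."""
    base_model: str = "openai/whisper-tiny"
    dataset: str = "fixie-ai/common_voice_17_0"
    learning_rate: float = 3e-6
    temperature: float = 0.7
    num_generations: int = 8   # group size (generations per clip)
    kl_beta: float = 0.04
    clips_per_step: int = 4    # batch size (clips per step)
    max_steps: int = 500
    warmup_steps: int = 20

    @property
    def samples_per_step(self) -> int:
        """Sampled transcriptions scored per optimizer step."""
        return self.clips_per_step * self.num_generations


def lr_at(step: int, cfg: RunConfig) -> float:
    """Learning rate at a 0-indexed step.

    Assumption: linear warmup, then constant. The real schedule may decay.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate


cfg = RunConfig()
```

With these values, each step scores 4 × 8 = 32 sampled transcriptions, so a full 500-step run scores at most 16,000.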

Training curves

Pulled from the Weights & Biases run (static snapshot):

[Figure: training curves]

Usage

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="wrice/whisper-tiny-grpo-72a3573-fixed-en-t0.7-lr3e-6")
print(asr("audio.wav")["text"])
```
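Since the language was pinned during training, it may help to pin it at inference as well rather than rely on Whisper's language detection. The sketch below passes decoding options through the pipeline's generate_kwargs argument. The helper names are hypothetical, and whether pinning changes results for this checkpoint has not been measured here.

```python
MODEL_ID = "wrice/whisper-tiny-grpo-72a3573-fixed-en-t0.7-lr3e-6"


def build_generate_kwargs(language="en"):
    """Decoding options forwarded to Whisper's generate(); pins the language."""
    return {"language": language, "task": "transcribe"}


def transcribe(path, language="en"):
    """Transcribe one audio file with the language pinned."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    return asr(path, generate_kwargs=build_generate_kwargs(language))["text"]
```

Calling transcribe("audio.wav") downloads the checkpoint on first use and returns the transcription as a string.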

Limitations

A proof-of-concept GRPO recipe, not a tuned production system. WER and CER are reported on a held-out Common Voice validation slice after text normalization; real-world performance varies by domain, accent, language, and audio quality.
