README
License: MIT
Training data
Fine-tuned on fixie-ai/common_voice_17_0 (en), streamed and decoded on the
fly from a 1024-clip slice. Each clip's language is pinned from its Common
Voice locale during training.
Performance per language
Validation WER and CER at the best checkpoint, per Common Voice locale (also in the Evaluation Results metadata above):
| Language | WER | CER |
|---|---|---|
| en | 0.228 | n/a |
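For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are conventionally computed: Levenshtein distance over words or characters, divided by the reference length. The function names are illustrative, and the text normalization used for this card's numbers is not specified here, so the sketch assumes the strings are already normalized.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

Under this definition, a WER of 0.228 corresponds to roughly 23 word-level edits per 100 reference words.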
How it was trained
Instead of cross-entropy against a single reference, for each audio clip the
policy samples a group of num_generations transcriptions, scores each by a
negated blend of word error rate, character error rate, and length / repetition
penalties, and is nudged toward the better candidates with a clipped
policy-gradient objective regularized by a per-token KL penalty to the frozen
base model. Advantages are the group-relative, standardized rewards
(A = (r - mean) / (std + eps)), so no value network is needed. The clip's
language is pinned from its Common Voice locale, and the policy's own greedy
transcriptions are scored as validation WER and CER.
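The update described above can be sketched in plain Python. This is an illustration under stated assumptions, not the training code behind this checkpoint: the blend weights, the clip range `clip_eps`, the use of population standard deviation, and the particular KL estimator are all assumptions, and every function name is hypothetical. Only the advantage formula and β = 0.04 come from this card.

```python
import math
from statistics import fmean, pstdev

# Hypothetical blend weights; the card does not state the actual values.
W_WER, W_CER, W_LEN, W_REP = 1.0, 0.5, 0.1, 0.1


def reward(wer, cer, length_penalty, repetition_penalty):
    """Negated weighted blend: lower error and penalty give a higher reward."""
    return -(W_WER * wer + W_CER * cer
             + W_LEN * length_penalty + W_REP * repetition_penalty)


def group_advantages(rewards, eps=1e-6):
    """A = (r - mean) / (std + eps), computed within one clip's group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_token_loss(logp_new, logp_old, logp_ref, advantage,
                       clip_eps=0.2, beta=0.04):
    """Per-token loss: clipped surrogate minus beta times a KL estimate."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative estimator of KL(policy || frozen base) for this token;
    # assumed here, the card only says "per-token KL penalty".
    log_ratio_ref = logp_ref - logp_new
    kl = math.exp(log_ratio_ref) - log_ratio_ref - 1
    return -(surrogate - beta * kl)


# Two candidate transcriptions for one clip: the first has lower error.
rewards = [reward(0.10, 0.05, 0.0, 0.0), reward(0.30, 0.12, 0.0, 0.2)]
advantages = group_advantages(rewards)
```

Because the advantages are standardized within each group, they sum to zero per clip: candidates better than the group average are reinforced and the rest are pushed down, which is why no separate value network is required.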
Hyperparameters
| Field | Value |
|---|---|
| Base model | openai/whisper-tiny |
| Dataset | fixie-ai/common_voice_17_0 |
| Learning rate | 3e-06 |
| Sampling temperature | 0.7 |
| Group size (generations/clip) | 8 |
| KL penalty (β) | 0.04 |
| Batch size (clips/step) | 4 |
| Max optimizer steps | 500 |
| Warmup steps | 20 |
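To make the scale of the run concrete, the table can be collected into a small config object and a few derived quantities. This is a sketch only: `RunConfig` and its field names are hypothetical, the arithmetic assumes no gradient accumulation, and the schedule after warmup (held constant here) is an assumption the card does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container mirroring the hyperparameter table."""
    base_model: str = "openai/whisper-tiny"
    dataset: str = "fixie-ai/common_voice_17_0"
    learning_rate: float = 3e-6
    temperature: float = 0.7
    num_generations: int = 8
    kl_beta: float = 0.04
    clips_per_step: int = 4
    max_steps: int = 500
    warmup_steps: int = 20


def lr_at(step, cfg):
    """Linear warmup to the peak rate, then constant (assumed shape)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate


cfg = RunConfig()
# Sampled transcriptions scored per optimizer step: 4 clips x 8 generations.
samples_per_step = cfg.clips_per_step * cfg.num_generations
# Upper bound on sampled transcriptions over the whole run.
max_total_samples = samples_per_step * cfg.max_steps
```

Under these assumptions each step scores 32 sampled transcriptions, and a full 500-step run draws 2,000 clips, or about two passes over the 1024-clip slice.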
Training curves
[Figure: training curves, static snapshot from the Weights & Biases run]
Usage
```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="wrice/whisper-tiny-grpo-72a3573-fixed-en-t0.7-lr3e-6")
print(asr("audio.wav")["text"])
```
Limitations
A proof-of-concept GRPO recipe, not a tuned production system. WER and CER are reported on a held-out Common Voice validation slice after text normalization; real-world performance varies by domain, accent, language, and audio quality.
Model provider: wrice
Base model: openai/whisper-tiny
Modalities: audio input, text output