README

License: MIT

Training data

Fine-tuned on a 1024-clip slice of fixie-ai/common_voice_17_0 (en), streamed and decoded on the fly. Each clip's language is pinned from its Common Voice locale during training.

Performance per language

Validation WER and CER at the best checkpoint, per Common Voice locale (also in the Evaluation Results metadata above):

| Language | WER | CER |
| --- | --- | --- |
| en | 0.228 | |
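As a rough illustration of how per-utterance WER and CER figures like these can be computed, here is a minimal sketch in plain Python. The `normalize` helper is an assumption: the card says text is normalized before scoring but does not say which normalizer is used, so lowercasing plus ASCII punctuation stripping stands in for it. All function names are hypothetical and not taken from the training code.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalizer: lowercase, drop ASCII punctuation, collapse spaces."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences, keeping one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = normalize(reference).split()
    hyp_words = normalize(hypothesis).split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits over reference length."""
    ref_chars = normalize(reference)
    hyp_chars = normalize(hypothesis)
    return edit_distance(ref_chars, hyp_chars) / max(len(ref_chars), 1)
```

For example, `wer("the cat sat", "the bat sat")` counts one substitution over three reference words. Reported corpus-level numbers are often pooled over all clips (total edits over total reference words) rather than averaged per clip; the card does not say which convention applies here.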

How it was trained

Training replaces cross-entropy against a single reference with a group-based policy-gradient objective. For each audio clip, the policy samples a group of num_generations transcriptions and scores each with a negated blend of word error rate, character error rate, and length and repetition penalties. A clipped policy-gradient objective, regularized by a per-token KL penalty to the frozen base model, then nudges the policy toward the better candidates. Advantages are the group-relative, standardized rewards, A = (r - mean) / (std + eps), so no value network is needed. The clip's language is pinned from its Common Voice locale, and the policy's own greedy transcriptions are scored to give validation WER and CER.
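The reward and advantage steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the training code: the blend weights `w_wer` and `w_cer`, the way the penalties are added, the `eps` value, and the use of the population standard deviation are all hypothetical, since the card gives only the general form A = (r - mean) / (std + eps).

```python
from statistics import fmean, pstdev


def blended_reward(wer, cer, length_penalty=0.0, repetition_penalty=0.0,
                   w_wer=1.0, w_cer=1.0):
    """Negated blend: lower error and smaller penalties give a higher reward.

    The weights and the additive penalties are assumptions for illustration.
    """
    return -(w_wer * wer + w_cer * cer + length_penalty + repetition_penalty)


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one clip's group of sampled transcriptions.

    Implements A_i = (r_i - mean) / (std + eps). The population standard
    deviation is used here; the card does not say which variant is used.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, candidates that beat their siblings on the same clip get positive advantages and the rest get negative ones. If every candidate in a group earns the same reward, all advantages are zero and that clip contributes no policy-gradient signal.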

Hyperparameters

| Field | Value |
| --- | --- |
| Base model | openai/whisper-tiny |
| Dataset | fixie-ai/common_voice_17_0 |
| Learning rate | 3e-06 |
| Sampling temperature | 0.7 |
| Group size (generations/clip) | 8 |
| KL penalty (β) | 0.04 |
| Batch size (clips/step) | 4 |
| Max optimizer steps | 500 |
| Warmup steps | 20 |
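To show where the KL penalty β enters, here is a hedged per-token sketch of a clipped policy-gradient loss with a KL term toward the frozen base model. Only β = 0.04 comes from the hyperparameters listed here. The clip range `clip_eps = 0.2` and the particular non-negative KL estimator are assumptions borrowed from common GRPO implementations; the card does not state either, and `token_loss` is a hypothetical name.

```python
import math


def token_loss(logp_new, logp_old, logp_ref, advantage,
               beta=0.04, clip_eps=0.2):
    """Loss for one generated token (to be minimized).

    logp_new: log-prob under the current policy
    logp_old: log-prob under the policy that sampled the token
    logp_ref: log-prob under the frozen base model
    clip_eps and the KL estimator below are assumed, not from the card.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Pessimistic (clipped) surrogate: take the smaller of the two terms.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate: exp(d) - d - 1 with d = logp_ref - logp_new.
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1.0
    return -(surrogate - beta * kl)
```

In a full training step this value would be averaged over the tokens of all sampled transcriptions in the batch, with each token sharing the advantage of the transcription it belongs to.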

Training curves

Pulled from the Weights & Biases run (static snapshot):

[Figure: training curves]

Usage

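```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a speech-recognition pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="wrice/whisper-tiny-grpo-72a3573-fixed-en-t0.7-lr3e-6",
)

# Transcribe a local audio file and print the recognized text.
print(asr("audio.wav")["text"])
```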

Limitations

A proof-of-concept GRPO recipe, not a tuned production system. WER and CER are reported on a held-out Common Voice validation slice after text normalization; real-world performance varies by domain, accent, language, and audio quality.

Model provider

wrice

Model tree

Base

openai/whisper-tiny

Fine-tuned

this model

Modalities

Input

Audio

Output

Text
