Trelis

whisper-hinglish-preview


License: apache-2.0

Try it live

Available on Trelis Router (https://router.trelis.com/models):

  • UI: log in and upload an audio clip to transcribe it right in the browser.
  • API: POST https://router.trelis.com/api/v1/transcribe for programmatic access (requires an API key).

Evaluation

Corpus WER (%, lower is better) under a script-safe indic-hindi normaliser (NFC + Indic normalisation, keeps Devanagari matras/nuktas, strips punctuation; not the Whisper default, which strips matras and inflates Devanagari WER). Compared against two leading commercial APIs: Sarvam (Saaras-v3) and ElevenLabs Scribe-v2.
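
As a rough illustration of the metric described above, the sketch below computes corpus-level WER (total word edits divided by total reference words) after a script-safe normalisation. It is not the normaliser used for the tables: the function names are hypothetical, the Indic-specific normalisation step is omitted, and lowercasing is an assumption. It relies only on NFC and punctuation stripping from the Python standard library.

```python
import unicodedata


def normalise(text: str) -> str:
    """NFC-normalise and strip punctuation, keeping combining marks (matras, nuktas)."""
    text = unicodedata.normalize("NFC", text)
    # Replace only punctuation (Unicode categories P*) with spaces; marks (M*) survive.
    kept = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    # Lowercasing is an assumption here; it only affects Latin-script words.
    return " ".join(kept.lower().split())


def word_edits(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def corpus_wer(refs: list[str], hyps: list[str]) -> float:
    """Corpus WER in percent: edits summed over all utterances / total reference words."""
    edits = words = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalise(ref).split(), normalise(hyp).split()
        edits += word_edits(r, h)
        words += len(r)
    return 100.0 * edits / words


refs = ["मैं office जा रहा हूँ।", "Hello world"]
hyps = ["मैं office जा रहा हूँ", "hello word"]
print(f"{corpus_wer(refs, hyps):.2f}")  # 14.29 (1 substitution over 7 reference words)
```

Summing edits and reference words over the whole corpus before dividing means long utterances weigh more than short ones, which is what distinguishes corpus WER from an average of per-utterance WERs.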

🟠 Hinglish — code-switched (Hindi + English in one utterance, each in their native script)

| Benchmark | whisper-hinglish-preview | Sarvam | Scribe-v2 | whisper-large-v3 | Vaani |
|---|---|---|---|---|---|
| CoSHE-500 (conversational CS) | 13.67 | **11.47** ᶜᵐ | 12.43 | 29.74 | 73.96 |
| cs-fleurs (read CS) | 10.19 | 16.47 ᶜᵐ | **7.57** | 33.92 | 34.12 |
| hiacc-adult (accented CS) | **12.73** | 14.44 ᶜᵐ | 16.98 | 28.53 | 60.09 |
| hiacc-child (accented CS) | **10.69** | 14.11 ᶜᵐ | 18.36 | 27.91 | 32.17 |

🔵 Hindi (pure Devanagari)

| Benchmark | whisper-hinglish-preview | Sarvam | Scribe-v2 | whisper-large-v3 | Vaani |
|---|---|---|---|---|---|
| Common Voice Hindi (cv-hi) | 12.86 | **12.40** | 13.44 | 30.82 | 14.48 |
| FLEURS-hi | 12.57 | **10.07** | 11.33 | 27.50 | 11.58 |

⚪ English

| Benchmark | whisper-hinglish-preview | Sarvam | Scribe-v2 | whisper-large-v3 | Vaani |
|---|---|---|---|---|---|
| FLEURS-en | 6.93 | 5.14 | **4.01** | 4.81 | 101.66 |

Bold = best on that row.


How to use

Like any Whisper model, specify the language when you transcribe.

```python
import soundfile as sf
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

repo = "Trelis/whisper-hinglish-preview"
proc = WhisperProcessor.from_pretrained(repo)
model = WhisperForConditionalGeneration.from_pretrained(repo, torch_dtype=torch.bfloat16).to("cuda").eval()

# Load the clip; the model expects 16 kHz mono audio.
audio, sr = sf.read("clip.wav")
# Passing the file's own rate makes the feature extractor raise if the clip is not 16 kHz.
feat = proc.feature_extractor(audio, sampling_rate=sr, return_tensors="pt").input_features.to("cuda", torch.bfloat16)

# Hindi audio → force <|hi|>; English audio → force <|en|>
ids = proc.tokenizer.convert_tokens_to_ids
prompt = [ids("<|startoftranscript|>"), ids("<|hi|>"), ids("<|transcribe|>"), ids("<|notimestamps|>")]

out = model.generate(
    input_features=feat,
    decoder_input_ids=torch.tensor([prompt]).to("cuda"),
    max_new_tokens=440,
)
print(proc.tokenizer.decode(out[0], skip_special_tokens=True))
```

Code-switched audio. The model uses a dedicated <|mixedcode|> token for utterances that mix Devanagari and Latin script. Insert it immediately after the language token, and choose the language token by the dominant script of the utterance (<|hi|> for mostly Devanagari, <|en|> for mostly Latin):

```python
# Reuses `proc` and `ids` from the snippet above.
mc = proc.tokenizer("<|mixedcode|>", add_special_tokens=False).input_ids
prompt = [ids("<|startoftranscript|>"), ids("<|hi|>"), *mc, ids("<|transcribe|>"), ids("<|notimestamps|>")]
```
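
The "dominant script" rule can be sketched as a small helper. This is only an illustration under stated assumptions: the function names are hypothetical, the text being inspected would have to come from somewhere you already have (for example a first-pass transcript), and both the tie-break towards <|hi|> and the rule of adding <|mixedcode|> only when both scripts are present are choices made for this sketch, not documented behaviour of the model.

```python
def count_scripts(text: str) -> tuple[int, int]:
    """Return (devanagari, latin) character counts for `text`."""
    # Code points in the Devanagari Unicode block (letters, matras, digits, danda).
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    # ASCII letters only; accented Latin letters are ignored in this sketch.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return devanagari, latin


def build_prompt_tokens(text: str) -> list[str]:
    """Pick the language token by dominant script and add <|mixedcode|> if both scripts occur."""
    devanagari, latin = count_scripts(text)
    # Assumption: ties (including empty text) fall back to <|hi|>.
    lang = "<|hi|>" if devanagari >= latin else "<|en|>"
    tokens = ["<|startoftranscript|>", lang]
    if devanagari and latin:
        tokens.append("<|mixedcode|>")
    tokens += ["<|transcribe|>", "<|notimestamps|>"]
    return tokens


print(build_prompt_tokens("मैं office जा रहा हूँ"))
print(build_prompt_tokens("hello world"))
```

The helper returns token strings rather than ids, so they still need converting the same way as in the snippets above: the tokenizer's `convert_tokens_to_ids` for the standard Whisper tokens and a tokenizer call with `add_special_tokens=False` for <|mixedcode|>.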

Disclaimers

  • Commercial-API WERs on pure Hindi benchmarks here are pessimistic. Sarvam and Scribe keep English loanwords in Latin script and numbers as digits, whereas our references render everything in Devanagari. A translit-blind WER then charges a substitution per loanword/number against them. The comparison is apples-to-apples on our Devanagari-reference protocol, not a claim about their raw quality.
  • ᶜᵐ Sarvam evaluated in its code-mixed mode.
  • Specify the language (<|hi|> / <|en|>) as shown above — standard Whisper usage — for the reported quality.

Attributions

Model provider: Trelis

Base model: ARTPARK-IISc/whisper-large-v3-vaani-hindi (this model is a fine-tune of it)

Modalities: audio input, text output
