README

License: MIT

Model Description

Standard ASR models trained on monolingual data degrade significantly on code-switched speech, that is, utterances that mix Tamil and English mid-sentence. This model addresses that gap with targeted fine-tuning: the training data is reweighted to oversample code-switched segments and samples with many switch points, guided by a structured failure taxonomy.

| Property | Value |
|---|---|
| Base model | openai/whisper-small |
| Fine-tuning method | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, v_proj |
| Training data | IndicVoices Tamil (1500 samples, stratified) |
| Languages | Tamil (ta), English (en), Tamil-English mixed |
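To give a sense of how small the adapter is, the sketch below counts the trainable parameters these LoRA settings imply. It assumes the published whisper-small architecture (hidden size 768, 12 encoder layers, 12 decoder layers, with self-attention and cross-attention in every decoder layer) and LoRA matrices without biases. Those architecture figures and all names in the code are assumptions for illustration, not values taken from this model card.

```python
# Rough LoRA parameter count for rank 32 on q_proj and v_proj.
# Architecture constants are assumed from the public whisper-small config.
D_MODEL = 768
ENCODER_LAYERS = 12          # one self-attention block per layer
DECODER_LAYERS = 12          # self-attention + cross-attention per layer
RANK = 32
ALPHA = 64
TARGETS_PER_ATTENTION = 2    # q_proj and v_proj


def lora_param_count(d_model: int, rank: int, n_modules: int) -> int:
    """Each adapted linear layer adds A (rank x d_model) and B (d_model x rank)."""
    return n_modules * 2 * rank * d_model


attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION
trainable = lora_param_count(D_MODEL, RANK, adapted_modules)
scaling = ALPHA / RANK       # factor applied to the low-rank update

print(adapted_modules, trainable, scaling)  # 72 3538944 2.0
```

Under these assumptions the adapter holds roughly 3.5M parameters, a small fraction of the base model.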

Intended Use

  • Transcription of Tamil-English code-switched (Tanglish) speech
  • Voice interfaces and STT pipelines for urban Indian users
  • Research baseline for code-switched Indic ASR

Out of scope: Clean monolingual Tamil or English at scale — use openai/whisper-medium or ai4bharat/indicwav2vec for monolingual speech.

Evaluation Results

WER on held-out test set (synthetic Tamil-English code-switched corpus), stratified by segment type:

| Segment type | Whisper-small (baseline) | Whisper-tamil-medium | This model |
|---|---|---|---|
| Overall | 0.976 | 0.829 | 0.682 |
| Monolingual Tamil | 0.957 | 0.688 | 0.769 |
| Monolingual English | 1.009 | 0.980 | 0.566 |
| Code-switched | 0.964 | 0.879 | 0.564 |
| CS Penalty (×) | 0.98× | 1.05× | 0.84× |
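For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal implementation under that standard definition; it is not the evaluation code used for this model, and it applies no text normalization, which a real pipeline may do. Because insertions count against the reference length, WER can exceed 1.0, as in the 1.009 entry above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(wer("naan office ku late", "naan office late"))  # one deletion out of four words
print(wer("ok", "ok ok ok"))                           # two insertions against one word
```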

41.5% relative WER reduction on code-switched speech vs. the Whisper-small baseline (0.964 to 0.564), and a 36% relative reduction vs. Whisper-tamil-medium (0.879 to 0.564), the best pre-trained Tamil-specialized model in the comparison.

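CS Penalty = code-switched WER ÷ average(mono-Tamil WER, mono-English WER). A value below 1.0 means the model's WER on code-switched speech is lower than its average WER on monolingual speech. Of the baselines in the table, Whisper-tamil-medium exceeds 1.0 (1.05×) and Whisper-small sits just below it (0.98×); this model's 0.84× is the lowest of the three.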
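The derived figures can be checked directly from the WER values in the table. The sketch below recomputes the CS Penalty and the two relative reductions; the dictionary keys and function names are illustrative choices, and the numbers are copied from the table above.

```python
# WER values copied from the evaluation table (ta = monolingual Tamil,
# en = monolingual English, cs = code-switched).
WER_TABLE = {
    "Whisper-small (baseline)": {"ta": 0.957, "en": 1.009, "cs": 0.964},
    "Whisper-tamil-medium": {"ta": 0.688, "en": 0.980, "cs": 0.879},
    "This model": {"ta": 0.769, "en": 0.566, "cs": 0.564},
}


def cs_penalty(row: dict) -> float:
    """Code-switched WER divided by the mean of the two monolingual WERs."""
    return row["cs"] / ((row["ta"] + row["en"]) / 2)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline`."""
    return (baseline - new) / baseline


for name, row in WER_TABLE.items():
    print(f"{name}: {cs_penalty(row):.2f}x")

ours = WER_TABLE["This model"]["cs"]
vs_small = relative_reduction(WER_TABLE["Whisper-small (baseline)"]["cs"], ours)
vs_tamil = relative_reduction(WER_TABLE["Whisper-tamil-medium"]["cs"], ours)
print(f"{vs_small:.1%} vs Whisper-small, {vs_tamil:.0%} vs Whisper-tamil-medium")
```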

See full results in the training repository.

Failure Taxonomy

The fine-tuning strategy was derived from a structured failure analysis of the baselines, which sorts errors into 5 categories:

| Category | Description | Whisper-small | Whisper-tamil | Wav2Vec2-tamil | Ours (LoRA) |
|---|---|---|---|---|---|
| SUBSTITUTION_SWITCH | Error at a Tamil↔English switch boundary | 46% | 46% | 64% | 58% |
| LANGUAGE_CONFUSION | Tamil word output in English script or vice versa | 54% | 54% | 36% | 41% |
| DELETION_PROPER_NOUN | Named entity deleted from output | 0% | 0% | 0% | 0% |
| SUBSTITUTION_NUMBER | Number or date transcribed incorrectly | 0% | 0% | 0% | 0% |
| INSERTION_FILLER | Hallucinated filler (um, uh, like) | 0% | 0% | 0% | 1% |

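Across the three baselines, only SUBSTITUTION_SWITCH and LANGUAGE_CONFUSION were observed, which suggests both are systemic weaknesses shared across models rather than model-specific bugs. Each column sums to 100%, so the figures read as shares of a model's categorized errors rather than absolute error rates. Relative to Whisper-small, fine-tuning lowered the LANGUAGE_CONFUSION share from 54% to 41%, while the SUBSTITUTION_SWITCH share rose from 46% to 58% and a small INSERTION_FILLER share (1%) appeared. Neither dominant category was eliminated.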
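As a rough illustration of how per-category shares like these could be tallied, the sketch below converts a list of per-error labels into percentages. How errors were actually labeled is not described here, so the function name and the label counts in the example are purely hypothetical; the counts are chosen only so the output matches the "Ours (LoRA)" column.

```python
from collections import Counter

CATEGORIES = (
    "SUBSTITUTION_SWITCH",
    "LANGUAGE_CONFUSION",
    "DELETION_PROPER_NOUN",
    "SUBSTITUTION_NUMBER",
    "INSERTION_FILLER",
)


def category_shares(labels: list[str]) -> dict[str, int]:
    """Return each category's share of all labeled errors, in whole percent."""
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labeled errors to tally")
    return {name: round(100 * counts[name] / total) for name in CATEGORIES}


# Hypothetical label counts, for illustration only.
labels = (
    ["SUBSTITUTION_SWITCH"] * 58
    + ["LANGUAGE_CONFUSION"] * 41
    + ["INSERTION_FILLER"] * 1
)
print(category_shares(labels))
```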

Training Procedure

Data sampling (targeted oversampling):

  • Code-switched segments: ×3
  • Segments with >2 language switch points: ×2
  • Monolingual segments: ×0.5 (undersampled)

Hyperparameters:

  • Epochs: 3
  • Batch size: 4 (effective: 16 with gradient accumulation ×4)
  • Learning rate: 1e-3 with 50 warmup steps
  • Optimizer: AdamW 8-bit
  • Precision: FP16
  • Early stopping: patience 3, metric WER
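The batch-size and warmup settings above can be worked through numerically. The sketch assumes a single device and a linear warmup shape; the list above gives only the step count, so the shape is an assumption, and the schedule after warmup is not specified and is therefore held at the peak rate here. Function names are illustrative.

```python
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
PEAK_LR = 1e-3
WARMUP_STEPS = 50


def effective_batch_size(per_device: int, accum: int, num_devices: int = 1) -> int:
    """Samples contributing to each optimizer update."""
    return per_device * accum * num_devices


def warmup_lr(step: int, peak: float = PEAK_LR, warmup: int = WARMUP_STEPS) -> float:
    """Linear ramp from 0 to `peak` over `warmup` optimizer steps (assumed shape)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return peak * min(1.0, step / warmup)


print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))  # 16
print(warmup_lr(25), warmup_lr(50))
```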

How to Use

python

import numpy as np
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

base_model_id = "openai/whisper-small"
adapter_model_id = "Dhanush66-rv/whisper-small-tanglish-lora"

# The processor (feature extractor + tokenizer) is loaded from the adapter repo.
processor = WhisperProcessor.from_pretrained(adapter_model_id)
base = WhisperForConditionalGeneration.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base, adapter_model_id)  # attach the LoRA adapter
model.eval()


def transcribe(audio: np.ndarray) -> str:
    """Transcribe a mono float32 waveform sampled at 16 kHz."""
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        # Pass input_features by keyword; some PEFT wrappers only forward keyword arguments.
        ids = model.generate(input_features=inputs, language="ta", task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
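The function above expects a mono float32 array at 16 kHz. One way to obtain that from a file, sketched here with only the standard-library `wave` module and NumPy, is to read 16-bit PCM and average the channels. The helper name `load_wav_16k_mono` is hypothetical, and the sketch does not resample: it assumes the file is already 16 kHz, 16-bit PCM and raises otherwise.

```python
import wave

import numpy as np


def load_wav_16k_mono(path: str) -> np.ndarray:
    """Read a 16 kHz, 16-bit PCM WAV file as mono float32 in [-1, 1)."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wf.getframerate()} Hz")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        n_channels = wf.getnchannels()
        frames = wf.readframes(wf.getnframes())

    # Interleaved little-endian int16 samples -> (n_frames, n_channels).
    pcm = np.frombuffer(frames, dtype="<i2").reshape(-1, n_channels)
    mono = pcm.astype(np.float32).mean(axis=1)   # average channels to mono
    return (mono / 32768.0).astype(np.float32)   # scale to [-1, 1)
```

The returned array can be passed straight to `transcribe`. Files at other sample rates would need resampling first, for example with an audio library of your choice.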

Or via the FastAPI endpoint (see api/app.py in the training repo):

bash

uvicorn api.app:app --port 8000
curl -X POST http://localhost:8000/transcribe -F "audio=@speech.wav"

Limitations

  • Trained on 1500 samples — a small corpus. Performance on diverse speakers, accents, and domains will vary.
  • Language detection for segment tagging uses langdetect, which can misclassify short Tamil-script words.
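  • Numbers and proper nouns (especially transliterated names) remain a known weak point. The DELETION_PROPER_NOUN and SUBSTITUTION_NUMBER failure categories both registered 0% in the failure analysis above, so that analysis gives no evidence on how often these errors occur in practice.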
  • Not evaluated on spontaneous conversational speech; training data is read-speech from IndicVoices.

Citation

bibtex

@misc{whisper-small-tanglish-lora,
  author    = {Dhanush, R V},
  title     = {Whisper-small fine-tuned for Tamil-English code-switched ASR},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Dhanush66-rv/whisper-small-tanglish-lora}
}

Model provider

Dhanush66-rv

Model tree

Base

openai/whisper-small

Adapter

this model

Modalities

Input

Audio

Output

Text
