pr0mila-gh0sh
MediBeng-Whisper-Tiny-FL
License: apache-2.0
Model Description
MediBeng Whisper Tiny FL is the federated learning (v3) release of Whisper Tiny for automatic speech translation of code-switched Bengali–English clinical conversations into English. Training uses FedProx across 4 simulated hospital clients with speaker-based non-IID data partitions. Raw audio never leaves a client; only weight updates are sent to the server, which aggregates them.
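This release is the best-performing FL checkpoint (FedProx with speaker-based partitioning). On the 960-sample test set under the same free-generation evaluation protocol, it reaches 3.12% WER and 96.33 BLEU, compared with 28.20% WER and 73.06 BLEU for centralised v2.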
What’s New in v3 (Federated Learning)
| Area | v2 (centralised) | v3 FL (this release) |
|---|---|---|
| Training paradigm | Single-server centralised fine-tuning | Federated learning (4 clients, 3 rounds) |
| Privacy | All training data pooled centrally | Raw audio stays on clients; only weight updates aggregated |
| Algorithms | Seq2SeqTrainer only | FedAvg + FedProx (this model: FedProx) |
| Client partitioning | N/A | Speaker non-IID (Male/Female hospital shards) |
| Non-IID handling | N/A | FedProx proximal term (μ = 0.01) reduces client drift |
| Local training | 500 central steps | 50 steps/round × 3 rounds × 4 clients = 600 client steps |
| Eval protocol | Free + forced generation (v2 discrepancy) | Free generation (canonical); free/forced consistent |
| Best test WER | 28.20% (centralised v2, free gen) | 3.12% (FedProx speaker, free gen) |
| Best test BLEU | 73.06 | 96.33 |
| Statistical testing | Bootstrap / McNemar vs baseline | Full suite for FL vs baseline and vs centralised v2 |
Federated architecture
markdown
                ┌─────────────────────────┐
                │ FL Server (Aggregator)  │
                │   FedProx (μ = 0.01)    │
                └────────────┬────────────┘
                             │ broadcast global weights
       ┌─────────────────────┼─────────────────────┐
       │                     │                     │
┌──────▼──────┐       ┌──────▼──────┐       ┌──────▼──────┐
│  Client 0   │       │  Client 1   │  ...  │  Client 3   │
│ Male shard  │       │ Male shard  │       │ Female shard│
│ 863 samples │       │ 863 samples │       │ 843 samples │
│ 50 local    │       │ 50 local    │       │ 50 local    │
│ steps/round │       │ steps/round │       │ steps/round │
└──────┬──────┘       └──────┬──────┘       └──────┬──────┘
       └─────────────────────┼─────────────────────┘
                             │ upload weight updates
                ┌────────────▼────────────┐
                │  Weighted Aggregation   │
                └─────────────────────────┘
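The "Weighted Aggregation" step in the diagram can be illustrated with a minimal plain-Python sketch. It assumes the server weights each client by its sample count, as in standard FedAvg-style aggregation; the card does not state the exact weighting. The function name `weighted_average` and the toy parameter vectors are hypothetical.

```python
from typing import Dict, List

def weighted_average(client_weights: List[Dict[str, List[float]]],
                     num_samples: List[int]) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(num_samples)
    aggregated: Dict[str, List[float]] = {}
    for name in client_weights[0]:
        size = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(size)
        ]
    return aggregated

# Per-client sample counts taken from this card.
SAMPLES = [863, 863, 843, 843]

# Toy stand-ins for model weights: one flattened parameter vector per client.
clients = [{"layer.weight": [float(k), 1.0]} for k in range(4)]
global_weights = weighted_average(clients, SAMPLES)
print(global_weights)
```

Because the four shards are nearly the same size, this weighting is close to a plain mean here; the difference would matter more with unevenly sized clients.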
Training configuration
| Parameter | Value |
|---|---|
| Base checkpoint | openai/whisper-tiny |
| Algorithm | FedProx (μ = 0.01) |
| Clients | 4 (speaker non-IID) |
| FL rounds | 3 |
| Local steps per round | 50 |
| Local learning rate | 1e-5 |
| Local batch size | 1 |
| Optimizer | AdamW |
| Total FL training time | ~21 min (CPU) |
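As an illustration of what the FedProx setting in this table means, the sketch below adds the standard proximal penalty (μ/2)·‖w − w_global‖² to a client's task loss, which pulls local weights back toward the global model broadcast at the start of the round. It is a plain-Python sketch on toy vectors, not the training code used for this release; the function names are hypothetical.

```python
from typing import List

MU = 0.01  # proximal coefficient from the configuration table

def proximal_penalty(local: List[float], global_: List[float], mu: float = MU) -> float:
    """Return (mu / 2) * ||w_local - w_global||^2."""
    return 0.5 * mu * sum((w - g) ** 2 for w, g in zip(local, global_))

def fedprox_loss(task_loss: float, local: List[float], global_: List[float],
                 mu: float = MU) -> float:
    """Client objective: the usual task loss plus the proximal penalty."""
    return task_loss + proximal_penalty(local, global_, mu)

def proximal_gradient(local: List[float], global_: List[float],
                      mu: float = MU) -> List[float]:
    """Gradient of the penalty with respect to the local weights: mu * (w - w_global)."""
    return [mu * (w - g) for w, g in zip(local, global_)]

# Toy example: local weights have drifted from the global weights.
w_global = [0.0, 0.0]
w_local = [1.0, 2.0]
penalty = proximal_penalty(w_local, w_global)
grad = proximal_gradient(w_local, w_global)
print(penalty, grad)
```

With μ = 0 the objective reduces to plain FedAvg local training; larger μ keeps clients closer to the global model at the cost of slower local adaptation.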
Usage
Install dependencies:
bash
pip install transformers librosa torch datasets
Run inference (free generation — matches evaluation protocol):
python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "pr0mila-gh0sh/MediBeng-Whisper-Tiny-FL"

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# Free generation (canonical FL evaluation protocol)
model.config.forced_decoder_ids = None
model.generation_config.forced_decoder_ids = None

# Load the audio as 16 kHz mono, the sampling rate Whisper expects
audio, _ = librosa.load("path_to_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Generate token IDs and decode them into the English translation
predicted_ids = model.generate(input_features, max_length=225)
translation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Translation:", translation)
Intended Use
For researchers and developers building privacy-preserving clinical AST systems:
- Multi-hospital deployment without centralising patient audio
- Federated fine-tuning research on code-switched clinical speech
- Bengali–English medical translation in regulated environments
Training Data
Fine-tuned via federated learning on the MediBeng training split, partitioned across 4 simulated clients:
| Client | Partition | Samples |
|---|---|---|
| 0 | Male speaker shard A | 863 |
| 1 | Male speaker shard B | 863 |
| 2 | Female speaker shard A | 843 |
| 3 | Female speaker shard B | 843 |
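A speaker-based split like the one in this table could be produced along the following lines. This is a sketch under assumptions: the `speaker` field name, the `"male"`/`"female"` label values, and the contiguous two-way split per group are hypothetical, and the group sizes (1,726 and 1,686) are simply twice the per-shard counts above.

```python
from typing import Dict, List, Sequence

def partition_by_speaker(samples: List[dict],
                         group_order: Sequence[str] = ("male", "female"),
                         shards_per_group: int = 2) -> Dict[int, List[dict]]:
    """Group samples by speaker label, then cut each group into contiguous shards."""
    clients: Dict[int, List[dict]] = {}
    client_id = 0
    for label in group_order:
        group = [s for s in samples if s["speaker"] == label]
        shard_size = -(-len(group) // shards_per_group)  # ceiling division
        for k in range(shards_per_group):
            clients[client_id] = group[k * shard_size:(k + 1) * shard_size]
            client_id += 1
    return clients

# Toy metadata standing in for the MediBeng training split.
toy = ([{"id": i, "speaker": "male"} for i in range(1726)]
       + [{"id": i, "speaker": "female"} for i in range(1726, 3412)])

clients = partition_by_speaker(toy)
sizes = {cid: len(shard) for cid, shard in clients.items()}
print(sizes)  # {0: 863, 1: 863, 2: 843, 3: 843}
```

Each client ends up with speakers of a single label, which is what makes the partition non-IID with respect to speaker characteristics.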
Test evaluation: held-out 960-sample Hugging Face test split (identical to v2).
Evaluation Results
Full test set (n = 960, free generation)
| Setting | WER ↓ | BLEU-4 ↑ | chrF++ ↑ |
|---|---|---|---|
| Baseline (Whisper Tiny, unfine-tuned) | 81.12% | 32.20 | 45.12 |
| Centralised v2 | 28.20% | 73.06 | 79.28 |
| FL FedAvg IID | 6.16% | 93.66 | 94.38 |
| FL FedProx speaker (this model) | 3.12% | 96.33 | 96.70 |
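For readers unfamiliar with the WER column, the sketch below computes word error rate as word-level edit distance divided by the number of reference words. The card does not specify text normalisation or whether scores are pooled over the corpus, so the lowercasing, whitespace tokenisation, and corpus-level pooling here are assumptions, and the sentences are made-up examples rather than MediBeng data.

```python
from typing import List

def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.lower().split(), hyp.lower().split()
        errors += word_edit_distance(r, h)
        words += len(r)
    return errors / words

refs = ["the patient has a fever", "take the medicine twice daily"]
hyps = ["the patient has fever", "take the medicine twice daily"]
wer = corpus_wer(refs, hyps)
print(f"WER: {wer:.2%}")  # WER: 10.00%
```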
FedProx convergence (quick eval, n = 200 per round)
| Round | Quick WER | Quick BLEU | Quick chrF++ |
|---|---|---|---|
| 1 | 36.64% | 57.94 | 63.87 |
| 2 | 7.02% | 92.25 | 93.39 |
| 3 | 3.09% | 96.35 | 96.71 |
Statistical significance (FedProx vs baseline)
| Test | Result |
|---|---|
| Bootstrap 95% CI (WER) | [2.41%, 3.46%] |
| Paired t-test | p = 6.95 × 10⁻¹⁶ |
| McNemar (≤5% WER) | 840/960 FL correct vs 0/960 baseline |
| Effect size (Cohen's d) | 0.27 |
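A bootstrap confidence interval like the one in the first row can be obtained by resampling utterances with replacement and recomputing the mean each time. The sketch below uses the percentile method on per-utterance WER values; whether the reported interval used this exact variant is an assumption, and the input values are synthetic, not the model's actual scores.

```python
import random
from typing import List, Tuple

def bootstrap_ci(values: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic per-utterance WER values: mostly perfect, a few poor utterances.
per_utt_wer = [0.0] * 80 + [0.1] * 15 + [0.5] * 5
mean_wer = sum(per_utt_wer) / len(per_utt_wer)
low, high = bootstrap_ci(per_utt_wer)
print(f"mean WER {mean_wer:.2%}, 95% CI [{low:.2%}, {high:.2%}]")
```

Fixing the seed makes the interval reproducible; a different seed or resample count shifts the bounds slightly.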
Limitations
- Simulated clients — partitions use synthetic TTS speaker labels, not real multi-hospital data.
- 3 FL rounds — convergence may improve with more rounds.
- Centralised upper bound — comparison vs under-trained centralised v2; fully converged centralised baseline pending.
- Full weight transmission — ~156 MB per client per round; LoRA-FL not yet implemented.
- No differential privacy — formal ε–δ guarantees not yet added.
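The ~156 MB figure in the transmission bullet is consistent with a back-of-the-envelope estimate. The sketch assumes roughly 39 million parameters for Whisper Tiny (OpenAI's published approximate figure) stored as 32-bit floats, and counts uploads only; none of these assumptions is stated in the card itself.

```python
PARAMS = 39_000_000   # assumption: approximate Whisper Tiny parameter count
BYTES_PER_PARAM = 4   # assumption: float32 weights
CLIENTS = 4           # from the training configuration
ROUNDS = 3            # from the training configuration

payload_mb = PARAMS * BYTES_PER_PARAM / 1e6
upload_total_mb = payload_mb * CLIENTS * ROUNDS

print(f"{payload_mb:.0f} MB per client per round")      # 156 MB per client per round
print(f"{upload_total_mb:.0f} MB uploaded in total")    # 1872 MB uploaded in total
```

The server's broadcast of global weights to each client adds a comparable amount of download traffic per round.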
Ethical Considerations
- Federated learning reduces raw-data exposure but does not eliminate all privacy risks (model inversion, membership inference).
- Training data may reflect demographic biases; validate before clinical deployment.
- Human review required for all clinical translations.
Blog Post
MediBeng Whisper-Tiny: Translating Code-Switched Bengali-English Speech for Healthcare
Citation
Preprint on medRxiv.
bibtex
@article{ghosh2025medibeng,
  title={MediBeng Whisper Tiny: A fine-tuned code-switched Bengali-English translator for clinical applications},
  author={Ghosh, Promila and Talukder, Sunipun},
  journal={medRxiv},
  year={2025},
  doi={10.1101/2025.04.25.25326406},
  url={https://www.medrxiv.org/content/10.1101/2025.04.25.25326406v2}
}