Model summary
| Field | Value |
|---|---|
| Architecture | OpenAI Whisper large-v3-turbo + PEFT LoRA (merged full weights) |
| Base | jsbeaudry/oswald-large-v3-turbo-m1 |
| Training | LoRA r=32, alpha=64, 4 epochs, grouped script-level val split |
| Final validation WER | 9.21% (epoch 4) |
| Training region | Modal us-central (US), A100-80GB |
| Output language | Haitian Creole (ht) medical phrasing |
Intended use
- Inbound/outbound clinical phone interpretation (triage, pharmacy, vitals, anatomy).
- Telephony-degraded audio (8 kHz band-limited synthetic + real 16 kHz phone captures).
- Research and internal QA on Haitian Creole medical ASR.
Not for: diagnosis, autonomous clinical decision-making, or use without human review of transcripts in care settings.
Limitations
- Trained on scripted / semi-scripted phrases, not full spontaneous clinical dialogue.
- Mixed French loanwords and orthography variants remain challenging.
- 10 in-car driving clips (Oswald auto-labels) add noise diversity but are not script-verified.
- No guarantee that runtime transcripts meet HIPAA Safe Harbor de-identification; operators must follow their organization's policies.
- CT2 deployment (*-ct2 repos) requires separate conversion; this repo contains Hugging Face transformers weights.
Training data (1,592 clips)
Unified corpus: ht_training_combined (Veyatia internal assembly, May 2026).
| Source | Clips | Sample rate | Description |
|---|---|---|---|
| Telephony TTS | 793 | 8 kHz mono | PSTN-simulated medical phrases (300–3400 Hz bandpass, line noise) |
| Synthetic TTS | 739 | 16 kHz mono | ElevenLabs clinical combinator corpus (filler + real-dup rows excluded) |
| Real phone | 50 | 16 kHz mono | Native speaker, scripted medical phrases (iPhone capture) |
| Driving record | 10 | 16 kHz mono | In-car operator recordings of medical phrase practice (Oswald auto-labels, not script-verified) |
Excluded from synthetic manifest: 104 filler templates, 9 duplicates of real-phone text, 148 missing WAV (not in training run).
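The exclusion rules above can be sketched as a single pass over the synthetic manifest. This is an illustrative sketch only: the row keys (`text`, `wav`, `is_filler`), the order in which the rules are checked, and the lowercase text normalization are assumptions, not the actual Veyatia assembly script.

```python
from collections import Counter

def filter_synthetic_manifest(rows, real_phone_texts, existing_wavs):
    """Split manifest rows into kept rows and a count of exclusion reasons.

    rows: iterable of dicts with hypothetical keys "text", "wav", "is_filler".
    real_phone_texts: set of lowercased real-phone transcripts.
    existing_wavs: set of WAV paths that are actually present.
    """
    kept, reasons = [], Counter()
    for row in rows:
        if row["is_filler"]:
            reasons["filler"] += 1          # filler template
        elif row["text"].strip().lower() in real_phone_texts:
            reasons["real_dup"] += 1        # duplicates a real-phone sentence
        elif row["wav"] not in existing_wavs:
            reasons["missing_wav"] += 1     # audio file was never rendered
        else:
            kept.append(row)
    return kept, reasons

# Per-source clip counts as reported in the table above.
SOURCE_CLIPS = {
    "Telephony TTS": 793,
    "Synthetic TTS": 739,
    "Real phone": 50,
    "Driving record": 10,
}
TOTAL_CLIPS = sum(SOURCE_CLIPS.values())  # matches the 1,592 clips in the heading
```

Counting the reasons separately makes it easy to reconcile a manifest against the exclusion figures reported here.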
PHI: Training text is synthetic or scripted; no patient-identifiable fields were intentionally collected. Driving clips are operator-recorded medical phrase practice, not live patient encounters.
Evaluation methodology
- Metric: word error rate (WER) via jiwer on the validation set.
- Split: ~10% of script groups held out (not random clips), so the same sentence does not appear in train and val.
- Eval: predict_with_generate=True each epoch; best checkpoint selected by lowest eval_wer.
- Hardware: NVIDIA A100-80GB; mixed precision (fp16 autocast, float32 weights).
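A grouped, script-level split of the kind described above might look like the following. It is a minimal sketch under stated assumptions: the `script_id` key, the seed, and the rounding of the held-out group count are hypothetical and are not taken from the actual training code.

```python
import random
from collections import defaultdict

def grouped_split(clips, val_fraction=0.10, seed=0):
    """Hold out whole script groups so no sentence appears in both splits.

    clips: list of dicts carrying a hypothetical "script_id" key that is
    shared by every rendering (TTS, telephony, real phone) of one sentence.
    """
    groups = defaultdict(list)
    for clip in clips:
        groups[clip["script_id"]].append(clip)

    group_ids = sorted(groups)                 # sort first for reproducibility
    random.Random(seed).shuffle(group_ids)

    n_val = max(1, round(len(group_ids) * val_fraction))
    val_ids, train_ids = group_ids[:n_val], group_ids[n_val:]

    train = [c for gid in train_ids for c in groups[gid]]
    val = [c for gid in val_ids for c in groups[gid]]
    return train, val
```

Splitting by group rather than by clip matters here because the same sentence exists in several acoustic conditions; a random clip-level split would let the model see validation text during training and understate WER.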
WER by epoch (validation)
| Epoch | eval WER |
|---|---|
| 1 | 14.89% |
| 2 | 10.45% |
| 4 | 9.21% |
(Epoch 3 checkpoint not retained as best; epoch 4 selected.)
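For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain-Python illustration of that definition for a single utterance; the reported numbers come from jiwer, and any text normalization applied before scoring is not specified in this card.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A corpus-level score sums the edit counts and the reference lengths over all utterances before dividing, so it is not the mean of per-utterance values.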
Training procedure
- Base model: jsbeaudry/oswald-large-v3-turbo-m1 (HF transformers, not CT2).
- LoRA on attention projections (q_proj, v_proj, k_proj, out_proj).
- Optimizer: AdamW, LR 8e-5, warmup 5%, batch 4 × grad accum 4 (effective 16).
- Audio: librosa load at 16 kHz; 8 kHz telephony rows upsampled at train time.
- Merge: LoRA adapter merged into base weights → this repository.
- Infra: Modal serverless GPU; corpus on private volume veyatia-training-data.
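The hyperparameters listed above imply a few derived quantities that are worth making explicit. The helper below is a rough sketch: the training-set size is not stated in this card, so `n_train_clips` is left as a parameter, and the exact rounding of step counts depends on the training framework. The LoRA scale uses the standard `alpha / r` convention.

```python
import math

def training_schedule(n_train_clips, epochs=4, per_device_batch=4,
                      grad_accum=4, warmup_ratio=0.05):
    """Derive optimizer-step counts from the settings in this card."""
    effective_batch = per_device_batch * grad_accum          # 4 x 4 = 16
    steps_per_epoch = math.ceil(n_train_clips / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)     # 5% warmup
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

def lora_scale(r=32, alpha=64):
    """Factor applied to the low-rank update before it is added to the base weight."""
    return alpha / r
```

With r=32 and alpha=64 the adapter update is scaled by 2.0 before merging into the base weights.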
HIPAA & deployment notes (operators)
- Training: No intentional PHI in corpus; US-region compute (us-central).
- Inference: Transcripts may contain user-spoken PHI at runtime — treat outputs as sensitive; use encryption, access controls, audit logging, and BAAs as required.
- Retention: Do not use this model card or public HF artifacts to store patient audio or transcripts.
- BAA: HuggingFace Hub hosting is separate from your clinical BAA stack — confirm compliance with your security team.
How to load
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model_id = "veyatia/whisper-creole-medical-v4"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
Forced language: Haitian Creole (ht). For production telephony, prefer CT2 export for CPU inference (see Veyatia RunPod docs).
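A transcription call with the language forced to Haitian Creole could look roughly like this. It is a sketch, not the production path: it assumes a recent transformers release in which `generate` accepts `language` and `task` keyword arguments, that torch is installed, and that the caller supplies mono audio already resampled to 16 kHz. The `transcribe` helper and `GENERATE_KWARGS` name are illustrative.

```python
# Decoding options implied by the forced-language note above.
GENERATE_KWARGS = {"language": "ht", "task": "transcribe"}

def transcribe(audio, sampling_rate=16000,
               model_id="veyatia/whisper-creole-medical-v4"):
    """Return the transcript for one clip.

    audio: 1-D float array of mono samples at `sampling_rate`.
    """
    # Imported lazily so the module can be imported without heavy dependencies.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # The processor converts raw samples into log-mel input features.
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        token_ids = model.generate(inputs.input_features, **GENERATE_KWARGS)
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0]
```

In a long-running service the processor and model would be loaded once and reused across calls rather than inside the function.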
Version history
| Version | Base | Notes |
|---|---|---|
| v1 | openai/whisper-large-v3-turbo | Early Creole medical experiments |
| v2 | Internal Creole HF checkpoint | Pilot WER ~27.3% on clip set |
| v3 | veyatia/whisper-creole-medical-v3 (+ -ct2) | Prior production CT2 path |
| v4 | jsbeaudry/oswald-large-v3-turbo-m1 + LoRA | This release — 1,592-clip corpus, 9.21% val WER |
Developed by Veyatia Health for Haitian Creole medical interpretation infrastructure.
For issues: open a discussion on this repo or contact your Veyatia operator.
License
Apache 2.0 (aligned with base Oswald / Whisper ecosystem). Verify third-party voice data licenses for any new fine-tuning you perform on top of this checkpoint.