License: apache-2.0
Model Description
- Developed by: [Your Name / Username]
- Model Type: Sequence-to-Sequence Generative Audio Transformer (Whisper)
- Language(s): Audio Input: Korean (ko) -> Text Output: English (en)
- Task: End-to-End Translation (`task="translate"`)
- Finetuned from model: `openai/whisper-small`
- Quantization: 4-bit NormalFloat (`NF4`) via `bitsandbytes`
Training & Dataset Architecture
The model was trained on a 66-hour combined dataset that balances domain-specific variety-show vocabulary against general conversational speech at roughly a 70:30 ratio:
- Show-Centric Data (70%): 46 hours of clean broadcast audio slices paired with custom translated English subtitles mapping complex baseball jargon and variety show filler phrases.
- General Data (30%): 20 hours of conversational multi-domain Korean audio to stabilize English sentence structure and prevent domain-specific catastrophic forgetting, drawn from:
  - `didiudom94/my-zeroth-audio-dataset-with-text` (4,000 balanced rows)
  - `Bingsu/KSS_Dataset` (10,000 balanced rows)
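The split can be checked with a few lines of arithmetic. This is only a sanity check on the hour counts quoted above, and the variable names are illustrative:

```python
# Hours of audio in each subset, as stated in the list above.
show_hours = 46
general_hours = 20

total_hours = show_hours + general_hours
show_share = show_hours / total_hours
general_share = general_hours / total_hours

print(f"total: {total_hours} h")
print(f"show-centric: {show_share:.1%}")
print(f"general: {general_share:.1%}")
```

The shares come out slightly off the nominal 70% and 30%, which is why the ratio is best read as approximate.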
LoRA Hyperparameters
- Rank (r): 32
- Alpha (α): 64
- Target Modules: Attention projection layers (`["q_proj", "v_proj", "out_proj", "k_proj"]`)
- Trainable Parameters: 7,077,888 (~2.84% of total weights)
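The trainable-parameter figure can be reproduced by hand. The sketch below assumes the published `openai/whisper-small` dimensions (hidden size 768, 12 encoder layers, 12 decoder layers), which this card does not state, and assumes every attention block in both encoder and decoder receives an adapter on all four target modules:

```python
# Assumed openai/whisper-small dimensions (not stated in this card).
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

# LoRA settings from the list above.
RANK = 32
ALPHA = 64
TARGETS_PER_ATTENTION = 4  # q_proj, k_proj, v_proj, out_proj


def lora_params_per_module(d_in, d_out, rank):
    # LoRA adds two bias-free matrices: A (rank x d_in) and B (d_out x rank).
    return rank * d_in + d_out * rank


# Encoder layers have self-attention only; decoder layers have
# self-attention plus cross-attention.
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION
trainable = adapted_modules * lora_params_per_module(D_MODEL, D_MODEL, RANK)

# Standard LoRA scales the adapter update by alpha / r.
scaling = ALPHA / RANK

print(f"adapted modules: {adapted_modules}")
print(f"trainable parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the count matches the 7,077,888 reported above.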
Training Hyperparameters
- Hardware: NVIDIA A100 / L4 GPUs
- Effective Batch Size: 32 (`per_device_train_batch_size=32`, `gradient_accumulation_steps=1`)
- Precision: Native `bfloat16` (`bf16=True`, `tf32=True`)
- Learning Rate: 3e-5 (Linear decay with 500 warmup steps)
- Epochs: 2 full passes over the fused master dataset
- Model Selection Metric: Cross-Entropy Validation Loss (`metric_for_best_model="loss"`)
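To make the learning-rate setting concrete, here is a minimal sketch of a linear warmup followed by linear decay, using the peak rate and warmup length listed above. The total step count is not given in this card, so `total_steps=4000` below is a purely hypothetical value, and `linear_schedule_lr` is an illustrative helper rather than the training script's own code:

```python
PEAK_LR = 3e-5       # from the list above
WARMUP_STEPS = 500   # from the list above


def linear_schedule_lr(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step` for linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length, chosen only for illustration.
total_steps = 4000
for step in (0, 250, 500, 2250, 4000):
    print(step, linear_schedule_lr(step, total_steps))
```

The rate peaks exactly at the end of warmup and reaches zero on the final step, whatever the real run length was.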
Empirical Performance Benchmarks
Evaluation used a 1,000-row unseen slice drawn from the combined dataset's test split. On this slice, the fine-tuned model's corpus-level BLEU is nearly double that of the original Whisper Small base weights (10.68 vs. 5.55).
| Model Variant | Corpus-Level BLEU Score | Translation Alignment Behavior |
|---|---|---|
| 📉 Original Base OpenAI Whisper Small | 5.55 | Struggles with colloquial variety-show fillers; outputs pure Korean transcription or severe hallucinations. |
| 🚀 Fine-Tuned Burn-to-Win V2 (This Model) | 10.68 | Captures sports enthusiasm and intent, producing fluent, human-like subtitling choices. |
Visual Ground-Truth Subtitle Inspection Sample
- Ground Truth (Reference): `how is he that fast ?`
- Fine-Tuned Model Output: `why is he so fast ?` (Highly contextual, fluent substitution)
- Base OpenAI Whisper Output: `we should find nobody` (Complete hallucination)
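To see why the two outputs score so differently, the sketch below computes clipped n-gram precision, the building block of BLEU, on the sample above. It is not the evaluation script behind the table, and it is not full BLEU: there is no brevity penalty and no averaging over a corpus. The function names are illustrative and tokenization is plain whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-token windows in the sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams that also occur in the reference,
    with each n-gram credited at most as often as the reference has it."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "how is he that fast ?"
fine_tuned = "why is he so fast ?"
base = "we should find nobody"

for name, hyp in [("fine-tuned", fine_tuned), ("base", base)]:
    p1 = clipped_precision(hyp, reference, 1)
    p2 = clipped_precision(hyp, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

The fine-tuned output shares 4 of its 6 unigrams and 2 of its 5 bigrams with the reference, while the base output shares none.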
How to Use & Deploy
To load this fine-tuned checkpoint directly from the Hugging Face Hub using the peft and transformers ecosystem, use the following code snippet:
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig

peft_model_id = "YOUR_HF_USERNAME/YOUR_REPO_NAME"

# 1. Load base configuration and processor
config = PeftConfig.from_pretrained(peft_model_id)
processor = WhisperProcessor.from_pretrained(
    config.base_model_name_or_path, language="english", task="translate"
)

# 2. Initialize underlying base architecture
base_model = WhisperForConditionalGeneration.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)

# 3. Inject fine-tuned adapter layers
model = PeftModel.from_pretrained(base_model, peft_model_id)
model.eval()
print("🎉 Custom translation engine deployed successfully!")
```
Limitations & Best Practices
- Audio Clutter: The model is reported to handle heavy background noise and stadium cheering well, but rapid cross-talk during loud variety segments may cause some captions to be dropped.
- Long-Form Media: For video files over 30 minutes, it is highly recommended to run inference using `stable-ts` (Stable Whisper) with the arguments `chunk_size=30` and `condition_on_previous_text=False` to keep timestamps stable and avoid duplicated trailing tokens.
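As a rough illustration of what fixed-size chunking means for a long file, the sketch below splits a recording into consecutive 30-second windows, the input length Whisper processes at a time. It does not use the `stable-ts` API. `chunk_bounds` is a hypothetical helper, and reading `chunk_size=30` as a value in seconds is an assumption.

```python
def chunk_bounds(duration_s, chunk_s=30.0):
    """Return (start, end) times in seconds for consecutive fixed-size windows.

    Hypothetical helper for illustration only; the last window is shorter
    when the duration is not a multiple of chunk_s.
    """
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A 30-minute file yields 60 windows of 30 seconds each.
windows = chunk_bounds(30 * 60)
print(len(windows), windows[0], windows[-1])
```

With `condition_on_previous_text=False`, each window is decoded without the previous window's text as a prompt, which is the usual way to stop a repetition in one window from carrying into the next.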
Model provider: didiudom94
Model tree: Base: `openai/whisper-small`; Adapter: this model
Modalities: Input: Audio; Output: Text