didiudom94/whisper-small-ko-to-en-v4-cross-attention

Model Description

Developed by: [Your Name / Username]
Model Type: Sequence-to-Sequence Generative Audio Transformer (Whisper)
Language(s): Audio Input: Korean (ko) -> Text Output: English (en)
Task: End-to-End Translation (task="translate")
Finetuned from model: openai/whisper-small
Quantization: 4-bit NormalFloat (NF4) via bitsandbytes

Training & Dataset Architecture

The model was trained on a 66-hour combined dataset that balances domain-specific variety-show vocabulary against general conversational stability at roughly a 70:30 ratio:

Show-Centric Data (70%): 46 hours of clean broadcast audio clips paired with custom English subtitle translations that cover baseball jargon and variety-show filler phrases.
General Data (30%): 20 hours of multi-domain conversational Korean audio, included to stabilize English sentence structure and to limit catastrophic forgetting caused by domain-specific fine-tuning, drawn from:
- didiudom94/my-zeroth-audio-dataset-with-text (4,000 balanced rows)
- Bingsu/KSS_Dataset (10,000 balanced rows)
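
As a quick check on the mix described above, the hour counts can be turned into shares. This is only arithmetic on the figures stated in this card; the variable names are illustrative.

```python
# Hours per source, as stated in this card.
show_hours = 46
general_hours = 20

total_hours = show_hours + general_hours
show_share = 100 * show_hours / total_hours
general_share = 100 * general_hours / total_hours

print(f"total: {total_hours} h")
print(f"show-centric: {show_share:.1f}%")
print(f"general: {general_share:.1f}%")
```

The shares come out slightly off an exact 70:30 split, so the stated ratio is best read as a rounded figure.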

LoRA Hyperparameters

Rank ( $r$ ): 32
Alpha ( $\alpha$ ): 64
Target Modules: Attention projection layers (["q_proj", "k_proj", "v_proj", "out_proj"])
Trainable Parameters: 7,077,888 (~2.84% of total weights)
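
The trainable-parameter figure above can be reproduced from the LoRA settings. The sketch below assumes Whisper Small's published dimensions (hidden size 768, 12 encoder layers, 12 decoder layers), that the adapter wraps every attention block in both the encoder and the decoder, and a frozen base size of 241,734,912 parameters. None of these are stated in this card, and the function names are illustrative.

```python
# Assumed Whisper Small dimensions (not stated in this card).
D_MODEL = 768
ENCODER_LAYERS = 12
DECODER_LAYERS = 12

# Assumed size of the frozen base model, in parameters.
BASE_PARAMS = 241_734_912

# Values taken from this card.
LORA_RANK = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj"]


def lora_params_per_linear(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each wrapped linear layer."""
    return rank * (in_features + out_features)


def count_lora_params():
    # Each encoder layer has one attention block (self-attention);
    # each decoder layer has two (self-attention and cross-attention).
    attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
    wrapped_layers = attention_blocks * len(TARGET_MODULES)
    return wrapped_layers * lora_params_per_linear(D_MODEL, D_MODEL, LORA_RANK)


trainable = count_lora_params()
share = 100 * trainable / (BASE_PARAMS + trainable)
scaling = LORA_ALPHA / LORA_RANK  # factor applied to the low-rank update

print(f"trainable params: {trainable:,}")
print(f"share of all params: {share:.2f}%")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the count matches the figure in this card, which suggests the adapter covers the decoder's cross-attention projections as well as self-attention.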

Training Hyperparameters

Hardware Platform: NVIDIA A100 / L4 GPUs
Effective Batch Size: 32 (per_device_train_batch_size=32, gradient_accumulation_steps=1)
Precision: Native bfloat16 (bf16=True, tf32=True)
Learning Rate: 3e-5 (Linear decay with 500 warmup steps)
Epochs: 2 full passes over the combined dataset
Model Selection Metric: Cross-entropy validation loss (metric_for_best_model="loss")
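
The learning-rate schedule listed above (500 warmup steps, then linear decay) can be sketched as a plain function. The total number of optimizer steps is not given in this card, so `TOTAL_STEPS` below is a hypothetical placeholder, as is the single-GPU assumption used for the effective batch size. The function name is illustrative.

```python
PEAK_LR = 3e-5        # from this card
WARMUP_STEPS = 500    # from this card
TOTAL_STEPS = 5_000   # hypothetical: the real step count is not stated


def linear_warmup_decay_lr(step, peak_lr=PEAK_LR,
                           warmup_steps=WARMUP_STEPS,
                           total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Effective batch size = per-device batch x accumulation steps x device count.
per_device_train_batch_size = 32   # from this card
gradient_accumulation_steps = 1    # from this card
num_devices = 1                    # assumption: single-GPU training
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)

for step in (0, 250, 500, 2_750, 5_000):
    print(step, linear_warmup_decay_lr(step))
print("effective batch size:", effective_batch_size)
```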

Empirical Performance Benchmarks

Evaluation used a 1,000-row held-out slice drawn from the combined dataset's test split. Corpus-level BLEU nearly doubles relative to the unmodified Whisper Small base weights, from 5.55 to 10.68.

Model Variant	Corpus-Level BLEU Score	Observed Translation Behavior
📉 Original Base OpenAI Whisper Small	`5.55`	Struggles with colloquial variety-show fillers; outputs Korean transcription instead of English, or hallucinates heavily.
🚀 Fine-Tuned Burn-to-Win V2 (This Model)	`10.68`	Captures sports enthusiasm and speaker intent, producing fluent, natural-sounding subtitle choices.

Visual Ground-Truth Subtitle Inspection Sample

Ground Truth (Reference): how is he that fast ?
Fine-Tuned Model Output: why is he so fast ? (Fluent paraphrase close to the reference)
Base OpenAI Whisper Output: we should find nobody (Complete hallucination)
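
To make the comparison above concrete, the sketch below computes clipped n-gram precision, the building block of BLEU, for both outputs against the reference. It is a simplified illustration and not the evaluation script behind the BLEU table, which is not shown in this card; it omits BLEU's brevity penalty and the geometric mean over n-gram orders. The function name is illustrative.

```python
from collections import Counter


def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n])
                         for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref_ngrams[gram])
                  for gram, count in hyp_ngrams.items())
    return matched / total


reference = "how is he that fast ?"
fine_tuned = "why is he so fast ?"
base = "we should find nobody"

for name, output in (("fine-tuned", fine_tuned), ("base", base)):
    p1 = ngram_precision(output, reference, 1)
    p2 = ngram_precision(output, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

The fine-tuned output shares four of its six tokens and two of its five bigrams with the reference, while the base output shares none.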

How to Use & Deploy

To load this fine-tuned adapter from the Hugging Face Hub with peft and transformers, use the following snippet:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig

# Placeholder: replace with the Hub repo id of this adapter.
peft_model_id = "YOUR_HF_USERNAME/YOUR_REPO_NAME"

# 1. Load the adapter config and the processor of the base model it points to
config = PeftConfig.from_pretrained(peft_model_id)
processor = WhisperProcessor.from_pretrained(config.base_model_name_or_path, language="english", task="translate")

# 2. Load the base model (device_map="auto" requires the accelerate package)
base_model = WhisperForConditionalGeneration.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto"
)

# 3. Attach the fine-tuned LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(base_model, peft_model_id)
model.eval()
print("🎉 Custom translation engine deployed successfully!")
```

Limitations & Best Practices

Audio Clutter: The model handles heavy background noise and stadium cheering well, but rapid cross-talk during loud variety segments may cause some lines to go uncaptioned.
Long-Form Media: For video files longer than 30 minutes, run inference with stable-ts (Stable Whisper) using chunk_size=30 and condition_on_previous_text=False to keep timestamps stable and avoid duplicated trailing tokens.
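
As a rough illustration of fixed-window chunking for long files, the sketch below splits a recording into 30-second windows at 16 kHz, the sample rate Whisper expects. It assumes chunk_size=30 is measured in seconds and does not reproduce how stable-ts handles chunks internally; the function name is hypothetical.

```python
SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz audio
CHUNK_SECONDS = 30     # assumption: chunk_size=30 is measured in seconds


def chunk_bounds(num_samples, chunk_seconds=CHUNK_SECONDS,
                 sample_rate=SAMPLE_RATE):
    """Return (start, end) sample indices for consecutive fixed-length windows."""
    chunk_len = chunk_seconds * sample_rate
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]


# A 30-minute recording, expressed in samples.
thirty_minutes = 30 * 60 * SAMPLE_RATE
bounds = chunk_bounds(thirty_minutes)
print(len(bounds), bounds[0], bounds[-1])
```

Only the final window can be shorter than 30 seconds, which is why the end index is clamped to the recording length.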
