Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model Description

  • Developed by: [Your Name / Username]
  • Model Type: Sequence-to-Sequence Generative Audio Transformer (Whisper)
  • Language(s): Audio Input: Korean (ko) -> Text Output: English (en)
  • Task: End-to-End Translation (task="translate")
  • Finetuned from model: openai/whisper-small
  • Quantization: 4-bit NormalFloat (NF4) via bitsandbytes

Training & Dataset Architecture

The model was trained on a robust 66-hour master dataset fusion matrix, balancing domain-specific variety show vocabulary with general conversational stability using a precise 70:30 ratio:

  1. Show-Centric Data (70%): 46 hours of clean broadcast audio slices paired with custom translated English subtitles mapping complex baseball jargon and variety show filler phrases.
  2. General Data (30%): 20 hours of conversational multi-domain Korean audio to stabilize English sentence structure and prevent domain-specific catastrophic forgetting, pulled via:
    • didiudom94/my-zeroth-audio-dataset-with-text (4,000 balanced rows)
    • Bingsu/KSS_Dataset (10,000 balanced rows)

LoRA Hyperparameters

  • Rank (r): 32
  • Alpha (α): 64
  • Target Modules: Core Attention Projection Vectors (["q_proj", "v_proj", "out_proj", "k_proj"])
  • Trainable Parameters: 7,077,888 (~2.84% of total weights)

Training Hyperparameters

  • Hardware Platform: NVIDIA A100 / L4 GPU Ecosystem
  • Effective Batch Size: 32 (per_device_train_batch_size=32, gradient_accumulation_steps=1)
  • Precision: Native bfloat16 (bf16=True, tf32=True)
  • Learning Rate: 3e-5 (Linear decay with 500 warmup steps)
  • Epochs: 2 full passes over the fused master dataset
  • Optimization Metric Matrix: Cross-Entropy Validation Loss (metric_for_best_model="loss")

Empirical Performance Benchmarks

Evaluation was run using a 1,000-row unseen evaluation slice drawn directly from the master test matrix. The model demonstrates a near-doubling of string-level contextual translation accuracy over the original out-of-the-box Whisper Small base weights.

Model VariantCorpus-Level BLEU ScoreTranslation Alignment Behavior
📉 Original Base OpenAI Whisper Small5.55Struggles with colloquial variety filters; outputs pure Korean transcription or severe string hallucinations.
🚀 Fine-Tuned Burn-to-Win V2 (This Model)10.68Accurately interprets sports enthusiasm and intent, generating fluent human-like subbing choices.

Visual Ground-Truth Subtitle Inspection Sample

  • Ground Truth (Reference): how is he that fast ?
  • Fine-Tuned Model Output: why is he so fast ? (Highly contextual, fluent substitution)
  • Base OpenAI Whisper Output: we should find nobody (Complete hallucination)

How to Use & Deploy

To load this fine-tuned checkpoint directly from the Hugging Face Hub using the peft and transformers ecosystem, use the following code snippet:

python

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel, PeftConfig
peft_model_id = "YOUR_HF_USERNAME/YOUR_REPO_NAME"
# 1. Load base configuration and processor
config = PeftConfig.from_pretrained(peft_model_id)
processor = WhisperProcessor.from_pretrained(config.base_model_name_or_path, language="english", task="translate")
# 2. Initialize underlying base architecture
base_model = WhisperForConditionalGeneration.from_pretrained(
config.base_model_name_or_path,
torch_dtype=torch.bfloat16,
attn_implementation="sdpa",
device_map="auto"
)
# 3. Inject fine-tuned adapter layers
model = PeftModel.from_pretrained(base_model, peft_model_id)
model.eval()
print("🎉 Custom translation engine deployed successfully!")

Limitations & Best Practices

  • Audio Clutter: While the model handles high background noise and stadium cheering exceptionally well, rapid cross-talking during loud variety segments may cause missing caption links.
  • Long-Form Media: For video files over 30 minutes, it is highly recommended to run inference using stable-ts (Stable Whisper) with arguments chunk_size=30 and condition_on_previous_text=False to maintain constant timestamp stability and eliminate trailing token duplication bugs.

Model provider

didiudom94

didiudom94

Model tree

Base

openai/whisper-small

Adapter

this model

Modalities

Input

Audio

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today