aboalaa1472

whisper-quran-lora-v2

License: mit

📌 Model Overview

| Field | Details |
|---|---|
| Model Repo | aboalaa1472/whisper-quran-lora-v2 |
| Base Weights | naazimsnh02/whisper-large-v3-turbo-ar-quran |
| Architecture | Whisper Large V3 Turbo (Weights Fully Merged) |
| Task | Automatic Speech Recognition (ASR), Quranic Arabic |
| Language | Arabic (Classical / Quranic) |
| License | MIT |

🎯 Project Motivation

Quranic recitation presents unique challenges for general-purpose ASR systems: precise Tajweed rules, elongations (Madd), and a phonetic richness distinct from Modern Standard Arabic.

The initial version (v1) was trained with LoRA (Low-Rank Adaptation) and 8-bit quantization to fit modest hardware. This version (v2) permanently merges those adapted weights back into the base model. Merging removes the dependency on the PEFT library at inference time, avoids loading and applying separate adapter weights, and lets the model run as a standard Whisper checkpoint, which suits deployment in educational and memorization systems.


📂 Dataset

| Split | Source | Samples |
|---|---|---|
| Training | tarteel-ai/everyayah (train) | 5,000 |
| Validation | tarteel-ai/everyayah (validation) | 500 |

The EveryAyah dataset contains high-quality recordings of Quranic verses by various reciters, making it an ideal resource for training recitation-aware ASR models.


⚙️ Core Methodology (v1 → v2)

The model weights were fused in native FP16 precision using the following formula:

$$\theta_{\text{merged}} = \theta_{\text{base}} + \frac{\alpha}{r}\,(AB)$$

Here A and B are the low-rank matrices learned during the 400-step fine-tuning phase, with rank r = 32 and scaling factor α = 64, so the low-rank update is scaled by α/r = 2. Merging these deltas means the model executes as a single, unified computation graph with no PEFT overhead at inference time.
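The merge rule above can be checked numerically on a toy linear layer. The sketch below is not the author's merge script: the layer sizes, random weights, and the helper name `merge_lora` are illustrative assumptions, and the shapes of A (d_out × r) and B (r × d_in) are chosen so that the product AB matches the formula as written. It uses float64 rather than FP16 so the comparison is exact up to rounding, and it shows that the merged matrix gives the same output as running the base layer and the scaled adapter side by side.

```python
import numpy as np


def merge_lora(w_base, a, b, alpha, r):
    """Fold a low-rank update into the base weight: W + (alpha / r) * (A @ B)."""
    return w_base + (alpha / r) * (a @ b)


rng = np.random.default_rng(0)
d_out, d_in = 64, 48          # toy layer size (assumption, not the real model)
r, alpha = 32, 64             # rank and scaling factor quoted in the text

w_base = rng.standard_normal((d_out, d_in))
a = rng.standard_normal((d_out, r)) * 0.01   # A: d_out x r (assumed orientation)
b = rng.standard_normal((r, d_in)) * 0.01    # B: r x d_in (assumed orientation)
x = rng.standard_normal((5, d_in))           # a batch of 5 input vectors

# Unmerged path: base layer output plus the scaled adapter output.
y_adapter = x @ w_base.T + (alpha / r) * (x @ b.T) @ a.T

# Merged path: one matrix multiply with the fused weight.
w_merged = merge_lora(w_base, a, b, alpha, r)
y_merged = x @ w_merged.T

print(np.allclose(y_adapter, y_merged))
```

Because both paths compute the same linear map, the fused weight can replace the base-plus-adapter pair without changing the model's outputs, apart from rounding in the chosen precision.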


🔊 Audio Preprocessing & Augmentation

Decoding Pipeline

All audio files are decoded manually at 16 kHz using librosa.load(), converting raw bytes directly to float32 NumPy arrays. This approach bypasses the default torchcodec audio decoder (which exhibited compatibility issues with the installed PyTorch version) and ensures reproducible behaviour across environments.

Data Augmentation Strategy

To improve robustness to diverse recitation paces and tonal variations, two on-the-fly transformations were applied during training via the audiomentations library:

| Transform | Parameters | Probability | Purpose |
|---|---|---|---|
| PitchShift | ±3 semitones | 50% | Simulates pitch differences across male, female, and child reciters |
| TimeStretch | Rate: 0.85× to 1.15× | 40% | Adapts the model to handle both rapid (Hadr) and deliberate (Tarteel) recitation paces |

🚀 How to Run — Inference

Since this is a fully merged model, you do not need the peft library. Load it directly as a standard Whisper model.

1. Install Dependencies

bash

pip install transformers torch librosa accelerate

2. Load and Transcribe

python

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig

# ── Configuration ────────────────────────────────────────────────
MODEL_ID = "aboalaa1472/whisper-quran-lora-v2"
AUDIO_PATH = "your_audio.wav"  # 16 kHz mono WAV recommended
# ─────────────────────────────────────────────────────────────────

# 1. Load processor and merged model
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Attach stable generation config for Whisper Turbo architectures
model.generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v3-turbo")
model.generation_config.forced_decoder_ids = (
    processor.get_decoder_prompt_ids(language="arabic", task="transcribe")
)
model.eval()

# 3. Load and preprocess audio
audio_array, _ = librosa.load(AUDIO_PATH, sr=16000)
inputs = processor.feature_extractor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).to(model.device, dtype=torch.float16)

# 4. Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        max_new_tokens=444,
        num_beams=1,
    )

transcription = processor.tokenizer.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0].strip()
print("📖 Transcription:", transcription)
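Whisper's feature extractor pads or truncates each input to a 30-second window by default, so with the snippet above a recording longer than that would lose its tail. One simple workaround, sketched here as an assumption rather than as part of the original README, is to split the waveform into windows of at most 30 seconds and transcribe each in turn. The helper name `split_into_windows` is hypothetical, and the fixed-size cut can land in the middle of a word, so splitting at pauses would be preferable for real recitations.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SECONDS = 30  # length of Whisper's fixed input window


def split_into_windows(audio, sr=SAMPLE_RATE, seconds=WINDOW_SECONDS):
    """Cut a 1-D waveform into consecutive chunks of at most `seconds` each."""
    step = sr * seconds
    return [audio[i:i + step] for i in range(0, len(audio), step)]


# 70 seconds of silence stands in for a long recitation.
long_audio = np.zeros(SAMPLE_RATE * 70, dtype=np.float32)
chunks = split_into_windows(long_audio)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then be passed through steps 3 and 4 of the inference script in place of `audio_array`, and the resulting strings joined in order.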

🏗️ Repository Structure

Unlike adapter repos, this repository contains the complete weight matrices and configuration files required for native execution:

markdown

aboalaa1472/whisper-quran-lora-v2/
├── config.json # Main model architectural configuration
├── model.safetensors # Full fused model weights (~1.6 GB)
├── generation_config.json # Text generation parameters
├── preprocessor_config.json # Audio feature extractor settings (16 kHz, Mel banks)
├── tokenizer_config.json # Tokenizer execution behaviours
├── tokenizer.json # Vocabulary and subword token mappings
└── README.md # This file

📖 Citation

If you use this model in your research or project, please cite:

bibtex

@misc{whisper-quran-lora-v2,
author = {aboalaa1472},
title = {Whisper Quranic ASR — Fully Merged Fine-Tuned Model (v2)},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/aboalaa1472/whisper-quran-lora-v2}
}

🤝 Acknowledgements
