aboalaa1472/whisper-quran-lora-v2 API & Inference Endpoint

📌 Model Overview

| Field | Details |
|---|---|
| Model Repo | `aboalaa1472/whisper-quran-lora-v2` |
| Base Weights | `naazimsnh02/whisper-large-v3-turbo-ar-quran` |
| Architecture | Whisper Large V3 Turbo (Weights Fully Merged) |
| Task | Automatic Speech Recognition (ASR) for Quranic Arabic |
| Language | Arabic (Classical / Quranic) |
| License | MIT |

🎯 Project Motivation

Quranic recitation presents unique challenges for general-purpose ASR systems: precise Tajweed rules, elongations (Madd), and a phonetic richness distinct from Modern Standard Arabic.

While the initial version (v1) was trained with LoRA (Low-Rank Adaptation) and 8-bit quantization to fit modest hardware, this version (v2) permanently merges those adapted weights back into the base model. Merging removes the dependency on PEFT libraries at inference time, avoids loading a separate adapter alongside the base weights, and lets the model run as a standard Whisper checkpoint in real-world educational and memorization systems.

📂 Dataset

| Split | Source | Samples |
|---|---|---|
| Training | `tarteel-ai/everyayah` (`train` split) | 5,000 |
| Validation | `tarteel-ai/everyayah` (`validation` split) | 500 |

The EveryAyah dataset contains high-quality recordings of Quranic verses by various reciters, making it an ideal resource for training recitation-aware ASR models.

⚙️ Core Methodology (v1 → v2)

The model weights were fused in native FP16 precision using the following formula:

$\theta_{\text{merged}} = \theta_{\text{base}} + \frac{\alpha}{r}\,(A \cdot B)$

Here $A$ and $B$ are the low-rank matrices learned during the 400-step fine-tuning phase ($r = 32$, $\alpha = 64$), so the low-rank update is scaled by $\alpha / r = 2$. Merging these deltas means the model runs as a single, unified computation graph with no PEFT overhead at inference time.
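The merge formula can be checked on a toy layer. The sketch below is illustrative only and is not the script used to build this repository: the matrix shapes, the helper name `merge_lora`, and the toy dimensions are assumptions. It follows the notation above, with $A$ of shape `d_out × r` and $B$ of shape `r × d_in`. Note that the `peft` library names its factors the other way round (`lora_B @ lora_A`).

```python
import numpy as np


def merge_lora(w_base, a, b, r, alpha):
    """Fold a low-rank update into a dense weight: W + (alpha / r) * (A @ B)."""
    scale = alpha / r
    return w_base + scale * (a @ b)


# Toy dimensions, assumed for illustration; real Whisper layers are far larger.
d_out, d_in, r, alpha = 6, 5, 2, 4

rng = np.random.default_rng(0)
w_base = rng.normal(size=(d_out, d_in))
a = rng.normal(size=(d_out, r))  # A: d_out x r (shape is an assumption)
b = rng.normal(size=(r, d_in))   # B: r x d_in (shape is an assumption)

w_merged = merge_lora(w_base, a, b, r, alpha)

# The merged layer should reproduce "base path + scaled adapter path".
x = rng.normal(size=(3, d_in))
y_adapter = x @ w_base.T + (alpha / r) * (x @ b.T @ a.T)
y_merged = x @ w_merged.T

print("outputs match:", np.allclose(y_adapter, y_merged))
```

With the values reported for this model ($r = 32$, $\alpha = 64$), `scale` would be 2.0.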

🔊 Audio Preprocessing & Augmentation

Decoding Pipeline

All audio files are decoded manually at 16 kHz using `librosa.load()`, converting raw bytes directly to float32 NumPy arrays. This approach bypasses the default `torchcodec` audio decoder (which exhibited compatibility issues with the installed PyTorch version) and ensures reproducible behaviour across environments.

Data Augmentation Strategy

To improve robustness to diverse recitation paces and tonal variations, two on-the-fly transformations were applied during training via the audiomentations library:

| Transform | Parameters | Probability | Purpose |
|---|---|---|---|
| PitchShift | ±3 semitones | 50% | Simulates pitch differences across male, female, and child reciters |
| TimeStretch | Rate: 0.85× – 1.15× | 40% | Adapts the model to handle both rapid (Hadr) and deliberate (Tarteel) recitation paces |

🚀 How to Run — Inference

Since this is a fully merged model, you do not need the `peft` library. Load it directly as a standard Whisper model.

1. Install Dependencies

```bash
pip install transformers torch librosa accelerate
```

2. Load and Transcribe

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig

# ── Configuration ────────────────────────────────────────────────
MODEL_ID   = "aboalaa1472/whisper-quran-lora-v2"
AUDIO_PATH = "your_audio.wav"   # 16 kHz mono WAV recommended
# ─────────────────────────────────────────────────────────────────

# 1. Load processor and merged model
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Attach stable generation config for Whisper Turbo architectures
model.generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v3-turbo")
model.generation_config.forced_decoder_ids = (
    processor.get_decoder_prompt_ids(language="arabic", task="transcribe")
)
model.eval()

# 3. Load and preprocess audio (librosa resamples to 16 kHz if needed)
audio_array, _ = librosa.load(AUDIO_PATH, sr=16000)

inputs = processor.feature_extractor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).to(model.device, dtype=torch.float16)

# 4. Generate transcription
# 444 new tokens + 4 decoder prompt tokens = 448, Whisper's decoder length limit.
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        max_new_tokens=444,
        num_beams=1,
    )

transcription = processor.tokenizer.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0].strip()

print("📖 Transcription:", transcription)
```

🏗️ Repository Structure

Unlike adapter repos, this repository contains the complete weight matrices and configuration files required for native execution:

```text
aboalaa1472/whisper-quran-lora-v2/
├── config.json               # Main model architectural configuration
├── model.safetensors         # Full fused model weights (~1.6 GB)
├── generation_config.json    # Text generation parameters
├── preprocessor_config.json  # Audio feature extractor settings (16 kHz, Mel banks)
├── tokenizer_config.json     # Tokenizer execution behaviours
├── tokenizer.json            # Vocabulary and subword token mappings
└── README.md                 # This file
```

📖 Citation

If you use this model in your research or project, please cite:

```bibtex
@misc{whisper-quran-lora-v2,
  author    = {aboalaa1472},
  title     = {Whisper Quranic ASR --- Fully Merged Fine-Tuned Model (v2)},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/aboalaa1472/whisper-quran-lora-v2}
}
```

🤝 Acknowledgements

OpenAI Whisper for the foundational architecture.
Tarteel AI for the EveryAyah dataset.
Hugging Face Transformers for the seamless deployment ecosystem.
naazimsnh02 for the Quranic-adapted base model.
