📌 Model Overview
| Field | Details |
|---|---|
| Adapter Repo | aboalaa1472/whisper-quran-lora-v1 |
| Base Model | naazimsnh02/whisper-large-v3-turbo-ar-quran |
| Architecture | Whisper Large V3 Turbo + LoRA (PEFT) |
| Task | Automatic Speech Recognition (ASR) — Quranic Arabic |
| Language | Arabic (Classical / Quranic) |
| License | MIT |
🎯 Project Motivation
Quranic recitation poses challenges that general-purpose ASR systems handle poorly: precise tajweed rules, elongations (madd), and a phonetic richness distinct from Modern Standard Arabic. This project addresses them by fine-tuning a specialized Whisper variant with LoRA, a parameter-efficient technique that adapts a large pre-trained model by training only a small fraction of its parameters, which makes training feasible on modest hardware.
This work was conducted as part of a Computer Engineering Graduation Project with the goal of improving accessibility tools for Quranic recitation evaluation, memorization aids, and educational platforms.
📂 Dataset
The EveryAyah dataset contains high-quality recordings of Quranic verses by various reciters, making it an ideal resource for training recitation-aware ASR models.
⚙️ Training Configuration
LoRA Hyperparameters
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| LoRA Alpha (lora_alpha) | 64 |
| Target Modules | q_proj, v_proj |
| LoRA Dropout | 0.05 |
| Bias | none |
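As a rough check on what these settings imply, the sketch below counts the trainable parameters LoRA adds and the scaling factor applied to each low-rank update. The architecture figures (hidden size 1280, 32 encoder layers, 4 decoder layers, about 809M total parameters) are assumptions based on the published Whisper Large V3 Turbo configuration; they are not stated in this card.

```python
# Assumed Whisper Large V3 Turbo dimensions (not stated in this card).
D_MODEL = 1280
ENCODER_LAYERS = 32
DECODER_LAYERS = 4
TOTAL_PARAMS = 809_000_000  # approximate

# LoRA settings from the table above.
RANK = 32
LORA_ALPHA = 64
TARGET_MODULES = ("q_proj", "v_proj")


def lora_trainable_params(d_model, enc_layers, dec_layers, rank, n_targets):
    """Count LoRA parameters for square d_model x d_model projections."""
    # Each encoder layer has one self-attention block; each decoder layer
    # has a self-attention block and a cross-attention block.
    attention_blocks = enc_layers + 2 * dec_layers
    adapted_matrices = attention_blocks * n_targets
    # Each adapted matrix gains A (rank x d_model) and B (d_model x rank).
    params_per_matrix = 2 * rank * d_model
    return adapted_matrices * params_per_matrix


trainable = lora_trainable_params(
    D_MODEL, ENCODER_LAYERS, DECODER_LAYERS, RANK, len(TARGET_MODULES)
)
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update B @ A

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of total parameters: {trainable / TOTAL_PARAMS:.2%}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter has roughly 6.5M trainable parameters, which is under 1% of the model and consistent with an adapter file in the tens of megabytes.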
Training Hyperparameters
| Parameter | Value |
|---|---|
| Max Steps | 400 |
| Per-device Batch Size | 4 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 16 |
| Learning Rate | 1e-5 |
| Warmup Steps | 50 |
| Mixed Precision | FP16 |
| Quantization | 8-bit (BitsAndBytes) |
| Evaluation Strategy | Every 100 steps |
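The batch-size and step figures can be cross-checked with simple arithmetic. The sketch below assumes a single GPU, in line with the single consumer GPU mentioned in the Technical Notes, and infers an approximate training-set size from the step-to-epoch ratio reported in the Training Results section. That size is an inference, not a documented figure.

```python
PER_DEVICE_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_DEVICES = 1  # assumption: single-GPU training
MAX_STEPS = 400

# Samples that contribute to each optimizer step.
effective_batch_size = (
    PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
)

# Total samples processed over the whole run.
samples_seen = effective_batch_size * MAX_STEPS

# Training Results reports step 100 at epoch 0.32, which implies the
# approximate number of training samples in one epoch.
samples_per_epoch = round(effective_batch_size * 100 / 0.32)
epochs_completed = samples_seen / samples_per_epoch

print("effective batch size:", effective_batch_size)
print("samples seen:", samples_seen)
print("inferred samples per epoch:", samples_per_epoch)
print("epochs completed:", epochs_completed)
```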
🔊 Audio Preprocessing & Augmentation
The default torchcodec audio decoder was deliberately bypassed because of compatibility constraints with the installed PyTorch version. A custom librosa-based pipeline handles audio loading instead, giving robust, format-agnostic decoding.
Decoding Pipeline
All audio files are decoded manually at 16 kHz using librosa.load(), converting raw bytes directly to float32 numpy arrays. This approach avoids dependency on the system codec stack and ensures reproducible behaviour across environments.
Data Augmentation Strategy
To improve the model's robustness to diverse recitation styles, microphone qualities, and recording conditions, two augmentation transforms were applied on-the-fly during training using the audiomentations library:
| Transform | Parameters | Probability |
|---|---|---|
| PitchShift | ±3 semitones | 50% |
| TimeStretch | Rate: 0.85× to 1.15× | 40% |
- PitchShift simulates tonal variation between reciters (children, women, men) and exposes the model to pitch-shifted recitations without altering the linguistic content.
- TimeStretch teaches the model to handle both rapid and deliberate recitation paces, improving generalisation across tajweed styles.
Augmentation is applied after decoding and before feature extraction, ensuring the mel spectrogram computed by the Whisper processor reflects the augmented signal.
📈 Training Results
Evaluation loss was monitored every 100 steps on the 500-sample validation split:
| Step | Epoch | Eval Loss |
|---|---|---|
| 100 | 0.32 | 0.3435 |
| 200 | 0.64 | 0.3064 |
| 300 | 0.96 | 0.3014 |
| 400 | 1.28 | 0.3029 |
The model converged quickly: eval loss fell from 0.3435 at step 100 to a minimum of 0.3014 at step 300 (a reduction of about 12.3%), then rose slightly to 0.3029 at step 400. Most of the adaptation to Quranic speech patterns therefore happened within the first epoch of training data.
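The reported improvement can be reproduced directly from the table. The short sketch below recomputes the relative reduction and identifies the lowest-loss evaluation step, using only the values listed above.

```python
# Eval loss per evaluation step, copied from the results table.
eval_loss = {100: 0.3435, 200: 0.3064, 300: 0.3014, 400: 0.3029}

first_step = min(eval_loss)                     # earliest evaluation
best_step = min(eval_loss, key=eval_loss.get)   # step with the lowest loss
relative_drop = (
    (eval_loss[first_step] - eval_loss[best_step]) / eval_loss[first_step]
)

print(f"best evaluation step: {best_step} (loss {eval_loss[best_step]:.4f})")
print(f"relative reduction from step {first_step}: {relative_drop:.1%}")
```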
🚀 How to Run — Inference
1. Install Dependencies
pip install transformers peft torch librosa bitsandbytes accelerate
2. Load the Adapter and Transcribe
import librosa
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

BASE_MODEL_ID = "naazimsnh02/whisper-large-v3-turbo-ar-quran"
ADAPTER_REPO = "aboalaa1472/whisper-quran-lora-v1"
AUDIO_PATH = "your_audio.wav"

# Load the processor and the FP16 base model, then attach the LoRA adapter.
processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID)
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Decode the audio to a 16 kHz mono float32 array.
audio_array, _ = librosa.load(AUDIO_PATH, sr=16000)

# Compute log-mel features, then move them to the model's device and cast
# them to FP16 so they match the dtype of the model weights.
inputs = processor.feature_extractor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
)
input_features = inputs.input_features.to(model.device, dtype=torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features=input_features,
        language="ar",
        task="transcribe",
    )

transcription = processor.tokenizer.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0]
print("📖 Transcription:", transcription)
Note: The adapter contains only the LoRA weight deltas (tens of MB). The base model (~3 GB) is downloaded separately from its own Hugging Face repository.
🏗️ Repository Structure
aboalaa1472/whisper-quran-lora-v1/
├── adapter_config.json # LoRA configuration (r, alpha, target_modules …)
├── adapter_model.safetensors # Trained LoRA weight deltas
└── README.md # This file
🔬 Technical Notes
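- Why LoRA? Full fine-tuning of Whisper Large V3 Turbo needs on the order of 12 GB of GPU memory for weights, gradients, and Adam optimizer states alone (roughly 16 bytes per parameter in FP32 for a model of about 0.8B parameters), before counting activations. With LoRA r=32 applied to q_proj and v_proj, the number of trainable parameters is reduced by over 99%, enabling training on a single consumer GPU.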
- Why 8-bit Quantization? Combined with LoRA, 8-bit quantization via BitsAndBytes further reduces the memory footprint of the frozen base model, making the setup accessible on GPUs with 16 GB VRAM.
- Label Masking: Padding tokens in the target sequence are masked with -100 before computing the cross-entropy loss, preventing the model from learning to predict padding.
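The label-masking step can be sketched as follows. This is an illustration in PyTorch with made-up token IDs and a hypothetical `mask_padding` helper, not the project's data collator. It relies on the fact that `torch.nn.functional.cross_entropy` ignores targets equal to `-100` by default.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy


def mask_padding(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Replace padded positions in the label tensor with IGNORE_INDEX."""
    return input_ids.masked_fill(attention_mask == 0, IGNORE_INDEX)


# Two toy label sequences; the second is padded with token ID 0.
input_ids = torch.tensor([[11, 12, 13, 14], [21, 22, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
labels = mask_padding(input_ids, attention_mask)
print(labels.tolist())  # [[11, 12, 13, 14], [21, 22, -100, -100]]

# Random logits stand in for model output; masked positions are skipped
# when the mean loss is computed.
vocab_size = 32
logits = torch.randn(2, 4, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss.item())
```

Masking on the attention mask rather than on the padding token ID keeps genuine occurrences of that token in the target, which matters when the padding token doubles as an end-of-sequence token.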
📖 Citation
If you use this model in your research or project, please cite this adapter as shown below, and also credit the base model and dataset:
@misc{whisper-quran-lora-v1,
author = {aboalaa1472},
title = {Whisper Quranic ASR — LoRA Fine-Tuned Adapter},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/aboalaa1472/whisper-quran-lora-v1}
}
🤝 Acknowledgements