aboalaa1472

whisper-quran-lora-v1

📌 Model Overview

| Field | Details |
|---|---|
| Adapter Repo | `aboalaa1472/whisper-quran-lora-v1` |
| Base Model | `naazimsnh02/whisper-large-v3-turbo-ar-quran` |
| Architecture | Whisper Large V3 Turbo + LoRA (PEFT) |
| Task | Automatic Speech Recognition (ASR), Quranic Arabic |
| Language | Arabic (Classical / Quranic) |
| License | MIT |

🎯 Project Motivation

Quranic recitation presents unique challenges for general-purpose ASR systems: precise tajweed rules, elongations (madd), and a phonetic richness distinct from Modern Standard Arabic. This project addresses those challenges by fine-tuning a specialized Whisper variant using LoRA, a parameter-efficient technique that adapts large pre-trained models with a fraction of the trainable parameters — making it feasible to train on modest hardware without sacrificing quality.

This work was conducted as part of a Computer Engineering Graduation Project with the goal of improving accessibility tools for Quranic recitation evaluation, memorization aids, and educational platforms.

📂 Dataset

| Split | Source | Samples |
|---|---|---|
| Training | `tarteel-ai/everyayah` (`train` split) | 5,000 |
| Validation | `tarteel-ai/everyayah` (`validation` split) | 500 |

The EveryAyah dataset contains high-quality recordings of Quranic verses by various reciters, making it an ideal resource for training recitation-aware ASR models.

⚙️ Training Configuration

LoRA Hyperparameters

| Parameter | Value |
|---|---|
| Rank (`r`) | 32 |
| LoRA Alpha (`lora_alpha`) | 64 |
| Target Modules | `q_proj`, `v_proj` |
| LoRA Dropout | 0.05 |
| Bias | none |

Training Hyperparameters

| Parameter | Value |
|---|---|
| Max Steps | 400 |
| Per-device Batch Size | 4 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 16 |
| Learning Rate | 1e-5 |
| Warmup Steps | 50 |
| Mixed Precision | FP16 |
| Quantization | 8-bit (BitsAndBytes) |
| Evaluation Strategy | Every 100 steps |
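
The effective batch size, and the fractional epoch values reported alongside each evaluation, follow from these settings and the 5,000-sample training split. A quick check, assuming a single GPU (the device count is not stated in this card, so `NUM_DEVICES` is an assumption):

```python
# Values from the tables in this card; NUM_DEVICES = 1 is an assumption.
TRAIN_SAMPLES = 5_000
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
NUM_DEVICES = 1
MAX_STEPS = 400

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# Optimizer steps needed to see the whole training split once.
steps_per_epoch = TRAIN_SAMPLES / effective_batch


def epochs_at(step: int) -> float:
    """Fraction of the training split seen after `step` optimizer steps."""
    return step / steps_per_epoch


print("effective batch size:", effective_batch)
print("steps per epoch:", steps_per_epoch)
for step in range(100, MAX_STEPS + 1, 100):
    print(f"step {step}: epoch {epochs_at(step):.2f}")
```

Under these assumptions, 400 steps correspond to about 1.28 passes over the training split.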

🔊 Audio Preprocessing & Augmentation

A deliberate decision was made to bypass the default torchcodec audio decoder due to compatibility constraints with the installed PyTorch version. Instead, a custom pipeline using librosa was implemented for robust, format-agnostic audio loading.

Decoding Pipeline

All audio files are decoded manually at 16 kHz using librosa.load(), converting raw bytes directly to float32 numpy arrays. This approach avoids dependency on the system codec stack and ensures reproducible behaviour across environments.

Data Augmentation Strategy

To improve the model's robustness to diverse recitation styles, microphone qualities, and recording conditions, two augmentation transforms were applied on-the-fly during training using the audiomentations library:

| Transform | Parameters | Probability |
|---|---|---|
| PitchShift | ±3 semitones | 50% |
| TimeStretch | Rate: 0.85× to 1.15× | 40% |

PitchShift simulates tonal variation between reciters (children, women, men) without altering the linguistic content.
TimeStretch teaches the model to handle both rapid and deliberate recitation paces, improving generalisation across tajweed styles.

Augmentation is applied after decoding and before feature extraction, ensuring the mel spectrogram computed by the Whisper processor reflects the augmented signal.

📈 Training Results

Evaluation loss was monitored every 100 steps on the 500-sample validation split:

| Step | Epoch | Eval Loss |
|---|---|---|
| 100 | 0.32 | 0.3435 |
| 200 | 0.64 | 0.3064 |
| 300 | 0.96 | 0.3014 |
| 400 | 1.28 | 0.3029 |

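The model converged quickly: eval loss fell from 0.3435 at step 100 to a minimum of 0.3014 at step 300, a relative reduction of about 12.3%, and then rose marginally to 0.3029 at step 400. Most of the adaptation to Quranic speech patterns therefore happened within the first pass over the training data.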
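
The relative improvement can be reproduced directly from the logged values in the table:

```python
# Eval loss per evaluation step, copied from the results table.
eval_loss = {100: 0.3435, 200: 0.3064, 300: 0.3014, 400: 0.3029}

first_step = min(eval_loss)                     # earliest evaluation
best_step = min(eval_loss, key=eval_loss.get)   # step with the lowest loss

# Relative reduction from the first evaluation to the best one.
relative_drop = (eval_loss[first_step] - eval_loss[best_step]) / eval_loss[first_step]

print(f"best step: {best_step}")
print(f"relative reduction: {relative_drop:.1%}")
```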

🚀 How to Run — Inference

1. Install Dependencies

```bash
pip install transformers peft torch librosa bitsandbytes accelerate
```

2. Load the Adapter and Transcribe

```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

# ── Configuration ────────────────────────────────────────────────
BASE_MODEL_ID  = "naazimsnh02/whisper-large-v3-turbo-ar-quran"
ADAPTER_REPO   = "aboalaa1472/whisper-quran-lora-v1"
AUDIO_PATH     = "your_audio.wav"   # 16 kHz mono WAV recommended
# ─────────────────────────────────────────────────────────────────

processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID)

# Load the frozen base model in half precision (intended for GPU inference)
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the LoRA adapter weights on top of the base model
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model.eval()

# Load audio as mono float32, resampled to 16 kHz
audio_array, _ = librosa.load(AUDIO_PATH, sr=16000)

inputs = processor.feature_extractor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
)
# Move features to the model's device and cast to float16;
# float32 features would clash with the half-precision weights
input_features = inputs.input_features.to(model.device, dtype=torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(
        input_features=input_features,
        language="ar",
        task="transcribe",
    )

transcription = processor.tokenizer.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0]

print("📖 Transcription:", transcription)
```

Note: The adapter contains only the LoRA weight deltas (~tens of MBs). The base model (~3 GB) is downloaded separately from its own HuggingFace repository.
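
The "tens of MBs" figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the published Whisper Large V3 Turbo layout (hidden size 1280, 32 encoder layers, 4 decoder layers, roughly 809M parameters in total) and that `q_proj` and `v_proj` are adapted in every self-attention and cross-attention block. These numbers are assumptions, not values read from this repository.

```python
# Assumed architecture constants for Whisper Large V3 Turbo (not from this card).
D_MODEL = 1280
ENCODER_LAYERS = 32
DECODER_LAYERS = 4
BASE_PARAMS = 809_000_000  # approximate total parameter count

# LoRA settings from this card.
RANK = 32
TARGETS_PER_ATTENTION = 2  # q_proj and v_proj

# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_modules = attention_blocks * TARGETS_PER_ATTENTION

# Per adapted module: A has shape (r, d_in) and B has shape (d_out, r).
params_per_module = RANK * (D_MODEL + D_MODEL)
lora_params = adapted_modules * params_per_module

size_mb_fp32 = lora_params * 4 / 1e6
trainable_fraction = lora_params / BASE_PARAMS

print(f"LoRA parameters: {lora_params:,}")
print(f"adapter size at 4 bytes per parameter: {size_mb_fp32:.1f} MB")
print(f"share of base parameters: {trainable_fraction:.2%}")
```

Under these assumptions the adapter holds about 6.6 million parameters, which is well under 1% of the base model.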

🏗️ Repository Structure

```text
aboalaa1472/whisper-quran-lora-v1/
├── adapter_config.json        # LoRA configuration (r, alpha, target_modules …)
├── adapter_model.safetensors  # Trained LoRA weight deltas
└── README.md                  # This file
```

🔬 Technical Notes

Why LoRA? Full fine-tuning of Whisper Large V3 Turbo needs roughly 12 GB of GPU VRAM for the weights, gradients, and optimizer states alone, before any activations are stored. With LoRA r=32 applied to q_proj and v_proj, the number of trainable parameters is reduced by over 99%, enabling training on a single consumer GPU.
Why 8-bit Quantization? Combined with LoRA, 8-bit quantization via BitsAndBytes further reduces the memory footprint of the frozen base model, making the setup accessible on GPUs with 16 GB VRAM.
Label Masking: Padding tokens in the target sequence are masked with -100 before computing the cross-entropy loss, preventing the model from learning to predict padding.
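
The label-masking step can be illustrated without any framework. The sketch below is a simplified stand-in for a data collator, using plain Python lists; the function name `pad_and_mask` and the toy token ids are hypothetical. It masks by position rather than by token value, which matters for tokenizers where the padding token shares an id with the end-of-text token.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_and_mask(
    label_ids: list[list[int]], pad_token_id: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad token sequences and build loss labels with padding masked.

    Returns (padded_ids, labels). Padded positions hold `pad_token_id`
    in `padded_ids` and IGNORE_INDEX in `labels`.
    """
    max_len = max(len(seq) for seq in label_ids)
    padded_ids, labels = [], []
    for seq in label_ids:
        n_pad = max_len - len(seq)
        padded_ids.append(seq + [pad_token_id] * n_pad)
        # Masking by position keeps a genuine end-of-text token in the loss
        # even when it has the same id as the padding token.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return padded_ids, labels


# Toy batch: id 2 plays the role of both end-of-text and padding.
batch = [[5, 6, 7, 2], [8, 2]]
padded_ids, labels = pad_and_mask(batch, pad_token_id=2)
print(padded_ids)
print(labels)
```

In a tensor-based collator the same effect is usually obtained by filling the positions where the attention mask is zero with -100.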

📖 Citation

If you use this model in your research or project, please cite this adapter and credit the base model and dataset listed above:

```bibtex
@misc{whisper-quran-lora-v1,
  author    = {aboalaa1472},
  title     = {Whisper Quranic ASR — LoRA Fine-Tuned Adapter},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/aboalaa1472/whisper-quran-lora-v1}
}
```

🤝 Acknowledgements

OpenAI Whisper for the foundational architecture.
Tarteel AI for the EveryAyah dataset.
Hugging Face PEFT for the LoRA implementation.
BitsAndBytes for efficient quantization.

