aboalaa1472
whisper-quran-lora-v2
License: mit
📌 Model Overview
| Field | Details |
|---|---|
| Model Repo | aboalaa1472/whisper-quran-lora-v2 |
| Base Weights | naazimsnh02/whisper-large-v3-turbo-ar-quran |
| Architecture | Whisper Large V3 Turbo (Weights Fully Merged) |
| Task | Automatic Speech Recognition (ASR) — Quranic Arabic |
| Language | Arabic (Classical / Quranic) |
| License | MIT |
🎯 Project Motivation
Quranic recitation presents unique challenges for general-purpose ASR systems: precise Tajweed rules, elongations (Madd), and a phonetic richness distinct from Modern Standard Arabic.
The initial version (v1) was trained with LoRA (Low-Rank Adaptation) and 8-bit quantization to fit modest hardware. This version (v2) permanently merges those adapted weights back into the base model. The merged checkpoint needs no PEFT library at inference time, avoids loading adapter weights separately, and removes the adapter's extra computation from the forward pass, which suits real-world deployment in educational and memorization systems.
📂 Dataset
| Split | Source | Samples |
|---|---|---|
| Training | tarteel-ai/everyayah — train | 5,000 |
| Validation | tarteel-ai/everyayah — validation | 500 |
The EveryAyah dataset contains high-quality recordings of Quranic verses by various reciters, making it an ideal resource for training recitation-aware ASR models.
⚙️ Core Methodology (v1 → v2)
The model weights were fused in native FP16 precision using the following formula:
$$\theta_{\text{merged}} = \theta_{\text{base}} + \frac{\alpha}{r}\,(A \cdot B)$$
Where A and B are the low-rank matrices learned during the 400-step fine-tuning phase (r=32, α=64). Merging these deltas ensures that the model executes as a single, unified computation graph with no PEFT overhead at inference time.
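The merge formula can be checked numerically on a toy layer. The sketch below is not the author's merge script: the matrix sizes, the random values, and the helper name `merge_lora` are illustrative assumptions, and `A` and `B` are shaped so that the product `A @ B` matches the ordering used in the formula (PEFT's own naming computes the delta as `B @ A`). The toy values `r=2, alpha=4` give the same scale factor, `alpha / r = 2`, as the card's `r=32, α=64`.

```python
import numpy as np

def merge_lora(theta_base, A, B, alpha, r):
    """Return theta_base + (alpha / r) * (A @ B), following the formula above."""
    return theta_base + (alpha / r) * (A @ B)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 6, 2, 4   # toy sizes (assumption); the card uses r=32, alpha=64

theta_base = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(d_out, r))      # shaped so that A @ B has the shape of theta_base
B = rng.normal(size=(r, d_in))
x = rng.normal(size=(d_in,))

theta_merged = merge_lora(theta_base, A, B, alpha, r)

# Adapter path: base output plus the scaled low-rank correction.
y_adapter = theta_base @ x + (alpha / r) * (A @ (B @ x))
# Merged path: a single matrix product with no adapter branch.
y_merged = theta_merged @ x

print(np.allclose(y_adapter, y_merged))
```

Both paths produce the same output up to floating-point error, which is why the merged checkpoint can be loaded as a plain Whisper model.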
🔊 Audio Preprocessing & Augmentation
Decoding Pipeline
All audio files are decoded manually at 16 kHz using librosa.load(), converting raw bytes directly to float32 NumPy arrays. This approach bypasses the default torchcodec audio decoder (which exhibited compatibility issues with the installed PyTorch version) and ensures reproducible behaviour across environments.
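A minimal sketch of this decoding step, assuming the raw bytes hold a complete audio file such as a WAV. The function names `decode_audio_bytes` and `make_sine_wav_bytes` are hypothetical and do not come from the training code; the second one only builds a small in-memory WAV with the standard library so the decoder has something to run on.

```python
import io
import math
import struct
import wave

def make_sine_wav_bytes(sr=8000, seconds=0.5, freq=220.0):
    """Build a small mono 16-bit WAV in memory (stand-in for a dataset's raw bytes)."""
    n = int(sr * seconds)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / sr)))
        for i in range(n)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        w.setframerate(sr)
        w.writeframes(frames)
    return buf.getvalue()

def decode_audio_bytes(raw_bytes, target_sr=16000):
    """Decode raw audio bytes to a mono float32 array resampled to target_sr."""
    import librosa  # imported lazily so the helper above runs without it
    audio, sr = librosa.load(io.BytesIO(raw_bytes), sr=target_sr, mono=True)
    return audio, sr
```

With `librosa` installed, `audio, sr = decode_audio_bytes(make_sine_wav_bytes())` resamples the 8 kHz test tone to 16 kHz and returns it as a float32 NumPy array.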
Data Augmentation Strategy
To improve robustness to diverse recitation paces and tonal variations, two on-the-fly transformations were applied during training via the audiomentations library:
| Transform | Parameters | Probability | Purpose |
|---|---|---|---|
| PitchShift | ±3 semitones | 50% | Simulates pitch differences across male, female, and child reciters |
| TimeStretch | Rate: 0.85× – 1.15× | 40% | Adapts the model to handle both rapid (Hadr) and deliberate (Tarteel) recitation paces |
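The table maps onto an `audiomentations` pipeline roughly as follows. This is a sketch rather than the author's training code: it assumes the two transforms were combined with `Compose` and fire independently, and the helper names `build_augmenter` and `augmentation_mix` are hypothetical. The second helper only works out what share of training samples each combination would cover under that independence assumption.

```python
def build_augmenter():
    """Compose the two transforms from the table (requires `audiomentations`)."""
    from audiomentations import Compose, PitchShift, TimeStretch
    return Compose([
        PitchShift(min_semitones=-3, max_semitones=3, p=0.5),
        TimeStretch(min_rate=0.85, max_rate=1.15, p=0.4),
    ])

def augmentation_mix(p_pitch=0.5, p_stretch=0.4):
    """Share of samples per outcome if the two transforms fire independently."""
    return {
        "unchanged": (1 - p_pitch) * (1 - p_stretch),
        "pitch_only": p_pitch * (1 - p_stretch),
        "stretch_only": (1 - p_pitch) * p_stretch,
        "both": p_pitch * p_stretch,
    }

print(augmentation_mix())
```

Under these assumptions roughly 30% of samples pass through unchanged and about 20% receive both transforms. To augment a clip, call `build_augmenter()(samples=audio, sample_rate=16000)` on a float32 NumPy array.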
🚀 How to Run — Inference
Since this is a fully merged model, you do not need the peft library. Load it directly as a standard Whisper model.
1. Install Dependencies
```bash
pip install transformers torch librosa accelerate
```
2. Load and Transcribe
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig

# ── Configuration ────────────────────────────────────────────────
MODEL_ID = "aboalaa1472/whisper-quran-lora-v2"
AUDIO_PATH = "your_audio.wav"  # 16 kHz mono WAV recommended
# ─────────────────────────────────────────────────────────────────

# 1. Load processor and merged model
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Attach stable generation config for Whisper Turbo architectures
model.generation_config = GenerationConfig.from_pretrained("openai/whisper-large-v3-turbo")
model.generation_config.forced_decoder_ids = (
    processor.get_decoder_prompt_ids(language="arabic", task="transcribe")
)
model.eval()

# 3. Load and preprocess audio
audio_array, _ = librosa.load(AUDIO_PATH, sr=16000)
inputs = processor.feature_extractor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).to(model.device, dtype=torch.float16)

# 4. Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        max_new_tokens=444,
        num_beams=1,
    )

transcription = processor.tokenizer.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0].strip()
print("📖 Transcription:", transcription)
```
🏗️ Repository Structure
Unlike adapter repos, this repository contains the complete weight matrices and configuration files required for native execution:
```markdown
aboalaa1472/whisper-quran-lora-v2/
├── config.json               # Main model architectural configuration
├── model.safetensors         # Full fused model weights (~1.6 GB)
├── generation_config.json    # Text generation parameters
├── preprocessor_config.json  # Audio feature extractor settings (16 kHz, Mel banks)
├── tokenizer_config.json     # Tokenizer execution behaviours
├── tokenizer.json            # Vocabulary and subword token mappings
└── README.md                 # This file
```
📖 Citation
If you use this model in your research or project, please cite:
```bibtex
@misc{whisper-quran-lora-v2,
  author    = {aboalaa1472},
  title     = {Whisper Quranic ASR — Fully Merged Fine-Tuned Model (v2)},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/aboalaa1472/whisper-quran-lora-v2}
}
```
🤝 Acknowledgements
- OpenAI Whisper for the foundational architecture.
- Tarteel AI for the EveryAyah dataset.
- Hugging Face Transformers for the seamless deployment ecosystem.
- naazimsnh02 for the Quranic-adapted base model.