The model generates the Modern Standard Arabic translation autoregressively.
✨ Key Features
- Input: Egyptian Arabic slang/dialect
- Output: Modern Standard Arabic (MSA)
- Architecture: GPT-2 style decoder-only transformer
- Tokenizer: BPE tokenizer with 64k vocabulary
- Context length: 1024 tokens
- Language: Arabic
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Batch size | 8 (effective 32) |
| Learning rate | 5e-5 |
| Scheduler | Cosine |
| Warmup | 10% |
| Gradient clipping | 1.0 |
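The schedule in the table can be sketched in plain Python. This is a minimal illustration, not the project's training script: it assumes the effective batch size of 32 comes from gradient accumulation on a single device, that warmup is linear, and that the cosine decays to zero. The names `lr_at` and `ACCUMULATION_STEPS` are hypothetical.

```python
import math

PER_DEVICE_BATCH = 8
EFFECTIVE_BATCH = 32
# Assumption: a single device, so the gap is covered by gradient accumulation.
ACCUMULATION_STEPS = EFFECTIVE_BATCH // PER_DEVICE_BATCH

PEAK_LR = 5e-5
WARMUP_FRACTION = 0.10


def lr_at(step, total_steps):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at(step, total))
```

In a real run the same shape is available from `transformers.get_cosine_schedule_with_warmup`, and gradient clipping at 1.0 would be applied to the gradient norm before each optimizer step.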
🎛️ Inference Configuration
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-k | 50 |
| Top-p | 0.92 |
| Repetition penalty | 1.3 |
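To make these settings concrete, the sketch below shows roughly how they reshape a next-token distribution. It is a simplified pure-Python approximation of what sampling libraries do, not the library code itself, and the function names are hypothetical. It assumes the penalty is applied first, then temperature, then top-k, then top-p.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Make already-generated tokens less likely."""
    out = list(logits)
    for i in set(generated_ids):
        # Positive logits shrink, negative logits move further from zero.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.92):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[order[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(exps.values())
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for i in order:
        p = exps[i] / total
        kept[i] = p
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
    penalized = apply_repetition_penalty(toy_logits, generated_ids=[0])
    print(filter_distribution(penalized))
```

Lower temperature and smaller top-k/top-p make output more conservative; the repetition penalty discourages the model from looping on the same tokens.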
| Metric | Base AraGPT-2 | SlangGPT |
|---|---|---|
| chrF | 10.62 | 29.08 |
| BLEU | 0.02 | 6.63 |
| chrF Improvement | n/a | +18.46 (+174%) |
Metric Notes
- chrF measures character n-gram overlap.
- BLEU measures word n-gram precision.
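As an illustration of what character n-gram overlap means, here is a simplified sentence-level chrF-style score. It is a sketch only: it assumes n-gram orders 1 to 6, beta = 2, and whitespace removed before counting, and it will not reproduce the exact numbers of a reference implementation such as sacreBLEU. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("أريد أن آكل", "أريد الأكل"))
```

Because it works on characters rather than whole words, this kind of score gives partial credit for near-miss word forms, which is one reason chrF is often reported alongside BLEU for morphologically rich languages like Arabic.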
🚀 Usage
1. Install Dependencies
pip install transformers torch accelerate
2. Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 style tokenizers may ship without a pad token; reuse EOS if so.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Decoder-only models should be left-padded for generation.
tokenizer.padding_side = "left"

# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
model.eval()
3. Translation Function
def translate(egyptian_text):
    """Translate Egyptian Arabic text to MSA using the model's prompt format."""
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64
    )
    # Move input tensors to the same device as the model.
    inputs = {
        k: v.to(model.device)
        for k, v in inputs.items()
    }

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # The decoded text contains the prompt followed by the generated MSA.
    full = tokenizer.decode(
        outputs[0],
        skip_special_tokens=True
    )

    # Keep only what follows the last "msa:" marker.
    if "msa:" in full:
        return full.split("msa:")[-1].strip()
    return full
4. Example Usage
print(translate("يلا فين؟"))
print(translate("إنت رايح فين؟"))
print(translate("عايز اكل"))
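If you want to check the prompt format and output parsing without loading the model, the two string-handling steps can be pulled out into small helpers. This is an optional sketch: `build_prompt`, `extract_translation`, and `fake_generate` are hypothetical names, and the stub only imitates the shape of decoded model output.

```python
PROMPT_TEMPLATE = "dialect: {text} ↔ msa:"
MARKER = "msa:"


def build_prompt(egyptian_text):
    """Wrap the input in the dialect/msa prompt format used by translate()."""
    return PROMPT_TEMPLATE.format(text=egyptian_text.strip())


def extract_translation(decoded):
    """Return the text after the last 'msa:' marker, or the whole string if absent."""
    if MARKER in decoded:
        return decoded.split(MARKER)[-1].strip()
    return decoded.strip()


def fake_generate(prompt):
    """Stand-in for the model: echoes the prompt plus a fixed placeholder output."""
    return prompt + " <placeholder translation> "


if __name__ == "__main__":
    prompt = build_prompt("  عايز اكل ")
    print(prompt)
    print(extract_translation(fake_generate(prompt)))
```

Note that sampling is enabled in `translate`, so repeated calls with the same input can return different translations.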
🌐 Interactive Web App
Try the live demo here:
https://huggingface.co/spaces/AdhamAshraf/SlangGPT
The Space allows users to:
- Translate Egyptian Arabic to MSA
- Submit feedback
- Rate translation quality
- Help improve future versions of SlangGPT
📊 Training Dataset
SlangGPT was fine-tuned using:
AdhamAshraf/egyptian-2-arabic
Dataset statistics:
| Property | Value |
|---|---|
| Total samples | 18,250 |
| Format | Parallel Egyptian ↔ MSA |
| Train split | 80% |
| Validation split | 10% |
| Test split | 10% |
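For reference, an 80/10/10 split of 18,250 samples works out to 14,600 training, 1,825 validation, and 1,825 test examples. The sketch below shows one way such a split could be produced. The shuffling and the seed are assumptions; the source does not say how the published splits were drawn, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices and cut them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(18250)
    print(len(train), len(val), len(test))
```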
Preprocessing Steps
- Diacritic removal
- Punctuation normalization
- English text filtering
The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.
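The preprocessing steps listed above could look roughly like the following. This is a sketch under assumptions, not the pipeline actually used: it treats the Unicode range U+064B to U+0652 plus U+0670 as diacritics, interprets punctuation normalization as collapsing repeated marks and extra whitespace, and interprets English filtering as dropping any pair that contains Latin letters. All names are hypothetical.

```python
import re

# Assumption: Arabic short vowels, tanween, shadda, sukun, superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Assumption: runs of the same punctuation mark collapse to a single mark.
REPEATED_PUNCT = re.compile(r"([!?.,\u061F\u060C\u061B])\1+")
# Assumption: any Latin letter marks a pair as containing English text.
LATIN = re.compile(r"[A-Za-z]")


def clean(text):
    """Strip diacritics, collapse repeated punctuation, normalize whitespace."""
    text = DIACRITICS.sub("", text)
    text = REPEATED_PUNCT.sub(r"\1", text)
    return " ".join(text.split())


def preprocess_pairs(pairs):
    """Clean (dialect, msa) pairs and drop those containing Latin letters."""
    cleaned = []
    for dialect, msa in pairs:
        if LATIN.search(dialect) or LATIN.search(msa):
            continue
        cleaned.append((clean(dialect), clean(msa)))
    return cleaned


if __name__ == "__main__":
    sample = [("إنت رايح فين؟؟؟", "إلى أين أنت ذاهب؟"), ("ok يلا", "هيا")]
    print(preprocess_pairs(sample))
```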
🧪 Evaluation & Feedback
The model was evaluated using chrF and BLEU, as reported in the results table above.
User feedback collected through the Gradio Space is publicly stored in:
https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset
This feedback dataset supports:
- RLHF research
- Translation verification
- Reward model training
- Error analysis
📜 License
This project is released under the MIT License.
Free for academic and commercial use with attribution.
🙏 Acknowledgements
- AraGPT-2 by Antoun et al. (2021)
- Stanford CS224N framework and educational materials
- The Arabic NLP open-source community
📚 Citation
@software{slanggpt2026,
  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  title = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
  year = {2026},
  url = {https://github.com/adhamashraf7788/SlangGPT}
}
@dataset{egyptian_2_arabic,
  author = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
  title = {Egyptian Arabic Slang to Formal Arabic Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}
❓ Questions & Issues
For bugs, issues, or feature requests:
https://github.com/adhamashraf7788/SlangGPT/issues