Model Details
Model Description
This is a small GPT-2 model trained from scratch on Kannada text. It uses a custom BPE tokenizer also trained from scratch on the same data. The model can generate coherent Kannada text and produces useful representations for downstream tasks.
- Developed by: AbhiDS16
- Model type: GPT-2 (decoder-only transformer)
- Language: Kannada (kn)
- License: MIT
- Parameters: 31,626,240
- Context length: 512 tokens
- Vocabulary size: 12,000
- Trained from scratch: Yes (no pretrained initialization)
Model Sources
Uses
Direct Use
The model can be used for:
- Kannada text generation
- Extracting embeddings for downstream tasks (classification, clustering)
- Fine-tuning on task-specific Kannada datasets
- Studying low-resource language model training
Downstream Use
With the language model frozen, its embeddings reach 73.5% accuracy on Kannada sentiment classification using a simple logistic regression head, which suggests the learned representations transfer to downstream tasks.
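A minimal sketch of how such a frozen-embedding probe could be set up. The card does not say how sentence vectors were extracted, so the masked mean pooling over the last hidden layer, the `eos` fallback for padding, and the `train_texts` / `train_labels` arguments are all assumptions made for illustration, not the author's exact method.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "AbhiDS16/kannada-gpt2-32m"


def mean_pool(hidden, attention_mask):
    """Average token vectors over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


@torch.no_grad()
def embed(texts, tokenizer, model, batch_size=32):
    """Return one fixed-size vector per text; the LM weights are never updated."""
    model.eval()
    chunks = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        hidden = model(**enc).last_hidden_state  # (batch, seq, 512)
        chunks.append(mean_pool(hidden, enc["attention_mask"]))
    return torch.cat(chunks).numpy()


def train_probe(train_texts, train_labels):
    """Fit a logistic regression head on frozen LM features (hypothetical data)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None:
        # Assumption: fall back to EOS for padding if no pad token is defined.
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModel.from_pretrained(MODEL_ID)
    features = embed(train_texts, tokenizer, model)
    probe = LogisticRegression(max_iter=1000).fit(features, train_labels)
    return tokenizer, model, probe
```

Predictions on new text would then come from `probe.predict(embed(test_texts, tokenizer, model))`, where `test_texts` is again a placeholder for your own data.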
Out-of-Scope Use
- Chat/instruction-following (model is not instruction-tuned)
- Production systems requiring high factual accuracy
- Sensitive content generation without safeguards
Bias, Risks, and Limitations
- Small model size: 31.6M parameters limits factual knowledge and reasoning
- Repetition: Tends to repeat phrases in longer generations
- Training data bias: Web text (news, blogs) reflects biases and code-mixing present in online Kannada
- Not instruction-tuned: Raw causal LM, not suitable for chat/QA without fine-tuning
- Data recency: Training data from mC4 (2011–2022)
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its custom Kannada BPE tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("AbhiDS16/kannada-gpt2-32m")
tokenizer = AutoTokenizer.from_pretrained("AbhiDS16/kannada-gpt2-32m")

prompt = "ನಾನು ಇಂದು ಬೆಳಿಗ್ಗೆ"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation with temperature and nucleus (top-p) sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
CulturaX-Kn: 1.35M documents (~4 GB) of Kannada web text from mC4. After filtering (Kannada script ratio ≥ 60%, deduplication, and length filtering), 12.6M clean sentences remained and were used for training.
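The filtering steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual preprocessing script: the ratio is computed over non-whitespace characters against the Unicode Kannada block (U+0C80 to U+0CFF), deduplication is exact-match only, and the `min_chars` threshold is a hypothetical value, since the card does not give the length limits.

```python
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF  # Unicode Kannada block


def kannada_ratio(text):
    """Fraction of non-whitespace characters that are in the Kannada block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(KANNADA_START <= ord(c) <= KANNADA_END for c in chars)
    return in_block / len(chars)


def keep_document(text, min_ratio=0.60):
    """Apply the >= 60% Kannada script rule from the card."""
    return kannada_ratio(text) >= min_ratio


def clean(sentences, min_ratio=0.60, min_chars=10):
    """Yield sentences that pass length, exact-duplicate, and script filters.

    min_chars is a hypothetical threshold chosen for illustration.
    """
    seen = set()
    for s in sentences:
        s = s.strip()
        if len(s) < min_chars or s in seen or kannada_ratio(s) < min_ratio:
            continue
        seen.add(s)
        yield s
```

Counting punctuation and digits in the denominator makes the rule stricter for code-mixed or number-heavy text; excluding them would be an equally reasonable reading of "script ratio".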
Training Procedure
- Precision: fp16 mixed
- Batch size: 16 (effective 32 with gradient accumulation)
- Learning rate: 5e-4 with cosine decay and 1,000 step warmup
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight decay=0.01)
- Gradient clipping: 1.0
- Epochs: 3
- Total steps: 83,874
- Training tokens: ~463M
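For intuition, the learning-rate schedule listed above can be written as a small function. This sketch assumes linear warmup from zero, a cosine decay that reaches zero at the final step, and that the 83,874 total steps are optimizer steps; the card specifies only "cosine decay" and "1,000 step warmup", so the exact shape used in training may differ.

```python
import math

PEAK_LR = 5e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 83_874


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1_000, 42_437, 83_874):
        print(step, f"{lr_at(step):.2e}")
```

The rate peaks at 5e-4 at step 1,000 and falls to half of that roughly midway through the remaining steps.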
Speeds, Sizes, Times
- Hardware: NVIDIA RTX 5070 (8GB VRAM)
- Training time: 7 hours 16 minutes
- Model size on disk: ~126MB (safetensors)
- Throughput: ~3.2 steps/second
Evaluation
Perplexity
| Metric | Value |
|---|---|
| Validation loss | 3.4594 |
| Perplexity | 31.80 |
| Evaluation tokens | 4,626,944 |
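The two numbers are related: perplexity is the exponential of the mean per-token cross-entropy. A short check, assuming the validation loss is reported in nats (the default for PyTorch cross-entropy):

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity from mean per-token cross-entropy measured in nats."""
    return math.exp(mean_loss_nats)


val_loss = 3.4594
print(f"{perplexity(val_loss):.2f}")  # prints 31.80
```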
Sentiment Classification
| Metric | Value |
|---|---|
| Method | Frozen LM + Logistic Regression |
| Accuracy | 73.5% |
| F1 (macro) | 0.735 |
Tokenizer Efficiency
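Custom BPE tokenizer trained from scratch on Kannada text, compared with two multilingual tokenizers. Lower tokens/word is better; the last column is the relative reduction in tokens that our tokenizer achieves over each baseline:
| Tokenizer | Tokens/Word | Token reduction with our BPE |
|---|---|---|
| Our BPE | 1.91 | - |
| XLM-R | 2.43 | 21.5% |
| mBERT | 4.00 | 52.2% |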
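The comparison can be reproduced along these lines. The sketch assumes words are whitespace-separated and that the improvement figure is the relative reduction `(baseline - ours) / baseline`; `tokenize` stands for any callable that maps a string to a list of tokens (for example a Hugging Face tokenizer's `tokenize` method), and the evaluation text itself is not specified in the card.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def relative_reduction(ours, baseline):
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return (baseline - ours) / baseline


if __name__ == "__main__":
    for name, baseline in [("XLM-R", 2.43), ("mBERT", 4.00)]:
        print(f"{name}: {relative_reduction(1.91, baseline):.3f}")
```

With the rounded table values this gives roughly 0.214 for XLM-R and 0.5225 for mBERT; the small gap to the reported 21.5% is presumably because the card's figures were computed from unrounded averages.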
Environmental Impact
- Hardware: NVIDIA RTX 5070 (125W TDP under load)
- Hours used: ~7.3 hours
- Estimated carbon: ~0.35 kg CO2eq (assuming 0.4 kg/kWh grid average)
- Cloud provider: N/A (local desktop)
Technical Specifications
Model Architecture
- 8 transformer layers
- 512 hidden dimension
- 8 attention heads
- 2,048 feed-forward dimension
- GELU activation
- 0.1 dropout
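These dimensions account for the reported parameter count exactly. The tally below assumes the standard GPT-2 layout: learned position embeddings, biases on every linear layer, two LayerNorms per block plus a final one, and an output head tied to the token embeddings (so it adds no parameters). The vocabulary size and context length are taken from the Model Description.

```python
VOCAB, CTX, D, LAYERS, FF = 12_000, 512, 512, 8, 2_048


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


embeddings = VOCAB * D + CTX * D  # token + learned position embeddings
per_layer = (
    layer_norm(D) + linear(D, 3 * D) + linear(D, D)   # attention: QKV + output proj
    + layer_norm(D) + linear(D, FF) + linear(FF, D)   # feed-forward block
)
total = embeddings + LAYERS * per_layer + layer_norm(D)  # + final LayerNorm

print(f"{total:,}")                  # prints 31,626,240
print(f"{total * 4 / 1e6:.1f} MB")   # prints 126.5 MB (fp32, 4 bytes per parameter)
```

The second figure lines up with the ~126MB safetensors size, which suggests the checkpoint is stored in fp32.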
Compute Infrastructure
- GPU: NVIDIA RTX 5070 (8GB VRAM)
- CPU: Intel Core Ultra 9 285H
- RAM: 32GB
Software
- Python 3.10
- PyTorch 2.10
- Transformers 4.x
- Datasets 3.x
- Tokenizers 0.19
Citation
@misc{kannada-gpt2-32m,
  author = {AbhiDS16},
  title = {Kannada GPT-2 Small: A From-Scratch Language Model for Kannada},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/AbhiDS16/kannada-gpt2-32m}},
  note = {Trained entirely from scratch with custom BPE tokenizer}
}
For questions or problems, open an issue on GitHub: https://github.com/thorOdinson16/KanLM