kannada-gpt2-32m

Model Details

Model Description

This is a small GPT-2 model trained from scratch on Kannada text, with a custom BPE tokenizer trained from scratch on the same data. It generates coherent Kannada text, and its frozen representations can be reused for downstream tasks such as classification.

Developed by: AbhiDS16
Model type: GPT-2 (decoder-only transformer)
Language: Kannada (kn)
License: MIT
Parameters: 31,626,240
Context length: 512 tokens
Vocabulary size: 12,000
Trained from scratch: Yes (no pretrained initialization)

Model Sources

Repository: https://github.com/thorOdinson16/KanLM
Demo: None hosted; use the code under "How to Get Started with the Model" below

Uses

Direct Use

The model can be used for:

Kannada text generation
Extracting embeddings for downstream tasks (classification, clustering)
Fine-tuning on task-specific Kannada datasets
Studying low-resource language model training

Downstream Use

With a simple logistic regression head on top of the model's frozen embeddings, Kannada sentiment classification reaches 73.5% accuracy, which suggests that the representations transfer to downstream tasks.

Out-of-Scope Use

Chat/instruction-following (model is not instruction-tuned)
Production systems requiring high factual accuracy
Sensitive content generation without safeguards

Bias, Risks, and Limitations

Small model size: 31.6M parameters limit factual knowledge and reasoning
Repetition: Tends to repeat phrases in longer generations
Training data bias: Web text (news, blogs) reflects the biases and code-mixing present in online Kannada
Not instruction-tuned: Raw causal LM, not suitable for chat/QA without fine-tuning
Data recency: Training data comes from mC4 (2011–2022)

How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its custom Kannada BPE tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("AbhiDS16/kannada-gpt2-32m")
tokenizer = AutoTokenizer.from_pretrained("AbhiDS16/kannada-gpt2-32m")

# Kannada prompt, roughly "This morning I ..."
prompt = "ನಾನು ಇಂದು ಬೆಳಿಗ್ಗೆ"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 80 new tokens with temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Training Details

Training Data

CulturaX-Kn: 1.35M documents (~4 GB) of Kannada web text from mC4. After filtering (Kannada script ratio ≥ 60%, deduplication, length filtering), 12.6M clean sentences were used for training.
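
The script-ratio filter can be illustrated with a minimal sketch. It assumes the ratio is the share of non-whitespace characters that fall in the Unicode Kannada block (U+0C80 to U+0CFF); the card does not say how the ratio was actually computed, and the function names below are hypothetical.

```python
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF  # Unicode Kannada block


def kannada_ratio(text: str) -> float:
    """Share of non-whitespace characters that are in the Kannada block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    kannada = sum(KANNADA_START <= ord(ch) <= KANNADA_END for ch in chars)
    return kannada / len(chars)


def passes_script_filter(text: str, threshold: float = 0.60) -> bool:
    """Keep text whose Kannada ratio meets the 60% threshold from the card."""
    return kannada_ratio(text) >= threshold


print(passes_script_filter("ನಾನು ಇಂದು ಬೆಳಿಗ್ಗೆ"))    # True: every character is Kannada
print(passes_script_filter("Breaking news: ನಾನು"))  # False: mostly Latin script
```

Counting punctuation and digits in the denominator is a design choice; a filter that ignores them would keep slightly more text.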

Training Procedure

Precision: fp16 mixed precision
Batch size: 16 (effective 32 with 2 gradient accumulation steps)
Learning rate: 5e-4 peak, with cosine decay and a 1,000-step warmup
Optimizer: AdamW (β₁=0.9, β₂=0.95, weight decay=0.01)
Gradient clipping: 1.0
Epochs: 3
Total steps: 83,874
Training tokens: ~463M
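
The learning-rate schedule above can be sketched as a plain function of the step number. This is an illustration under assumptions the card does not confirm: the warmup is linear from zero, and the cosine decay reaches zero at the final step (the same shape as `get_cosine_schedule_with_warmup` in Transformers). The function name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 5e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 83_874


def lr_at(step: int) -> float:
    """Learning rate at a given step (assumed: linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Start, mid-warmup, peak, midpoint of the decay, and final step.
for step in (0, 500, 1_000, 42_437, 83_874):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 5e-4 at step 1,000 and is half of that at the midpoint of the decay phase.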

Speeds, Sizes, Times

Hardware: NVIDIA RTX 5070 (8GB VRAM)
Training time: 7 hours 16 minutes
Model size on disk: ~126MB (safetensors)
Throughput: ~3.2 steps/second
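
As a quick consistency check, the throughput and on-disk size can be re-derived from the other numbers in this card. The only assumption is that the checkpoint stores weights as 32-bit floats (4 bytes per parameter), which the card does not state.

```python
TOTAL_STEPS = 83_874
TRAIN_SECONDS = 7 * 3600 + 16 * 60  # 7 hours 16 minutes
PARAMETERS = 31_626_240
BYTES_PER_PARAM = 4                 # assumption: fp32 weights in the checkpoint

steps_per_second = TOTAL_STEPS / TRAIN_SECONDS
size_mb = PARAMETERS * BYTES_PER_PARAM / 1e6

print(f"{steps_per_second:.2f} steps/s")  # 3.21 steps/s
print(f"{size_mb:.1f} MB")                # 126.5 MB
```

Both values agree with the reported ~3.2 steps/second and ~126MB.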

Evaluation

Perplexity

Metric	Value
Validation loss	3.4594
Perplexity	31.80
Evaluation tokens	4,626,944
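
Perplexity is the exponential of the mean per-token cross-entropy loss, so the two values in the table can be checked against each other. The sketch assumes the validation loss is reported in nats (natural logarithm), which is the PyTorch default.

```python
import math

validation_loss = 3.4594  # mean cross-entropy per token, assumed to be in nats
perplexity = math.exp(validation_loss)
print(f"{perplexity:.2f}")  # 31.80
```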

Sentiment Classification

Metric	Value
Method	Frozen LM + Logistic Regression
Accuracy	73.5%
F1 (macro)	0.735

Tokenizer Efficiency

Custom BPE tokenizer trained from scratch on Kannada text:

Tokenizer	Tokens/Word	Reduction with our BPE
Our BPE	1.91	n/a
XLM-R	2.43	21.5%
mBERT	4.00	52.2%
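
The sketch below shows one way such a comparison could be computed. It assumes tokens/word means total tokens divided by total whitespace-separated words, and that the percentage is the relative reduction in tokens/word against the other tokenizer; the card does not spell out either definition. The helper names and the toy tokenizer are hypothetical stand-ins.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Total tokens divided by total whitespace-separated words."""
    n_tokens = n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words if n_words else 0.0


def relative_reduction(ours: float, other: float) -> float:
    """Percent fewer tokens per word than the other tokenizer."""
    return 100.0 * (other - ours) / other


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits each word into 2-character pieces."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


print(tokens_per_word(toy_tokenize, ["abcd ef", "ghijkl"]))  # 2.0
print(relative_reduction(1.91, 2.43))  # reduction vs XLM-R, in percent
print(relative_reduction(1.91, 4.00))  # reduction vs mBERT, in percent
```

For a real measurement, `tokenize` could be the `tokenize` method of a loaded Hugging Face tokenizer. Computed from the rounded table values, the reductions come out at about 21.4% and 52.25%; the small gap to the reported 21.5% is presumably due to rounding of the tokens/word figures.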

Environmental Impact

Hardware: NVIDIA RTX 5070 (125W TDP under load)
Hours used: ~7.3 hours
Estimated carbon: ~0.35 kg CO2eq (assuming 0.4 kg/kWh grid average)
Cloud provider: N/A (local desktop)
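
The carbon estimate follows from power, time, and grid intensity. The sketch below redoes the arithmetic, assuming the GPU drew its full 125 W for the entire run and ignoring CPU and other system power; both are simplifications, not measurements.

```python
POWER_KW = 0.125       # 125 W, assumed constant draw for the whole run
HOURS = 7.3
GRID_KG_PER_KWH = 0.4  # grid-average factor stated in the card

energy_kwh = POWER_KW * HOURS
carbon_kg = energy_kwh * GRID_KG_PER_KWH
print(f"energy: {energy_kwh:.4f} kWh, carbon: {carbon_kg:.4f} kg CO2eq")
```

This gives about 0.91 kWh and roughly 0.365 kg CO2eq, in line with the ~0.35 kg reported; the small difference may reflect rounding or an average draw slightly below 125 W.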

Technical Specifications

Model Architecture

8 transformer layers
512 hidden dimension
8 attention heads
2,048 feed-forward dimension
GELU activation
0.1 dropout
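
The parameter count can be reproduced from these dimensions. The sketch assumes the standard GPT-2 layout: learned position embeddings, biases on all linear layers, two LayerNorms per block plus a final one, and an output head tied to the token embeddings. The card does not list these details, but the total matches the reported 31,626,240 under these assumptions.

```python
VOCAB, CONTEXT, HIDDEN, FFN, LAYERS = 12_000, 512, 512, 2_048, 8

# Token embeddings plus learned position embeddings.
embeddings = VOCAB * HIDDEN + CONTEXT * HIDDEN

# Attention: fused Q/K/V projection and output projection, each with bias.
attention = (HIDDEN * 3 * HIDDEN + 3 * HIDDEN) + (HIDDEN * HIDDEN + HIDDEN)
# Feed-forward: up-projection and down-projection, each with bias.
mlp = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
# Two LayerNorms per block, each with weight and bias.
layer_norms = 2 * (2 * HIDDEN)
per_layer = attention + mlp + layer_norms

final_norm = 2 * HIDDEN
# No separate LM head term: it is assumed tied to the token embeddings.
total = embeddings + LAYERS * per_layer + final_norm

print(f"{per_layer:,} per layer")  # 3,152,384 per layer
print(f"{total:,} total")          # 31,626,240 total
```

The number of attention heads does not change the count, since the heads only partition the 512-dimensional projections.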

Compute Infrastructure

GPU: NVIDIA RTX 5070 (8GB VRAM)
CPU: Intel Core Ultra 9 285H
RAM: 32GB

Software

Python 3.10
PyTorch 2.10
Transformers 4.x
Datasets 3.x
Tokenizers 0.19

Citation

```bibtex
@misc{kannada-gpt2-32m,
  author = {AbhiDS16},
  title = {Kannada GPT-2 Small: A From-Scratch Language Model for Kannada},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/AbhiDS16/kannada-gpt2-32m}},
  note = {Trained entirely from scratch with custom BPE tokenizer}
}
```

Model Card Contact

Open an issue on GitHub: https://github.com/thorOdinson16/KanLM
