Model Architecture
The model is based on the LlamaForCausalLM architecture with the following configuration:
| Parameter | Value |
|---|---|
| Total Parameters | 124,000,000 |
| Vocabulary Size | 32,000 |
| Hidden Size | 768 |
| Intermediate Size | 2,048 |
| Number of Hidden Layers | 12 |
| Number of Attention Heads | 12 |
| Number of Key-Value Heads | 12 (MHA, no GQA) |
| Max Sequence Length | 1,024 |
| RMS Norm Epsilon | 1e-5 |
| Tie Word Embeddings | False |
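The values in the table can be turned into a rough parameter estimate. The sketch below assumes the standard Llama layer layout (no bias terms, a gated MLP with three projections, two RMSNorm weight vectors per layer plus a final norm). None of these details are stated above, and `llama_param_count` is an illustrative helper, not part of any library. Under these assumptions the estimate comes to about 134.1M, which is higher than the 124,000,000 listed, so the listed total may follow a different counting convention, or one of the assumptions may not hold for this checkpoint.

```python
# Rough parameter estimate for a Llama-style decoder.
# Assumptions (not confirmed by the source): no bias terms, gated MLP with
# gate/up/down projections, two RMSNorm weights per layer, one final norm.

def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads, tie_embeddings):
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim
    # Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    norms = 2 * hidden_size
    per_layer = attn + mlp + norms

    embed = vocab_size * hidden_size
    # An untied output head adds a second vocab x hidden matrix.
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


estimate = llama_param_count(
    vocab_size=32_000, hidden_size=768, intermediate_size=2_048,
    num_layers=12, num_heads=12, num_kv_heads=12, tie_embeddings=False,
)
print(f"{estimate:,}")
```

With 12 key-value heads equal to the 12 attention heads, the K and V projections are full size, which is what the "MHA, no GQA" note in the table implies.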
Training Configuration
Training used the Hugging Face Trainer API with PyTorch/XLA (PJRT runtime) on TPU hardware.
| Hyperparameter | Value |
|---|---|
| Hardware | Kaggle TPU v5e-8 (8 accelerators) |
| Precision | bfloat16 (XLA_USE_BF16="1") |
| Optimizer | AdamW (adamw_torch_xla) |
| Learning Rate | 6e-4 (Peak) |
| LR Scheduler | Cosine |
| Warmup Steps | 1,000 |
| Weight Decay | 0.1 |
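The learning-rate settings above can be sketched as a plain function. This is a minimal sketch that assumes linear warmup from zero followed by a half-cosine decay to zero, which is the shape produced by `get_cosine_schedule_with_warmup` in `transformers`; the exact scheduler arguments used for this run are not stated, and `lr_at` is an illustrative name. It uses the 37,920 total steps reported in the Dataset section.

```python
import math

PEAK_LR = 6e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 37_920  # total training steps reported for this run


def lr_at(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate after `step` scheduler updates (assumed schedule shape)."""
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup) / max(1, total - warmup)
    # Half-cosine decay from the peak down to 0.
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 500, 1_000, 19_460, 37_919):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions, the value at step 37,919 (the last update before the schedule reaches zero) is about 1.086e-12, which agrees with the final learning rate reported under Training Metrics & Results.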
Dataset
- Dataset Repository: rijuludar/fwedu-llama32k
- Format: Raw binary uint16 token ID files (.bin)
- Total Tokens Processed: 9,940,492,288 (~9.94 Billion)
- Total Training Steps: 37,920
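Because the vocabulary has 32,000 entries, every token ID fits in an unsigned 16-bit integer (maximum 65,535), which is why a raw uint16 layout works. The sketch below shows one way such a file could be read and split into 1,024-token sequences using only the standard library. The little-endian byte order, the file path, and the helper names `read_token_ids` and `chunk` are assumptions for illustration; the actual data loader used for training is not described in this card.

```python
import array
import sys


def read_token_ids(path):
    """Load a raw uint16 token file into a list of ints."""
    tokens = array.array("H")  # "H" = unsigned 16-bit integer
    with open(path, "rb") as f:
        tokens.frombytes(f.read())
    if sys.byteorder == "big":
        # Assumption: the files are little-endian on disk.
        tokens.byteswap()
    return tokens.tolist()


def chunk(tokens, seq_len=1024):
    """Split a token stream into full, non-overlapping sequences."""
    usable = len(tokens) - len(tokens) % seq_len  # drop the trailing partial chunk
    return [tokens[i:i + seq_len] for i in range(0, usable, seq_len)]


# Hypothetical usage; "shard_000.bin" is a placeholder file name.
# sequences = chunk(read_token_ids("shard_000.bin"))
```

This reads the whole file into memory, which is fine for a quick inspection; for multi-gigabyte shards a memory-mapped reader would be a more practical choice.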
Training Metrics & Results
Training stabilized quickly after the warmup period. A minor loss spike was observed near step 22,500, after training was resumed following the first session, but the loss recovered within 500 steps.
- Initial Loss (Step 1): 10.5078
- Final Training Loss (Step 37,920): 3.0906
- Final Evaluation Loss (Step 37,920): 3.1370
- Final Gradient Norm: 0.0534
- Final Learning Rate: 1.086e-12
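The loss values above can be read more intuitively as per-token perplexities. The sketch below assumes the reported losses are mean per-token cross-entropy in nats, which is the usual convention for PyTorch cross-entropy but is not stated explicitly here. It also prints the loss of a uniform guess over the 32,000-token vocabulary as a reference point for the initial loss.

```python
import math

VOCAB_SIZE = 32_000


def perplexity(loss_nats):
    """Per-token perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)

print(f"uniform baseline loss: {uniform_loss:.4f}")
print(f"final train perplexity: {perplexity(3.0906):.2f}")
print(f"final eval perplexity:  {perplexity(3.1370):.2f}")
```

Under that assumption the final losses correspond to per-token perplexities of roughly 22 (training) and 23 (evaluation), and the initial loss of 10.5078 sits close to the uniform baseline of about 10.37. These are token-level figures and are not directly comparable with word-level perplexity scores.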
Visualizations
1. Training and Evaluation Loss

2. Learning Rate Schedule

3. Gradient Norm over Steps

Generation Samples
Below are raw, uncurated output samples generated by the model. As a 124M parameter model trained on roughly 10 billion tokens, it has successfully learned basic English grammar, syntax, and pattern continuation. However, it lacks advanced reasoning capabilities and frequently hallucinates factual information.
Sample 1: Basic Continuation
Prompt: The sun rises in the east and sets in the
Model: west. Its brightness is 94% of its magnitude, but it is always at a high level. In fact it’s pretty close to the equator with just one third of the Earth's diameter per square meter (2071 km). It is the longest known planet on earth that we see anywhere from about 6 billion years ago...
Sample 2: Pattern Recognition (with factual hallucination)
Prompt: The capital of France is Paris. The capital of Germany is Berlin. The capital of Japan is
Model: Moscow, the city of Tokyo and in the 19th century was a part of Poland. In addition to its wealth it also has an impressive history that dates back thousands of years...
Sample 3: Creative / Storytelling
Prompt: Once upon a time in a galaxy made of glass, there lived a creature who ate light. Every morning, the creature would
Model: be looking down at its head and thinking about what it looked like after they had finished reading to him, but not that much. That was how he heard his voice saying: "Oh, you are all going to eat." It's said that when I was an adult on my phone from 1930-20th century school, some people thought we were getting darker than others think...
Sample 4: Code/Markup Structure
Prompt: <html>\n<head>\n<title>My Website</title>
Model: The following HTML tags are used to indicate the name of this article. If you want to delete it, please click on them and press OK in the mouse button...
Benchmarks & Evaluation
To empirically evaluate the model's factual knowledge, common-sense reasoning, and linguistic fluency, we benchmarked slm-124M-llama32k-fwedu using EleutherAI's lm-evaluation-harness.
We compared our model against established, similarly sized open-source base models in the 120M–160M parameter range:
- HuggingFaceTB/SmolLM-135M (Hugging Face's first-generation 135M base model)
- HuggingFaceTB/SmolLM2-135M (Hugging Face's highly optimized second-generation 135M base model)
- openai-community/gpt2 (the classic 124M-parameter pre-trained base model)
- facebook/opt-125m (Meta's 125M-parameter pre-trained base model)
- EleutherAI/pythia-160m (EleutherAI's 160M-parameter pre-trained base model)
Unified Comparison Summary (0-Shot)
| Model Name | Parameters | HellaSwag (acc_norm) | SciQ (acc_norm) | WikiText-2 (word_perplexity) |
|---|---|---|---|---|
| slm-124M-llama32k-fwedu (this model) | 124M | 27.51% | 59.70% | 88.01 |
| openai-community/gpt2 | 124M | 31.14% | 64.40% | 37.37 |
- HellaSwag and SciQ: higher is better.
- WikiText-2: lower is better.
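A zero-shot run like the one summarized above could be reproduced through the harness's Python entry point, `lm_eval.simple_evaluate`. The sketch below is an assumption about how such a run might be set up, not the exact command used for these numbers: the task names `hellaswag`, `sciq`, and `wikitext` are the harness's identifiers for these benchmarks, while the batch size and the helper name `run_zero_shot` are illustrative choices.

```python
# Tasks matching the columns of the comparison table.
TASKS = ["hellaswag", "sciq", "wikitext"]


def run_zero_shot(model_id, tasks=TASKS, batch_size=8):
    """Evaluate a Hugging Face causal LM on the given tasks with 0 few-shot examples."""
    # Imported lazily so this file can be loaded without lm-eval installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                            # Hugging Face transformers backend
        model_args=f"pretrained={model_id}",
        tasks=list(tasks),
        num_fewshot=0,
        batch_size=batch_size,                 # assumed value; adjust to fit memory
    )
    # Per-task metrics live under the "results" key.
    return results["results"]


# Example call (downloads the model and the benchmark datasets):
# scores = run_zero_shot("rijuludar/slm-124M-llama32k-fwedu")
# for task, metrics in scores.items():
#     print(task, metrics)
```

Scores can shift slightly between harness versions, so comparisons between models are most reliable when every model is evaluated with the same installation.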
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub.
model_id = "rijuludar/slm-124M-llama32k-fwedu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize a prompt and generate up to 50 new tokens.
prompt = "The process of photosynthesis involves"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode the prompt plus continuation back to text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```