rijuludar

slm-124M-llama32k-fwedu

Model Architecture

The model is based on the LlamaForCausalLM architecture with the following configuration:

| Parameter | Value |
|---|---|
| Total Parameters | 124,000,000 |
| Vocabulary Size | 32,000 |
| Hidden Size | 768 |
| Intermediate Size | 2,048 |
| Number of Hidden Layers | 12 |
| Number of Attention Heads | 12 |
| Number of Key-Value Heads | 12 (MHA, no GQA) |
| Max Sequence Length | 1,024 |
| RMS Norm Epsilon | 1e-5 |
| Tie Word Embeddings | False |
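
As a rough cross-check of the configuration above, the parameter count of a Llama-style decoder can be estimated from these values. This is a minimal sketch, not the author's accounting: it assumes bias-free linear layers, a gated MLP with three projections, and two RMSNorm weights per layer, and the helper name `llama_param_count` is hypothetical. With the table's values it gives about 134.1M parameters (about 109.5M if the embeddings were tied), so the listed total of 124,000,000 is best read as approximate, or one of these assumptions does not match the actual checkpoint.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads,
                      tie_word_embeddings):
    """Estimate parameters of a Llama-style decoder (assumes no bias terms)."""
    head_dim = hidden_size // num_heads
    # Attention: Q and O are hidden x hidden; K and V scale with the KV head count.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * (num_kv_heads * head_dim)
    # Gated MLP (assumption): gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer (assumption).
    norms = 2 * hidden_size
    per_layer = attn + mlp + norms

    embed = vocab_size * hidden_size
    # An untied output head adds a second vocab x hidden matrix.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


untied_total = llama_param_count(32_000, 768, 2_048, 12, 12, 12, False)
tied_total = llama_param_count(32_000, 768, 2_048, 12, 12, 12, True)
print(f"untied: {untied_total:,}  tied: {tied_total:,}")
```

The exact figure for the published weights can be read from the loaded model with `sum(p.numel() for p in model.parameters())`.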

Training Configuration

Training was conducted using the Hugging Face Trainer API optimized for PyTorch XLA (PJRT) on TPU hardware.

| Hyperparameter | Value |
|---|---|
| Hardware | Kaggle TPU v5e-8 (8 accelerators) |
| Precision | bfloat16 (`XLA_USE_BF16="1"`) |
| Optimizer | AdamW (`adamw_torch_xla`) |
| Learning Rate | 6e-4 (Peak) |
| LR Scheduler | Cosine |
| Warmup Steps | 1,000 |
| Weight Decay | 0.1 |
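
The learning-rate schedule implied by these hyperparameters can be sketched in a few lines. The function below is an illustrative reimplementation of linear warmup followed by cosine decay, written to mirror the usual formula rather than copied from the training code; `cosine_lr` is a hypothetical name, and the total of 37,920 steps is taken from the Dataset section.

```python
import math


def cosine_lr(step, peak_lr=6e-4, warmup_steps=1_000, total_steps=37_920):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for s in (0, 500, 1_000, 19_460, 37_919):
    print(s, f"{cosine_lr(s):.4e}")
```

Evaluated at step 37,919, this formula gives roughly 1.086e-12, which agrees with the final learning rate reported under Training Metrics & Results if the last logged value corresponds to that scheduler step (an assumption).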

Dataset

Dataset Repository: rijuludar/fwedu-llama32k
Format: Raw binary uint16 token ID files (.bin)
Total Tokens Processed: 9,940,492,288 (~9.94 Billion)
Total Training Steps: 37,920
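
Since the vocabulary has 32,000 entries, every token ID fits in an unsigned 16-bit integer, which is why `uint16` storage works here. The sketch below shows one way such a file could be read with the standard library. The layout is an assumption (a flat little-endian stream with no header), the helper names are hypothetical, and the demo file is synthetic rather than part of the real dataset.

```python
import os
import struct
import sys
import tempfile
from array import array


def read_token_file(path):
    """Read a headerless little-endian uint16 token file (assumed layout)."""
    tokens = array("H")  # unsigned 16-bit integers
    with open(path, "rb") as f:
        tokens.frombytes(f.read())
    if sys.byteorder == "big":
        tokens.byteswap()  # file is assumed little-endian
    return tokens


def iter_sequences(tokens, seq_len=1024):
    """Yield consecutive non-overlapping windows; a short tail is dropped."""
    for start in range(0, len(tokens) - seq_len + 1, seq_len):
        yield tokens[start:start + seq_len]


# Round trip on a tiny synthetic file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<6H", 1, 2, 3, 4, 5, 31_999))
    demo_tokens = read_token_file(demo_path)
    demo_windows = [list(w) for w in iter_sequences(demo_tokens, seq_len=2)]

# Average tokens consumed per optimizer step, from the totals above.
avg_tokens_per_step = 9_940_492_288 / 37_920
print(demo_windows, round(avg_tokens_per_step))
```

The average comes to about 262,144 tokens per step, which would correspond to 256 sequences of 1,024 tokens if sequences were packed to full length; that batch shape is an inference from the totals, not something the card states.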

Training Metrics & Results

Training stabilized quickly after the warmup period. A minor loss spike appeared near step 22,500, after training resumed from the first session, but the loss recovered within 500 steps.

Initial Loss (Step 1): 10.5078
Final Training Loss (Step 37,920): 3.0906
Final Evaluation Loss (Step 37,920): 3.1370
Final Gradient Norm: 0.0534
Final Learning Rate: 1.086e-12
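
To put the loss values in context, they can be converted to token-level perplexity and compared with the loss of a uniform guess over the vocabulary. This sketch assumes the reported losses are mean cross-entropy in nats per token, which is the PyTorch default but is not stated explicitly here.

```python
import math


def perplexity(loss_nats):
    """Token-level perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


# Loss of a model that assigns equal probability to all 32,000 tokens.
uniform_loss = math.log(32_000)

train_ppl = perplexity(3.0906)
eval_ppl = perplexity(3.1370)
print(f"uniform loss: {uniform_loss:.4f}")
print(f"train ppl: {train_ppl:.2f}  eval ppl: {eval_ppl:.2f}")
```

Under that assumption, the initial loss of 10.5078 sits close to the uniform baseline of about 10.37, and the final evaluation loss corresponds to a token-level perplexity of about 23. This figure is per token, so it is not directly comparable to a word-level perplexity.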

Visualizations

1. Training and Evaluation Loss

2. Learning Rate Schedule

3. Gradient Norm over Steps

Generation Samples

Below are raw, uncurated output samples generated by the model. As a 124M-parameter model trained on roughly 10 billion tokens, it has learned basic English grammar, syntax, and pattern continuation. However, it lacks advanced reasoning capabilities and frequently hallucinates facts.

Sample 1: Basic Continuation

Prompt: The sun rises in the east and sets in the

Model: west. Its brightness is 94% of its magnitude, but it is always at a high level. In fact it’s pretty close to the equator with just one third of the Earth's diameter per square meter (2071 km). It is the longest known planet on earth that we see anywhere from about 6 billion years ago...

Sample 2: Pattern Recognition (with factual hallucination)

Prompt: The capital of France is Paris. The capital of Germany is Berlin. The capital of Japan is

Model: Moscow, the city of Tokyo and in the 19th century was a part of Poland. In addition to its wealth it also has an impressive history that dates back thousands of years...

Sample 3: Creative / Storytelling

Prompt: Once upon a time in a galaxy made of glass, there lived a creature who ate light. Every morning, the creature would

Model: be looking down at its head and thinking about what it looked like after they had finished reading to him, but not that much. That was how he heard his voice saying: "Oh, you are all going to eat." It's said that when I was an adult on my phone from 1930-20th century school, some people thought we were getting darker than others think...

Sample 4: Code/Markup Structure

Prompt: <html>\n<head>\n<title>My Website</title>

Model: The following HTML tags are used to indicate the name of this article. If you want to delete it, please click on them and press OK in the mouse button...

Benchmarks & Evaluation

To empirically evaluate the model's factual knowledge, common-sense reasoning, and linguistic fluency, we benchmarked slm-124M-llama32k-fwedu using EleutherAI's lm-evaluation-harness.

We compared our model against established, similarly sized open-source base models in the 120M–160M parameter range:

HuggingFaceTB/SmolLM-135M (Hugging Face's first-generation 135M base model).
HuggingFaceTB/SmolLM2-135M (Hugging Face's highly optimized second-generation 135M base model).
openai-community/gpt2 (The classic 124M parameter pre-trained base model).
facebook/opt-125m (Meta's 125M parameter pre-trained base model).
EleutherAI/pythia-160m (EleutherAI's 160M parameter pre-trained base model).

Unified Comparison Summary (0-Shot)

| Model Name | Parameters | HellaSwag (`acc_norm`) | SciQ (`acc_norm`) | WikiText-2 (`word_perplexity`) |
|---|---|---|---|---|
| slm-124M-llama32k-fwedu (this model) | 124M | 27.51% | 59.70% | 88.01 |
| openai-community/gpt2 | 124M | 31.14% | 64.40% | 37.37 |

For HellaSwag and SciQ, higher is better. For WikiText-2, lower is better.
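
A 0-shot run of this kind can be approximated through the harness's Python entry point. The exact invocation used for the numbers above is not given, so the task names (`hellaswag`, `sciq`, `wikitext`) and the default settings below are assumptions, and `build_eval_kwargs` and `run_eval` are hypothetical helper names. The sketch imports `lm_eval` lazily so the configuration can be inspected without the package installed.

```python
def build_eval_kwargs(model_id="rijuludar/slm-124M-llama32k-fwedu"):
    """Arguments for a 0-shot run; task names are assumed, not confirmed."""
    return {
        "model": "hf",                        # Hugging Face transformers backend
        "model_args": f"pretrained={model_id}",
        "tasks": ["hellaswag", "sciq", "wikitext"],
        "num_fewshot": 0,
    }


def run_eval(model_id="rijuludar/slm-124M-llama32k-fwedu"):
    """Run the evaluation; requires lm-evaluation-harness and network access."""
    import lm_eval  # imported here so the helper above works without it

    results = lm_eval.simple_evaluate(**build_eval_kwargs(model_id))
    # Per-task metrics are stored under the "results" key, keyed by task name.
    return results["results"]
```

Calling `run_eval()` downloads the model and the benchmark datasets, so it is best run on a machine with a GPU. Passing another model ID, such as `openai-community/gpt2`, gives a comparison under identical settings.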

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "rijuludar/slm-124M-llama32k-fwedu"

# Download (or load from cache) the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize the prompt into PyTorch tensors.
prompt = "The process of photosynthesis involves"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 50 new tokens and decode the full sequence back to text.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
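
Small base models often repeat themselves under default decoding, so sampling settings are worth experimenting with. The variant below is a sketch only: the decoding parameters are illustrative values, not the settings used to produce the samples on this card (which are not stated), and `sampling_kwargs` and `generate_sample` are hypothetical helper names.

```python
def sampling_kwargs(max_new_tokens=50):
    """Illustrative decoding settings; tune these for your own use case."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,          # sample instead of always taking the argmax
        "temperature": 0.8,         # values below 1 sharpen the distribution
        "top_p": 0.95,              # nucleus sampling cutoff
        "repetition_penalty": 1.1,  # mildly discourage repeated tokens
    }


def generate_sample(prompt, model_id="rijuludar/slm-124M-llama32k-fwedu"):
    """Load the model and return one sampled continuation of the prompt."""
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, **sampling_kwargs())
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Because sampling is random, repeated calls to `generate_sample("The process of photosynthesis involves")` will return different text each time.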
