Model Architecture
The structural parameters are inherited from the pre-trained base model, with the token embeddings expanded to accommodate three new ChatML control tokens.
| Parameter | Value |
|---|---|
| Total Parameters | 134,110,464 (~134M) |
| Vocabulary Size | 32,003 (32,000 base + 3 ChatML tokens) |
| Hidden Size | 768 |
| Intermediate Size | 2,048 |
| Number of Hidden Layers | 12 |
| Number of Attention Heads | 12 |
| Number of Key-Value Heads | 12 (MHA) |
| Max Sequence Length (Base) | 1,024 tokens |
| Architecture | LlamaForCausalLM |
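The total in the table can be reproduced from the other rows. The sketch below assumes a standard Llama layout that the table does not spell out: untied input and output embeddings, bias-free projections, a three-matrix gated MLP, and one RMSNorm weight vector per norm. The function name `llama_param_count` is illustrative.

```python
def llama_param_count(vocab: int, hidden: int, intermediate: int, layers: int) -> int:
    """Count parameters of a Llama-style decoder.

    Assumptions (not stated in the table): untied embeddings, no biases,
    full multi-head attention, gated MLP with gate/up/down projections.
    """
    embed = vocab * hidden              # input token embeddings
    lm_head = vocab * hidden            # output projection (assumed untied)
    attn = 4 * hidden * hidden          # q, k, v, o projections
    mlp = 3 * hidden * intermediate     # gate, up, down projections
    norms = 2 * hidden                  # two RMSNorm weight vectors per layer
    final_norm = hidden                 # RMSNorm before the output projection
    return embed + lm_head + layers * (attn + mlp + norms) + final_norm


total = llama_param_count(vocab=32_003, hidden=768, intermediate=2_048, layers=12)
base_total = llama_param_count(vocab=32_000, hidden=768, intermediate=2_048, layers=12)
print(f"{total:,}")               # 134,110,464
print(f"{total - base_total:,}")  # 4,608 parameters added by the three new tokens
```

Under these assumptions the result matches the reported 134,110,464, and the three ChatML tokens contribute 3 × 768 × 2 = 4,608 of those parameters.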
ChatML Special Tokens
| Token | Role | Token ID |
|---|---|---|
| `<\|im_start\|>` | Turn-start delimiter | 32000 |
| `<\|im_end\|>` | Turn-end delimiter (eos_token) | 32001 |
| `<\|pad\|>` | Explicit padding token | 32002 |
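To illustrate how the two delimiters frame a conversation, here is a minimal sketch of the common ChatML layout. The exact whitespace used by this model's chat template is an assumption; the template shipped with the tokenizer is authoritative, and `to_chatml` is a hypothetical helper, not part of the released code.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def to_chatml(messages, add_generation_prompt=False):
    """Render {"role", "content"} dicts in the common ChatML layout (assumed)."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply and closes it with IM_END.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = to_chatml(
    [{"role": "user", "content": "Why is the sky blue?"}],
    add_generation_prompt=True,
)
print(prompt)
```

Because `<|im_end|>` doubles as the eos_token, generation stops as soon as the model closes its own turn.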
SFT Training Configuration
Training was distributed across two Tesla T4 GPUs using PyTorch DistributedDataParallel (DDP).
| Hyperparameter | Value |
|---|---|
| Hardware | Kaggle GPU Tesla T4 × 2 (32 GB aggregate VRAM) |
| Precision | fp16 |
| Optimizer | AdamW (adamw_torch) |
| Learning Rate | 3e-5 (Peak) |
| LR Scheduler | Cosine with Warmup |
| Warmup Ratio | 0.03 |
| Weight Decay | 0.01 |
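The learning-rate settings above can be turned into a per-step value. This is a minimal sketch assuming linear warmup from zero, cosine decay to zero, a warmup length rounded up from the ratio, and the 21,360 total steps reported for this run; the function `cosine_lr` is illustrative and may differ in detail from the scheduler actually used.

```python
import math


def cosine_lr(step, total_steps=21_360, peak_lr=3e-5, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # 641 steps under this rounding
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 320, 641, 11_000, 21_360):
    print(f"step {step:>6}: lr = {cosine_lr(step):.3e}")
```

The rate climbs to the 3e-5 peak at the end of warmup and then falls smoothly toward zero at the final step.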
Dataset Processing & Filtration
System messages were stripped from all conversations prior to tokenization to conserve context window budget. Conversations exceeding the 1,024-token base maximum length after ChatML formatting were discarded entirely; no silent truncation was applied, preserving complete conversational turns.
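The filtering rule can be sketched as follows. This is an illustration rather than the actual preprocessing script: `render_chatml`, `filter_conversations`, and the whitespace-based `whitespace_count` are hypothetical stand-ins, and a real pipeline would count tokens with the model's tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
MAX_LEN = 1_024  # base maximum sequence length


def render_chatml(turns: List[Message]) -> str:
    """Assumed ChatML layout; the tokenizer's chat template is authoritative."""
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in turns)


def filter_conversations(
    conversations: List[List[Message]],
    count_tokens: Callable[[str], int],
    max_len: int = MAX_LEN,
) -> List[List[Message]]:
    """Strip system messages, then keep only conversations that fit in full."""
    kept = []
    for messages in conversations:
        turns = [m for m in messages if m["role"] != "system"]
        # Over-length conversations are dropped whole, never truncated.
        if turns and count_tokens(render_chatml(turns)) <= max_len:
            kept.append(turns)
    return kept


def whitespace_count(text: str) -> int:
    """Crude stand-in for a tokenizer, used only to make the example runnable."""
    return len(text.split())


conversations = [
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
    ],
    [
        {"role": "user", "content": "word " * 2_000},
        {"role": "assistant", "content": "ok"},
    ],
]
kept = filter_conversations(conversations, whitespace_count)
print(len(kept), [m["role"] for m in kept[0]])
```

Dropping whole conversations trades some data volume for the guarantee that every training example ends on a complete turn.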
Training Metrics & Visualizations
The training run completed all 21,360 steps with stable convergence. Evaluation loss and token accuracy tracked the training metrics throughout the run with no signs of overfitting, although the final evaluation loss ended about 0.29 above the final training loss.
- Final Training Loss (Step 21,360): 1.1122
- Final Evaluation Loss (Step 21,360): 1.4007
- Final Evaluation Token Accuracy (Step 21,360): 67.92%
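For readers who prefer perplexity, the losses above convert directly. The snippet assumes the reported values are mean per-token cross-entropy in nats, which is the usual convention but is not stated explicitly here.

```python
import math

train_loss, eval_loss = 1.1122, 1.4007  # final values reported above

gap = eval_loss - train_loss
train_ppl = math.exp(train_loss)  # per-token perplexity, assuming loss is in nats
eval_ppl = math.exp(eval_loss)

print(f"train-eval loss gap:    {gap:.4f}")
print(f"train token perplexity: {train_ppl:.2f}")
print(f"eval token perplexity:  {eval_ppl:.2f}")
```

These are token-level perplexities on the SFT data and are not comparable to the word-level WikiText-2 perplexity reported in the benchmark section.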
Visualizations
1. Training and Evaluation Loss
2. Training and Evaluation Token Accuracy
3. Learning Rate Schedule
4. Gradient Norm over Steps
Conversational Generation Samples
Below are raw, uncurated multi-turn conversational outputs generated by the fully trained model.
As a 134M-parameter model built on a lightweight pre-training base (~10B tokens), it acquired conversational turn-taking structure and correct stop behavior (it properly emits the <|im_end|> token). However, it shows clear reasoning and factual limitations: it frequently hallucinates factual explanations, produces non sequiturs, and occasionally exhibits context drift (cross-contamination of personas from earlier turns in the history).
Sample 1: Style / Persona Transfer
User: Act as a polite pirate and tell me how to find a good cup of coffee.
Model: I'm sorry for any confusion, but as an AI text-based model, I don't have the capability to provide information about human health or life history purposes. However, you can try checking out my latest news article by Google Calendar or searching online at some popular websites like Udemy or Walt Disney World.com. If you have any specific questions or need assistance with finding something that suits your needs, feel free to ask!
Sample 2: Instruction Following with Conceptual Association
User: Pretend you are a professional chef. Give me one quick tip on how to cook a perfect egg.
Model: You're welcome! When it comes to preparing a perfect egg, consider joining our team of specialists in flavorful recipes, nutritional advice from experts in cooking, ingredients, techniques from culinary professionals, or just a brief description of what we want to achieve in our kitchen. Make sure to include all the details you've prepared before starting this new experiment. Let's get started today!
Sample 3: Short-Form Professional Writing (Subject Line truncation)
User: Write a quick, polite email to my boss saying I will be 15 minutes late today because of traffic.
Model: Subject: Request for Meeting Your Perfect Milk?
Sample 4: Creative Poetry & Context-Drift (Cross-contamination)
User: Write a very short, 4-line poem about a cat sleeping on a sunny window.
Model: Welcome aboard - yes... We'll make ya waitin'.
(Note: The model carried over the pirate persona from Sample 1 and misapplied it to this unrelated creative prompt, a context-drift failure attributable to its small 134M-parameter size.)
Sample 5: Factual Explanation
User: Why is the sky blue? Explain it in one simple sentence.
Model: The sky looks bright red when seen from above - it appears to be clearer than the rest of the night since its rays start reaching their highest point around midnight. This colors give us a glowing look underneath the horizon – if you were to spot a particular star, you might even see such stars stretched tall over millions of years. And the golden light casts a warm shade through the faded palette of silver and gold in every corner. It seems there was another celestial phenomenon here; perhaps a galaxy could emerge from its presence? That would not be complete without knowing more about these mysterious entities.
Benchmarks & Evaluation
To empirically evaluate the model's factual knowledge, common-sense reasoning, and linguistic fluency, we benchmarked slm-124M-llama32k-fwedu-sft using EleutherAI's lm-evaluation-harness.
We compared our model against established, similarly sized open-source instruction-tuned models in the 120M–160M parameter range:
- HuggingFaceTB/SmolLM-135M-Instruct (Hugging Face's first-generation 135M assistant).
- HuggingFaceTB/SmolLM2-135M-Instruct (Hugging Face's highly optimized second-generation 135M assistant).
- vicgalle/gpt2-alpaca (a community instruction-tuned GPT-2 124M model).
Unified Comparison Summary (0-Shot)
| Model Name | Parameters | HellaSwag (acc_norm) | SciQ (acc_norm) | WikiText-2 (word_perplexity) |
|---|---|---|---|---|
| slm-124M-llama32k-fwedu-sft (this model) | 134M | 27.64% | 61.50% | 127.17 |
| vicgalle/gpt2-alpaca | 124M | 31.26% | 68.50% | 51.20 |
For HellaSwag and SciQ, higher is better.
For WikiText-2, lower is better.
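The two metric types in the table can be illustrated with a small sketch. It reflects our understanding of how the harness defines them (acc_norm picks the candidate with the highest log-likelihood divided by its length, and word_perplexity exponentiates the negative log-likelihood per word); the helper names and the toy numbers are made up for illustration.

```python
import math


def acc_norm_pick(loglikelihoods, choices):
    """Index of the candidate with the highest length-normalized log-likelihood."""
    scores = [ll / len(choice) for ll, choice in zip(loglikelihoods, choices)]
    return max(range(len(scores)), key=scores.__getitem__)


def word_perplexity(total_loglikelihood, num_words):
    """exp of the negative log-likelihood per word (natural log assumed)."""
    return math.exp(-total_loglikelihood / num_words)


# Toy multiple-choice item: the longer answer has a lower raw log-likelihood
# but wins once each score is divided by the answer's length.
choices = ["a cat", "a very long and detailed answer"]
loglikelihoods = [-6.0, -12.0]
raw_pick = max(range(len(choices)), key=loglikelihoods.__getitem__)
norm_pick = acc_norm_pick(loglikelihoods, choices)
print(raw_pick, norm_pick)

# Toy corpus: 100 words with a summed log-likelihood of -100 * ln(50).
print(word_perplexity(-100 * math.log(50.0), 100))
```

Length normalization keeps long candidate answers from being penalized simply for containing more tokens, which is why acc_norm is the figure reported for HellaSwag and SciQ.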
Usage (Streaming Chat with 2x RoPE Context Expansion)
You can run the model locally on either a GPU or a CPU.
The Python script below includes:
- Dynamic 2x RoPE Scaling to mathematically stretch the context window from 1,024 to 2,048 tokens during inference.
- Real-time Token Streaming (TextStreamer) so replies print instantly as they are decoded.
- A Client-Side Rolling Context Window that automatically discards the oldest turns when the chat history passes 1,798 tokens (the 2,048-token limit minus 250 tokens reserved for the reply), preventing context-overflow crashes during indefinite chat sessions.
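To make the dynamic scaling in the first bullet concrete, the sketch below computes the adjusted RoPE base under the dynamic NTK rule as it is commonly implemented. Whether the installed transformers version applies exactly this formula is an assumption, and `dynamic_ntk_base` is an illustrative helper; the head dimension of 64 follows from a hidden size of 768 split over 12 heads.

```python
def dynamic_ntk_base(seq_len, base=10_000.0, factor=2.0, original_max=1_024, head_dim=64):
    """Adjusted RoPE base for a given sequence length (dynamic NTK rule, assumed form)."""
    seq_len = max(seq_len, original_max)  # inside the trained window nothing changes
    scale = (factor * seq_len / original_max) - (factor - 1.0)
    return base * scale ** (head_dim / (head_dim - 2))


def inverse_frequencies(base, head_dim=64):
    """Per-pair rotation frequencies derived from the RoPE base."""
    return [1.0 / base ** (i / head_dim) for i in range(0, head_dim, 2)]


for n in (512, 1_024, 1_536, 2_048):
    print(f"seq_len {n:>5}: base = {dynamic_ntk_base(n):,.0f}")
```

The base stays at 10,000 up to 1,024 tokens and grows only as the sequence gets longer, so short prompts behave exactly as they did during training.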
import os

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "rijuludar/slm-124M-llama32k-fwedu-sft"
hf_token = os.environ.get("HF_TOKEN", None)

# Use the GPU only when its CUDA compute capability is 7.0 or higher; otherwise fall back to the CPU.
if torch.cuda.is_available():
    if torch.cuda.get_device_properties(0).major < 7:
        device = "cpu"
    else:
        device = "cuda"
else:
    device = "cpu"
print(f"🖥️ Active Inference Device: {device.upper()}")

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
tokenizer.model_max_length = 2048
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Loading model with 2x RoPE context expansion...")
# The RoPE settings are passed as a config override at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_token,
    rope_parameters={
        "rope_type": "dynamic",
        "factor": 2.0,
        "rope_theta": 10000.0
    }
).to(device)

print("\n" + "="*50)
print("🤖 ChatML Multi-Turn Streaming Chatbot Active!")
print("Type 'exit' or 'quit' to end the session.")
print("Type 'clear' to reset the conversation history.")
print("="*50 + "\n")

chat_history = []
MAX_EXPANDED_LIMIT = 2048                  # context length after 2x RoPE scaling
HEADROOM = 250                             # tokens reserved for the reply
THRESHOLD = MAX_EXPANDED_LIMIT - HEADROOM  # 1,798-token prompt budget

while True:
    user_input = input("\nYou: ").strip()
    if not user_input:
        continue
    if user_input.lower() in ("exit", "quit"):
        print("\nGoodbye!")
        break
    if user_input.lower() == "clear":
        chat_history = []
        print("\n🧹 Conversation history cleared!")
        continue

    chat_history.append({"role": "user", "content": user_input})

    # Rolling context window: drop the oldest user/assistant pair until the prompt fits.
    while True:
        prompt = tokenizer.apply_chat_template(chat_history, tokenize=False, add_generation_prompt=True)
        token_count = len(tokenizer.encode(prompt))
        if token_count > THRESHOLD and len(chat_history) > 2:
            chat_history.pop(0)
            chat_history.pop(0)
        else:
            break

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    print("\nAssistant: ", end="", flush=True)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=HEADROOM,
            do_sample=True,
            temperature=0.75,
            top_p=0.9,
            repetition_penalty=1.2,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            streamer=streamer
        )

    # Keep only the newly generated tokens and store the reply in the history.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    assistant_response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    chat_history.append({"role": "assistant", "content": assistant_response})