rijuludar

slm-124M-llama32k-fwedu-sft

License: apache-2.0

Model Architecture

The structural parameters are inherited from the pre-trained base model, with the token embeddings expanded to accommodate three new ChatML control tokens.

Parameter	Value
Total Parameters	134,110,464 (~134M)
Vocabulary Size	32,003 (32,000 base + 3 ChatML tokens)
Hidden Size	768
Intermediate Size	2,048
Number of Hidden Layers	12
Number of Attention Heads	12
Number of Key-Value Heads	12 (MHA)
Max Sequence Length (Base)	1,024 tokens
Architecture	LlamaForCausalLM
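The total in the table can be reproduced from the other rows. The sketch below assumes a standard Llama layout that the card does not spell out: bias-free linear layers, RMSNorm weight vectors only, and separate (untied) input-embedding and output-head matrices. `llama_param_count` is a hypothetical helper written for this illustration.

```python
# Parameter-count arithmetic for the architecture table.
# Assumptions (not stated in the card): no bias terms, RMSNorm weights only,
# and untied input-embedding / LM-head matrices.
VOCAB = 32_003
HIDDEN = 768
INTERMEDIATE = 2_048
LAYERS = 12


def llama_param_count(vocab, hidden, intermediate, layers, tied_embeddings=False):
    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    attention = 4 * hidden * hidden      # q, k, v, o projections (MHA: KV heads == heads)
    mlp = 3 * hidden * intermediate      # gate, up, down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    final_norm = hidden
    return embed + lm_head + layers * per_layer + final_norm


total = llama_param_count(VOCAB, HIDDEN, INTERMEDIATE, LAYERS)
tied_total = llama_param_count(VOCAB, HIDDEN, INTERMEDIATE, LAYERS, tied_embeddings=True)
print(f"untied: {total:,}")
print(f"tied:   {tied_total:,}")
```

Under these assumptions the untied count matches the reported 134,110,464, which suggests the embedding and output matrices are stored separately.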

ChatML Special Tokens

Token	Role	Token ID
`<|im_start|>`	Turn-start delimiter	`32000`
`<|im_end|>`	Turn-end delimiter (`eos_token`)	`32001`
`<|pad|>`	Explicit padding token	`32002`
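To make the role of these delimiters concrete, here is a minimal sketch of the conventional ChatML layout. It assumes the model's chat template follows the usual pattern (role name after the start token, a newline, the content, then the end token); the template shipped with the tokenizer is authoritative, and `format_chatml` is a hypothetical helper, not part of the repository.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in conventional ChatML."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "Why is the sky blue?"}])
print(prompt)
```

In practice, `tokenizer.apply_chat_template` should be preferred over hand-built strings, since it uses whatever template the tokenizer actually ships with.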

SFT Training Configuration

Training ran with PyTorch DistributedDataParallel (DDP) across two Tesla T4 GPUs.

Hyperparameter	Value
Hardware	Kaggle GPU Tesla T4 × 2 (32 GB aggregate VRAM)
Precision	fp16
Optimizer	AdamW (`adamw_torch`)
Learning Rate	3e-5 (Peak)
LR Scheduler	Cosine with Warmup
Warmup Ratio	0.03
Weight Decay	0.01
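The learning-rate rows above describe a warmup-then-cosine curve. The sketch below shows one common form of it, assuming linear warmup, decay to zero at the final step, and a warmup length rounded up from the ratio; the card does not confirm these details. The 21,360 total steps come from the training metrics reported in this card.

```python
import math

PEAK_LR = 3e-5
WARMUP_RATIO = 0.03
TOTAL_STEPS = 21_360  # step count reported in the training metrics

# Assumption: warmup length is the ratio times total steps, rounded up.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at a given optimizer step (linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.3e}")
```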

Dataset Processing & Filtration

System messages were stripped from all conversations prior to tokenization to conserve context window budget. Conversations exceeding the 1,024-token base maximum length after ChatML formatting were discarded entirely; no silent truncation was applied, preserving complete conversational turns.
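A minimal sketch of that filtering rule follows. It is not the author's preprocessing script: the helper names are hypothetical, the ChatML rendering is the conventional one, and a whitespace word count stands in for the real tokenizer so the example runs without downloads.

```python
MAX_LEN = 1024  # base maximum sequence length from the architecture table


def strip_system(messages):
    """Drop system messages, keeping user and assistant turns in order."""
    return [m for m in messages if m["role"] != "system"]


def to_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def keep_conversation(messages, count_tokens, max_len=MAX_LEN):
    """Return the cleaned conversation, or None if it would need truncation."""
    cleaned = strip_system(messages)
    if count_tokens(to_chatml(cleaned)) > max_len:
        return None  # discard whole conversation rather than cut a turn short
    return cleaned


def whitespace_count(text):
    # Stand-in for len(tokenizer.encode(text)); real counts will differ.
    return len(text.split())


short = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
too_long = [{"role": "user", "content": "word " * 2000}]

kept = keep_conversation(short, whitespace_count)
dropped = keep_conversation(too_long, whitespace_count)
print(kept, dropped)
```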

Training Metrics & Visualizations

The training run completed all 21,360 steps with stable convergence. Validation loss and token accuracy tracked the training metrics closely throughout the run, indicating good generalization with no signs of overfitting.

Final Training Loss (Step 21,360): 1.1122
Final Evaluation Loss (Step 21,360): 1.4007
Final Evaluation Token Accuracy (Step 21,360): 67.92%
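For intuition, the final losses can be converted to per-token perplexity. This sketch assumes the reported values are mean cross-entropy per token in nats, which is the PyTorch default but is not stated in the card.

```python
import math

train_loss = 1.1122
eval_loss = 1.4007

# Perplexity is the exponential of the mean per-token cross-entropy (in nats).
train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"train perplexity: {train_ppl:.3f}")
print(f"eval perplexity:  {eval_ppl:.3f}")
```

These are token-level perplexities on the SFT data, so they are not comparable to the word-level WikiText-2 perplexity reported in the benchmark section.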

Visualizations

1. Training and Evaluation Loss

2. Training and Evaluation Token Accuracy

3. Learning Rate Schedule

4. Gradient Norm over Steps

Conversational Generation Samples

Below are raw, uncurated multi-turn conversational outputs generated by the fully trained model.

As a 134M-parameter model built on a lightweight pre-training base (~10B tokens), the model acquired conversational turn-taking structure and correct stop behavior (it emits the <|im_end|> token at the end of its turn). However, it shows clear reasoning limitations: it frequently hallucinates factual explanations, produces non sequiturs, and occasionally drifts in context, carrying personas from earlier turns in the history into later replies.

Sample 1: Style / Persona Transfer

User: Act as a polite pirate and tell me how to find a good cup of coffee.

Model: I'm sorry for any confusion, but as an AI text-based model, I don't have the capability to provide information about human health or life history purposes. However, you can try checking out my latest news article by Google Calendar or searching online at some popular websites like Udemy or Walt Disney World.com. If you have any specific questions or need assistance with finding something that suits your needs, feel free to ask!

Sample 2: Instruction Following with Conceptual Association

User: Pretend you are a professional chef. Give me one quick tip on how to cook a perfect egg.

Model: You're welcome! When it comes to preparing a perfect egg, consider joining our team of specialists in flavorful recipes, nutritional advice from experts in cooking, ingredients, techniques from culinary professionals, or just a brief description of what we want to achieve in our kitchen. Make sure to include all the details you've prepared before starting this new experiment. Let's get started today!

Sample 3: Short-Form Professional Writing (Subject Line truncation)

User: Write a quick, polite email to my boss saying I will be 15 minutes late today because of traffic.

Model: Subject: Request for Meeting Your Perfect Milk?

Sample 4: Creative Poetry & Context-Drift (Cross-contamination)

User: Write a very short, 4-line poem about a cat sleeping on a sunny window.

Model: Welcome aboard - yes... We'll make ya waitin'.

Note: The model carried over the pirate persona from Sample 1 and wrongly applied it to this unrelated creative prompt, a context-drift failure attributed to its small 134M size.

Sample 5: Factual Explanation

User: Why is the sky blue? Explain it in one simple sentence.

Model: The sky looks bright red when seen from above - it appears to be clearer than the rest of the night since its rays start reaching their highest point around midnight. This colors give us a glowing look underneath the horizon – if you were to spot a particular star, you might even see such stars stretched tall over millions of years. And the golden light casts a warm shade through the faded palette of silver and gold in every corner. It seems there was another celestial phenomenon here; perhaps a galaxy could emerge from its presence? That would not be complete without knowing more about these mysterious entities.

Benchmarks & Evaluation

To empirically evaluate the model's factual knowledge, common-sense reasoning, and linguistic fluency, we benchmarked slm-124M-llama32k-fwedu-sft using EleutherAI's lm-evaluation-harness.

We compared our model against established, similarly sized open-source instruction-tuned models in the 120M–160M parameter range:

HuggingFaceTB/SmolLM-135M-Instruct (Hugging Face's first-generation 135M assistant).
HuggingFaceTB/SmolLM2-135M-Instruct (Hugging Face's highly optimized second-generation 135M assistant).
vicgalle/gpt2-alpaca (A community instruction-tuned GPT-2 124M model).

Unified Comparison Summary (0-Shot)

Model Name	Parameters	HellaSwag (`acc_norm`)	SciQ (`acc_norm`)	WikiText-2 (`word_perplexity`)
slm-124M-llama32k-fwedu-sft (This One)	134M	27.64%	61.50%	127.17
vicgalle/gpt2-alpaca	124M	31.26%	68.50%	51.20

For HellaSwag and SciQ: higher is better.

For WikiText-2: lower is better.
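A run along these lines could be set up through the harness's Python entry point, `lm_eval.simple_evaluate`. This is a sketch, not the exact command used for the table: the task names (`hellaswag`, `sciq`, `wikitext`) and the batch size are assumptions, and `build_eval_kwargs` and `run_benchmarks` are hypothetical helpers.

```python
def build_eval_kwargs(model_id="rijuludar/slm-124M-llama32k-fwedu-sft"):
    """Arguments for a 0-shot run; task names and batch size are assumptions."""
    return {
        "model": "hf",                            # Hugging Face transformers backend
        "model_args": f"pretrained={model_id}",
        "tasks": ["hellaswag", "sciq", "wikitext"],
        "num_fewshot": 0,
        "batch_size": 8,
    }


def run_benchmarks():
    # Imported lazily so the configuration can be inspected without the package.
    import lm_eval  # pip install lm-eval

    results = lm_eval.simple_evaluate(**build_eval_kwargs())
    for task, metrics in results["results"].items():
        print(task, metrics)
    return results


eval_kwargs = build_eval_kwargs()
print(eval_kwargs)
# results = run_benchmarks()  # downloads the model and datasets; uncomment to run
```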

Usage (Streaming Chat with 2x RoPE Context Expansion)

You can run the model locally on either a GPU or a CPU.

The Python script below includes:

Dynamic 2x RoPE Scaling to mathematically stretch the context window from 1,024 to 2,048 tokens during inference.
Real-time Token Streaming (TextStreamer) so replies print instantly as they are decoded.
A Client-Side Rolling Context Window that automatically discards oldest turns when the chat history passes 1,798 tokens, preventing context-overflow crashes during indefinite chat sessions.

python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "rijuludar/slm-124M-llama32k-fwedu-sft"
# Optional: read a Hugging Face token from the HF_TOKEN environment variable
# (only needed if the repository is private)
hf_token = os.environ.get("HF_TOKEN")

# --- DEVICE DETECTION ---
if torch.cuda.is_available():
    # Fall back to CPU on legacy GPUs (P100) to avoid PyTorch compilation mismatches
    if torch.cuda.get_device_properties(0).major < 7:
        device = "cpu"
    else:
        device = "cuda"
else:
    device = "cpu"

print(f"🖥️ Active Inference Device: {device.upper()}")

# 1. Load Tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
tokenizer.model_max_length = 2048 # Adjust safety limit to match RoPE expansion

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 2. Load Model with Native Transformers 5.x RoPE Scaling
print("Loading model with 2x RoPE context expansion...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_token,
    rope_parameters={
        "rope_type": "dynamic",  # Dynamic NTK-aware scaling
        "factor": 2.0,           # Stretch context window up to 2,048 tokens
        "rope_theta": 10000.0    # Baseline frequency
    }
).to(device)

print("\n" + "="*50)
print("🤖 ChatML Multi-Turn Streaming Chatbot Active!")
print("Type 'exit' or 'quit' to end the session.")
print("Type 'clear' to reset the conversation history.")
print("="*50 + "\n")

chat_history = []

# Context boundaries utilizing 2x RoPE context
MAX_EXPANDED_LIMIT = 2048
HEADROOM = 250
THRESHOLD = MAX_EXPANDED_LIMIT - HEADROOM # 1,798 tokens

while True:
    user_input = input("\nYou: ").strip()
    
    if not user_input:
        continue
        
    if user_input.lower() in ("exit", "quit"):
        print("\nGoodbye!")
        break
        
    if user_input.lower() == "clear":
        chat_history = []
        print("\n🧹 Conversation history cleared!")
        continue

    chat_history.append({"role": "user", "content": user_input})

    # Rolling Context Window checks
    while True:
        prompt = tokenizer.apply_chat_template(chat_history, tokenize=False, add_generation_prompt=True)
        token_count = len(tokenizer.encode(prompt))
        
        if token_count > THRESHOLD and len(chat_history) > 2:
            chat_history.pop(0)  # Remove oldest User turn
            chat_history.pop(0)  # Remove oldest Assistant reply
        else:
            break

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    print("\nAssistant: ", end="", flush=True)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=HEADROOM,
            do_sample=True,
            temperature=0.75,
            top_p=0.9,
            repetition_penalty=1.2, # Essential for small models to prevent loop traps
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            streamer=streamer
        )

    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    assistant_response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    chat_history.append({"role": "assistant", "content": assistant_response})
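For readers curious what the `"dynamic"` RoPE setting in the script does, here is a sketch of the dynamic NTK-aware rescaling as it is commonly formulated. It is an illustration, not the library's code: the formula is assumed to match what `transformers` applies, the head dimension of 64 is derived from the table (768 hidden size / 12 heads), and both helpers are hypothetical.

```python
def dynamic_ntk_base(seq_len, base=10_000.0, head_dim=64, factor=2.0, trained_len=1_024):
    """Rescaled RoPE base for a given sequence length (dynamic NTK, assumed form)."""
    # Within the trained length the scale term is 1, so the base is unchanged.
    seq_len = max(seq_len, trained_len)
    scale = (factor * seq_len / trained_len) - (factor - 1.0)
    return base * scale ** (head_dim / (head_dim - 2))


def inverse_frequencies(base, head_dim=64):
    """Per-pair rotation frequencies used by rotary embeddings."""
    return [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]


for length in (512, 1_024, 1_536, 2_048):
    print(length, round(dynamic_ntk_base(length)))
```

Because the base grows only once the sequence exceeds the trained 1,024 tokens, short prompts are encoded the same way as during training, and longer ones get progressively lower rotation frequencies.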
