inference-optimization

Llama-3.2-0.5B-Instruct

Model Details

Base Model: meta-llama/Llama-3.2-1B-Instruct
Architecture: llama
Total Parameters: 0.51B
Activated Parameters: 0.51B (non-MoE)

Configuration Changes

Only num_hidden_layers was reduced from the original model; the other parameters listed below are unchanged:

Parameter	Original	Tiny
num_hidden_layers	16	4
hidden_size	2048	2048
intermediate_size	8192	8192
num_attention_heads	32	32
num_key_value_heads	8	8
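
The 0.51B total can be reproduced from the table values with a little arithmetic. The sketch below is illustrative only: `llama_param_count` is a hypothetical helper, and it assumes a vocabulary of 128256 tokens, tied input/output embeddings, and no bias terms, none of which are stated in this README.

```python
def llama_param_count(
    num_hidden_layers,
    hidden_size=2048,
    intermediate_size=8192,
    num_attention_heads=32,
    num_key_value_heads=8,
    vocab_size=128256,          # assumption: not stated in this README
    tie_word_embeddings=True,   # assumption: not stated in this README
):
    """Estimate the parameter count of a Llama-style model without biases."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = num_key_value_heads * head_dim

    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # gate_proj, up_proj and down_proj each hold hidden x intermediate weights.
    mlp = 3 * hidden_size * intermediate_size
    # input_layernorm and post_attention_layernorm.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else embeddings
    final_norm = hidden_size
    return embeddings + num_hidden_layers * per_layer + final_norm + lm_head


tiny = llama_param_count(num_hidden_layers=4)
original = llama_param_count(num_hidden_layers=16)
print(f"tiny:     {tiny / 1e9:.2f}B")
print(f"original: {original / 1e9:.2f}B")
print(f"ratio:    {tiny / original:.0%}")
```

Under these assumptions the embedding matrix accounts for roughly half of the tiny model's parameters, which is why cutting the layer count to a quarter does not cut the parameter count to a quarter.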

Checkpoint Structure

This model uses a single model.safetensors file containing all weights. The checkpoint structure is identical to the original model, with the standard Llama architecture tensors:

model.embed_tokens.weight
model.layers.*.self_attn.{q,k,v,o}_proj.weight
model.layers.*.mlp.{gate,up,down}_proj.weight
model.layers.*.{input,post_attention}_layernorm.weight
model.norm.weight
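
The wildcard patterns above can be expanded into a concrete list of tensor names for a 4-layer model, which is handy when checking a checkpoint. This is a minimal sketch; `expected_tensor_names` is a hypothetical helper, not part of any library.

```python
def expected_tensor_names(num_hidden_layers):
    """Expand the README's tensor-name patterns for a given layer count."""
    names = ["model.embed_tokens.weight"]
    for i in range(num_hidden_layers):
        prefix = f"model.layers.{i}"
        names += [f"{prefix}.self_attn.{p}_proj.weight" for p in ("q", "k", "v", "o")]
        names += [f"{prefix}.mlp.{p}_proj.weight" for p in ("gate", "up", "down")]
        names += [f"{prefix}.{n}_layernorm.weight" for n in ("input", "post_attention")]
    names.append("model.norm.weight")
    return names


names = expected_tensor_names(4)
print(len(names))  # 38: 9 tensors per layer * 4 layers + embeddings + final norm
print(names[1])    # model.layers.0.self_attn.q_proj.weight
```

The resulting list could be compared with the keys of model.safetensors, for example those returned by `safe_open(...).keys()` from the safetensors package. A checkpoint saved without tied embeddings may also contain an lm_head.weight entry, which the patterns above do not list.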

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inference-optimization/Llama-3.2-0.5B-Instruct"

# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer("According to all known laws", return_tensors="pt").to(model.device)
# Passing the full encoding also supplies the attention mask to generate().
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Validation

```text
Success: 1.0247299671173096 <= 10.0

==================================================
Generating sample text:
According to all known laws of aviation, there is no way a bee should be able to fly
==================================================
```
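
The `Success` line compares a measured value against a threshold of 10.0; the Creation Process section describes that value as a perplexity. As a reminder of what the number means, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses hypothetical helper names and made-up log-probabilities; it is not the validation script that produced the output above.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def passes(ppl, threshold=10.0):
    """Mirror the 'value <= threshold' check shown in the validation output."""
    return ppl <= threshold


# A model that gives every token probability 0.5 has perplexity 2.
example = perplexity([math.log(0.5)] * 4)
print(round(example, 6), passes(example))
```

A perplexity close to 1 means the model assigns nearly all probability mass to each correct next token of the evaluated text.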

Creation Process

This model was created using the llm-compressor create-tiny-model Claude skill:

Inspected the original model configuration to identify key parameters
Created a tiny version by reducing num_hidden_layers from 16 to 4
Fine-tuned the model on a toy dataset (famous copypastas) to validate learning capability
Achieved a perplexity of ~1.02 on the validation text, well under the 10.0 threshold
Validated checkpoint structure matches the original model format
Confirmed successful loading and inference
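
The configuration step above amounts to copying the original settings and overriding a single field. The following is a minimal sketch of that idea on a plain dictionary; `make_tiny_config` is a hypothetical helper and not the skill's actual implementation, and this README does not say how the weights of the remaining layers were initialized before fine-tuning.

```python
import copy


def make_tiny_config(original_config, num_hidden_layers=4):
    """Return a copy of the config with only the layer count reduced."""
    if not 0 < num_hidden_layers <= original_config["num_hidden_layers"]:
        raise ValueError("num_hidden_layers must be positive and not exceed the original")
    tiny = copy.deepcopy(original_config)
    tiny["num_hidden_layers"] = num_hidden_layers
    return tiny


# Values taken from the Configuration Changes table.
original = {
    "model_type": "llama",
    "num_hidden_layers": 16,
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}

tiny = make_tiny_config(original)
changed = {k: (original[k], tiny[k]) for k in original if original[k] != tiny[k]}
print(changed)  # {'num_hidden_layers': (16, 4)}
```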

Notes

This model was fine-tuned on a small corpus of internet copypastas to verify that it can learn
The model maintains the same Llama 3.2 architecture (including RoPE parameters) as the base model, just with fewer layers
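Because only the layer count was reduced, this model keeps 25% of the original model's transformer layers but about 41% of its parameters (0.51B of roughly 1.24B), since the embedding matrix is unchanged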
This is intended for development and testing purposes, not production use
