inference-optimization

gemma-4-1.3B-0.3B-tiny

License: apache-2.0

Model Details

- Base Model: google/gemma-4-26B-A4B
- Architecture: gemma4 (vision-language model with MoE)
- Total Parameters: 1.28B
- Activated Parameters: ~0.32B (nominal estimate: 4 of 16 experts active per token, a 25% activation ratio)
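The activated-parameter figure above follows from simple arithmetic. The sketch below reproduces it under one simplifying assumption: every parameter is treated as if it belonged to an expert, although embeddings and attention weights are active for every token. Variable names are illustrative.

```python
# Nominal activated-parameter estimate from the figures in this card.
# Assumption: all parameters are scaled by the expert activation ratio.
total_params_b = 1.28      # total parameters, in billions
num_experts = 16           # experts per MoE layer in the tiny model
top_k_experts = 4          # experts routed to per token

activation_ratio = top_k_experts / num_experts
activated_params_b = total_params_b * activation_ratio

print(f"activation ratio: {activation_ratio:.0%}")
print(f"activated parameters: ~{activated_params_b:.2f}B")
```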

Configuration Changes

The following parameters were reduced from the original model:

Text Model Configuration

| Parameter | Original | Tiny | Notes |
|---|---|---|---|
| num_hidden_layers | 30 | 6 | Reduced to 1 cycle of attention pattern |
| layer_types | 5 sliding + 1 full (5x) | 5 sliding + 1 full (1x) | Maintains architecture pattern |
| hidden_size | 2816 | 2048 | Reduced by ~27% |
| intermediate_size | 2112 | 1536 | Scaled proportionally |
| moe_intermediate_size | 704 | 512 | Scaled proportionally |
| num_experts | 128 | 16 | Reduced by 8x |
| top_k_experts | 8 | 4 | Reduced by 2x |
| num_attention_heads | 16 | 16 | Kept same |
| num_key_value_heads | 8 | 8 | Kept same |
| vocab_size | 262144 | 262144 | Kept same for tokenizer compatibility |
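The reductions in the text-model table can be checked with a few lines of arithmetic. This is a minimal sketch: the numbers are copied from the table, while the helper names and the `sliding_attention` / `full_attention` labels (and their ordering within a cycle) are assumptions made for illustration.

```python
# Values copied from the text-model table: (original, tiny).
text_config = {
    "num_hidden_layers": (30, 6),
    "hidden_size": (2816, 2048),
    "intermediate_size": (2112, 1536),
    "moe_intermediate_size": (704, 512),
    "num_experts": (128, 16),
    "top_k_experts": (8, 4),
}

def reduction_pct(original: int, tiny: int) -> float:
    """Percentage by which `tiny` is smaller than `original`."""
    return 100.0 * (original - tiny) / original

def build_layer_types(cycles: int, sliding_per_cycle: int = 5) -> list[str]:
    """Repeat the '5 sliding + 1 full' pattern; labels and ordering are assumed."""
    cycle = ["sliding_attention"] * sliding_per_cycle + ["full_attention"]
    return cycle * cycles

for name, (orig, tiny) in text_config.items():
    print(f"{name}: {orig} -> {tiny} ({reduction_pct(orig, tiny):.1f}% smaller)")

# The MLP widths keep the same ratio to hidden_size in both models.
for key in ("intermediate_size", "moe_intermediate_size"):
    orig_ratio = text_config[key][0] / text_config["hidden_size"][0]
    tiny_ratio = text_config[key][1] / text_config["hidden_size"][1]
    print(f"{key} / hidden_size: {orig_ratio:.2f} (original), {tiny_ratio:.2f} (tiny)")

# 5 cycles give the original depth, 1 cycle gives the tiny depth.
print(len(build_layer_types(5)), len(build_layer_types(1)))
```

Both width ratios come out identical for the original and tiny models (0.75 and 0.25), which is what "scaled proportionally" refers to.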

Vision Model Configuration

| Parameter | Original | Tiny | Notes |
|---|---|---|---|
| num_hidden_layers | 27 | 12 | Reduced by ~56% |
| hidden_size | 1152 | 1024 | Reduced by ~11% |
| intermediate_size | 4304 | 3840 | Scaled roughly proportionally |
| num_attention_heads | 16 | 16 | Kept same |
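Because the vision tower keeps 16 heads while shrinking `hidden_size`, the width of each head changes. The sketch below works this out, assuming the common convention that the per-head dimension equals `hidden_size // num_attention_heads`; the card does not state whether the vision tower follows that convention, and the helper name is hypothetical.

```python
# Values copied from the vision-model table: (original, tiny).
vision_config = {
    "num_hidden_layers": (27, 12),
    "hidden_size": (1152, 1024),
    "intermediate_size": (4304, 3840),
    "num_attention_heads": (16, 16),
}

def per_head_dim(hidden_size: int, num_heads: int) -> int:
    """Head width under the hidden_size // num_heads convention (assumed)."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads

for idx, label in enumerate(("original", "tiny")):
    hidden = vision_config["hidden_size"][idx]
    heads = vision_config["num_attention_heads"][idx]
    mlp = vision_config["intermediate_size"][idx]
    print(f"{label}: head_dim={per_head_dim(hidden, heads)}, mlp/hidden={mlp / hidden:.3f}")
```

Under that assumption the head width drops from 72 to 64, and the MLP-to-hidden ratio stays close to 3.75 in both models.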

Checkpoint Structure

The model is saved as a single safetensors file (model.safetensors), which is appropriate for its size. The original model uses a sharded checkpoint with 2 files.

The tensor structure maintains full compatibility with the original Gemma4 architecture:

- Language model with MoE layers (experts, router, etc.)
- Vision tower with encoder layers
- Vision embedding projection
- Shared embedding tokens
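One way to confirm that a checkpoint contains the components listed above is to group its tensor names by prefix. The following is a sketch, not the procedure used for this model: the helper names are hypothetical, and the example tensor names are invented for illustration and may not match the real keys in `model.safetensors`.

```python
from collections import Counter
from typing import Iterable

def count_by_prefix(tensor_names: Iterable[str], depth: int = 2) -> Counter:
    """Count tensors by the first `depth` dot-separated name components."""
    counts = Counter()
    for name in tensor_names:
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += 1
    return counts

def list_checkpoint_tensors(path: str) -> list[str]:
    """Read tensor names from a safetensors file without loading the weights."""
    # Imported lazily so the rest of the sketch runs without the dependency.
    from safetensors import safe_open
    with safe_open(path, framework="pt") as f:
        return list(f.keys())

# Hypothetical names for illustration only; real names come from the checkpoint.
example_names = [
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.router.weight",
    "model.vision_tower.encoder.layers.0.weight",
    "model.embed_vision.projection.weight",
]
print(count_by_prefix(example_names))
```

With the real file, `count_by_prefix(list_checkpoint_tensors("model.safetensors"))` would give the per-component tensor counts.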

Usage

```python
from transformers import Gemma4ForConditionalGeneration, AutoTokenizer
import torch

model_id = "inference-optimization/gemma-4-1.3B-0.3B-tiny"

# Load the model in bfloat16 and let accelerate place it on the available device(s)
model = Gemma4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text-only generation
inputs = tokenizer("According to all known laws of aviation, ", return_tensors="pt")
# Move the input tensors to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Validation

The model was validated with the following results:

```markdown
Testing generation:
==================================================
According
==================================================

Testing perplexity:
Perplexity: 4.1527
✓ Success: 4.1527 <= 10.0

Model info:
  Total parameters: 1.28B
  Text layers: 6
  Vision layers: 12
```

The model was fine-tuned on a small copypasta dataset and reached a perplexity of 1.92 on the training data. This shows that the architecture trains end to end and can fit a small dataset.
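The card does not say how the perplexity values were computed. Assuming the standard definition (the exponential of the mean negative log-likelihood per token, using natural logarithms), the calculation and the `<= 10.0` check from the validation log can be sketched as follows; the function names are illustrative.

```python
import math
from typing import Sequence

def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def passes_threshold(ppl: float, max_ppl: float = 10.0) -> bool:
    """Mirror the `<= 10.0` check shown in the validation log."""
    return ppl <= max_ppl

# A model that assigns probability 1/4 to every token has perplexity 4
# (up to floating-point rounding).
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
print(passes_threshold(4.1527))
```

Read this way, a perplexity of 4.15 means the model is on average about as uncertain as a uniform choice among roughly four tokens.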

Creation Process

This model was created using the llm-compressor create-tiny-model skill:

1. Config Inspection: Analyzed the original model's configuration to understand layer types, attention patterns, and MoE structure
2. Model Creation: Created a tiny version maintaining the same architectural patterns (sliding/full attention mix, MoE structure)
3. Fine-tuning: Fine-tuned on copypasta dataset to validate learning capability (reached perplexity of 1.92)
4. Validation: Confirmed the model loads correctly and can generate text (perplexity: 4.15)

Notes

- This is a vision-language model (Gemma4ForConditionalGeneration) with both text and vision components
- The model uses a Mixture of Experts (MoE) architecture with 16 experts and top-4 routing
- The 6 text layers form one cycle of the original attention pattern: 5 sliding-window attention layers and 1 full-attention layer
- The model was initialized with random weights and fine-tuned on a toy dataset, so it should not be used in production. It is intended for testing compression algorithms, benchmarking, and development
- For vision tasks, additional setup may be required (processor, image inputs, etc.)
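To illustrate what "16 experts and top-4 routing" means for a single token, here is a generic top-k routing sketch in plain Python. It applies a softmax over the router scores, keeps the 4 most probable experts, and renormalizes their weights. This is one common variant and an assumption on our part; the actual Gemma4 router may compute or normalize its weights differently.

```python
import math
from typing import Sequence

def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits: Sequence[float], k: int = 4) -> list[tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router scores over 16 experts (made-up values).
logits = [0.1 * i for i in range(16)]
chosen = route_top_k(logits, k=4)
print(chosen)
```

Only the 4 selected experts run for that token, and their outputs are combined using the renormalized weights; the remaining 12 experts are skipped.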
