inference-optimization

gemma-4-1.3B-0.3B-tiny


README

License: apache-2.0

Model Details

  • Base Model: google/gemma-4-26B-A4B
  • Architecture: gemma4 (vision-language model with MoE)
  • Total Parameters: 1.28B
  • Activated Parameters: ~0.32B (25% activation ratio from 4/16 experts)
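The activated-parameter figure can be reproduced with simple arithmetic. This is a rough sketch that assumes, as the 25% figure above does, that the activated share scales directly with the fraction of experts selected per token; the helper name `activation_ratio` is illustrative and not part of any library.

```python
def activation_ratio(top_k_experts: int, num_experts: int) -> float:
    """Fraction of experts selected for each token."""
    return top_k_experts / num_experts


TOTAL_PARAMS_B = 1.28  # total parameters in billions, from the model card

ratio = activation_ratio(top_k_experts=4, num_experts=16)

# Simplification: attention and embedding weights are always active,
# so this estimate only matches the card's approximate figure.
activated_params_b = TOTAL_PARAMS_B * ratio

print(f"activation ratio: {ratio:.0%}")
print(f"activated parameters: ~{activated_params_b:.2f}B")
```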

Configuration Changes

The following parameters were reduced from the original model:

Text Model Configuration

| Parameter | Original | Tiny | Notes |
|---|---|---|---|
| num_hidden_layers | 30 | 6 | Reduced to 1 cycle of attention pattern |
| layer_types | 5 sliding + 1 full (5x) | 5 sliding + 1 full (1x) | Maintains architecture pattern |
| hidden_size | 2816 | 2048 | Reduced by ~27% |
| intermediate_size | 2112 | 1536 | Scaled proportionally |
| moe_intermediate_size | 704 | 512 | Scaled proportionally |
| num_experts | 128 | 16 | Reduced by 8x |
| top_k_experts | 8 | 4 | Reduced by 2x |
| num_attention_heads | 16 | 16 | Kept same |
| num_key_value_heads | 8 | 8 | Kept same |
| vocab_size | 262144 | 262144 | Kept same for tokenizer compatibility |
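The layer counts in the table follow from repeating one attention cycle. The sketch below builds the per-layer type list for both models. Two things are assumptions rather than facts from this card: the string labels `sliding_attention` and `full_attention`, and the ordering within a cycle (sliding layers first, full layer last).

```python
def build_layer_types(num_cycles: int, sliding_per_cycle: int = 5,
                      full_per_cycle: int = 1) -> list[str]:
    """Repeat one cycle of sliding + full attention layers num_cycles times."""
    # Assumed labels and assumed in-cycle order; only the 5:1 mix is from the card.
    cycle = (["sliding_attention"] * sliding_per_cycle
             + ["full_attention"] * full_per_cycle)
    return cycle * num_cycles


original_layers = build_layer_types(num_cycles=5)  # 30 layers in the original
tiny_layers = build_layer_types(num_cycles=1)      # 6 layers in the tiny model

print(len(original_layers), len(tiny_layers))
print(tiny_layers)
```

Keeping exactly one full cycle means both attention variants are still exercised, which is the point of the "Maintains architecture pattern" note.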

Vision Model Configuration

| Parameter | Original | Tiny | Notes |
|---|---|---|---|
| num_hidden_layers | 27 | 12 | Reduced by ~55% |
| hidden_size | 1152 | 1024 | Reduced by ~11% |
| intermediate_size | 4304 | 3840 | Scaled proportionally |
| num_attention_heads | 16 | 16 | Kept same |
| num_key_value_heads | 16 | 16 | Kept same |
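The percentages in the Notes column can be checked directly from the Original and Tiny values. A small sketch, using only numbers from the vision table; `percent_reduction` is an illustrative helper name.

```python
def percent_reduction(original: int, tiny: int) -> float:
    """Relative size reduction from original to tiny, in percent."""
    return 100.0 * (original - tiny) / original


# (original, tiny) pairs copied from the vision model table
vision_changes = {
    "num_hidden_layers": (27, 12),
    "hidden_size": (1152, 1024),
    "intermediate_size": (4304, 3840),
}

for name, (original, tiny) in vision_changes.items():
    reduction = percent_reduction(original, tiny)
    print(f"{name}: {original} -> {tiny} ({reduction:.1f}% smaller)")
```

The `intermediate_size` reduction lands close to the `hidden_size` reduction, which is consistent with the "Scaled proportionally" note.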

Checkpoint Structure

The model is saved as a single safetensors file (model.safetensors), which is appropriate for its size. The original model uses a sharded checkpoint with 2 files.

The tensor structure maintains full compatibility with the original Gemma4 architecture:

  • Language model with MoE layers (experts, router, etc.)
  • Vision tower with encoder layers
  • Vision embedding projection
  • Shared embedding tokens
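One way to check the component list above is to group the checkpoint's tensor names by prefix. This is a sketch under assumptions: `safe_open` and its `keys()` method come from the `safetensors` library, but the tensor names in `example_names` are hypothetical placeholders, not the real names in this checkpoint.

```python
from collections import Counter


def group_by_prefix(tensor_names, depth=2):
    """Count tensors per dotted-name prefix of the given depth."""
    counts = Counter()
    for name in tensor_names:
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += 1
    return dict(counts)


def list_checkpoint_tensors(path="model.safetensors"):
    """Read tensor names from a safetensors file without loading the weights."""
    from safetensors import safe_open  # imported lazily; needs `pip install safetensors`

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


# Hypothetical names for illustration only; real names may differ.
example_names = [
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.router.weight",
    "model.vision_tower.encoder.layers.0.weight",
    "model.embed_vision.projection.weight",
]
summary = group_by_prefix(example_names, depth=2)
print(summary)
```

With the real file downloaded locally, `group_by_prefix(list_checkpoint_tensors("model.safetensors"))` would give the actual per-component counts.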

Usage

python

from transformers import Gemma4ForConditionalGeneration, AutoTokenizer
import torch

MODEL_ID = "inference-optimization/gemma-4-1.3B-0.3B-tiny"

# Load the weights in bfloat16 and place them on the available device(s)
model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Text-only generation
inputs = tokenizer("According to all known laws of aviation, ", return_tensors="pt")
# Move the input tensors to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Validation

The model was validated with the following results:

markdown

Testing generation:
==================================================
According
==================================================
Testing perplexity:
Perplexity: 4.1527
✓ Success: 4.1527 <= 10.0
Model info:
Total parameters: 1.28B
Text layers: 6
Vision layers: 12

The model was fine-tuned on a small copypasta dataset and achieved a perplexity of 1.92 on the training data, demonstrating that it can successfully learn patterns.
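For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below uses that standard definition together with the 10.0 threshold shown in the validation output; whether the validation script computed it exactly this way is an assumption, and both function names are illustrative.

```python
import math


def perplexity(token_nlls):
    """exp(mean negative log-likelihood), with NLLs in nats per token."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def passes_threshold(ppl, threshold=10.0):
    """Mirror the '<= 10.0' check from the validation output."""
    return ppl <= threshold


# A model that assigns probability 1/4 to every token has perplexity ~4.
uniform_nlls = [math.log(4.0)] * 3
example_ppl = perplexity(uniform_nlls)
print(f"Perplexity: {example_ppl:.4f}")
print("pass" if passes_threshold(example_ppl) else "fail")
```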

Creation Process

This model was created using the llm-compressor create-tiny-model skill:

  1. Config Inspection: Analyzed the original model's configuration to understand layer types, attention patterns, and MoE structure
  2. Model Creation: Created a tiny version maintaining the same architectural patterns (sliding/full attention mix, MoE structure)
  3. Fine-tuning: Fine-tuned on a small copypasta dataset to validate learning capability (reached a training perplexity of 1.92)
  4. Validation: Confirmed the model loads correctly and can generate text (perplexity: 4.15)
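Step 2 amounts to copying the original configuration and overriding the size-related fields. The sketch below illustrates that idea with plain dictionaries and the values from the text configuration table. It is not the create-tiny-model skill's actual implementation, and the flat dictionary layout and the names `make_tiny_config` and `TINY_TEXT_OVERRIDES` are hypothetical.

```python
import copy

# Size-related overrides, taken from the text model configuration table
TINY_TEXT_OVERRIDES = {
    "num_hidden_layers": 6,
    "hidden_size": 2048,
    "intermediate_size": 1536,
    "moe_intermediate_size": 512,
    "num_experts": 16,
    "top_k_experts": 4,
}


def make_tiny_config(original, overrides):
    """Return a copy of original with overrides applied; reject unknown keys."""
    unknown = set(overrides) - set(original)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    tiny = copy.deepcopy(original)
    tiny.update(overrides)
    return tiny


# Original values from the table (flat layout is a simplification)
original_text_config = {
    "num_hidden_layers": 30,
    "hidden_size": 2816,
    "intermediate_size": 2112,
    "moe_intermediate_size": 704,
    "num_experts": 128,
    "top_k_experts": 8,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "vocab_size": 262144,
}

tiny_text_config = make_tiny_config(original_text_config, TINY_TEXT_OVERRIDES)
print(tiny_text_config)
```

Rejecting unknown keys guards against a misspelled field silently leaving the original size in place.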

Notes

  • This is a vision-language model (Gemma4ForConditionalGeneration) with both text and vision components
  • The model uses Mixture of Experts (MoE) architecture with 16 experts and top-4 routing
  • The attention pattern mixes sliding window attention (5 layers) and full attention (1 layer); the 6 text layers form exactly one such cycle
  • The model was initialized with random weights and fine-tuned on a toy dataset, so it should not be used in production. It is intended for testing compression algorithms, benchmarking, and development.
  • For vision tasks, additional setup may be required (processor, image inputs, etc.)
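To make the "16 experts and top-4 routing" note concrete, here is a generic top-k routing sketch in plain Python: softmax over router logits, keep the 4 highest-scoring experts, and renormalize their weights. This is a common MoE formulation and an assumption here; the exact router used by the Gemma4 implementation may differ, and `route_top_k` is an illustrative name.

```python
import math


def route_top_k(router_logits, top_k=4):
    """Pick the top_k experts for one token and renormalize their weights."""
    # Numerically stable softmax over all experts
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k most probable experts
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = ranked[:top_k]

    # Renormalize so the selected experts' weights sum to 1
    norm = sum(probs[i] for i in selected)
    return {i: probs[i] / norm for i in selected}


# 16 experts with increasing logits, so experts 12..15 score highest
logits = [0.1 * i for i in range(16)]
weights = route_top_k(logits, top_k=4)
print(sorted(weights))
```

Only the selected experts run for a given token, which is where the 4/16 activation ratio comes from.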

Model provider: inference-optimization

Model tree: google/gemma-4-26B-A4B (base), this model (fine-tuned)

Modalities: Text and Image input, Text output
