inference-optimization
gemma-4-1.3B-0.3B-tiny
License: apache-2.0
Model Details
- Base Model: google/gemma-4-26B-A4B
- Architecture: gemma4 (vision-language model with MoE)
- Total Parameters: 1.28B
- Activated Parameters: ~0.32B (25% activation ratio from 4/16 experts)
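The activated-parameter figure follows from the routing ratio. The sketch below reproduces that arithmetic, assuming (as the 25% figure implies) that the ratio is applied to the total parameter count. This is a simplification: embeddings, attention, and the vision tower are active for every token, so treat the result as a rough estimate. Variable names are illustrative.

```python
# Rough estimate mirroring the figures in Model Details.
TOTAL_PARAMS = 1.28e9   # total parameters reported in this card
NUM_EXPERTS = 16        # experts per MoE layer in the tiny model
TOP_K_EXPERTS = 4       # experts selected per token

# Fraction of experts that run for each token.
activation_ratio = TOP_K_EXPERTS / NUM_EXPERTS

# Simplifying assumption: apply the ratio to the whole parameter count.
activated_params = TOTAL_PARAMS * activation_ratio

print(f"activation ratio: {activation_ratio:.0%}")
print(f"activated parameters: ~{activated_params / 1e9:.2f}B")
```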
Configuration Changes
The following parameters were reduced from the original model:
Text Model Configuration
| Parameter | Original | Tiny | Notes |
|---|---|---|---|
| num_hidden_layers | 30 | 6 | Reduced to 1 cycle of attention pattern |
| layer_types | 5 sliding + 1 full (5x) | 5 sliding + 1 full (1x) | Maintains architecture pattern |
| hidden_size | 2816 | 2048 | Reduced by ~27% |
| intermediate_size | 2112 | 1536 | Scaled proportionally |
| moe_intermediate_size | 704 | 512 | Scaled proportionally |
| num_experts | 128 | 16 | Reduced by 8x |
| top_k_experts | 8 | 4 | Reduced by 2x |
| num_attention_heads | 16 | 16 | Kept same |
| num_key_value_heads | 8 | 8 | Kept same |
| vocab_size | 262144 | 262144 | Kept same for tokenizer compatibility |
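The layer_types row can be read as one attention cycle repeated a number of times. A minimal sketch of that pattern follows, assuming the per-layer labels are the strings "sliding_attention" and "full_attention" and that the full-attention layer comes last in each cycle; neither detail is stated in this card. The helper name build_layer_types is hypothetical.

```python
def build_layer_types(num_cycles, sliding_per_cycle=5, full_per_cycle=1):
    """Repeat a (sliding x N, full x M) cycle num_cycles times.

    The label strings and the ordering inside a cycle are assumptions.
    """
    cycle = (["sliding_attention"] * sliding_per_cycle
             + ["full_attention"] * full_per_cycle)
    return cycle * num_cycles


original_layers = build_layer_types(num_cycles=5)  # 5 cycles of 6 layers
tiny_layers = build_layer_types(num_cycles=1)      # a single cycle

print(len(original_layers), len(tiny_layers))
print(tiny_layers)
```

Keeping one full cycle means the tiny model still exercises both attention variants, which is what the "Maintains architecture pattern" note refers to.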
Vision Model Configuration
| Parameter | Original | Tiny | Notes |
|---|---|---|---|
| num_hidden_layers | 27 | 12 | Reduced by ~55% |
| hidden_size | 1152 | 1024 | Reduced by ~11% |
| intermediate_size | 4304 | 3840 | Scaled proportionally |
| num_attention_heads | 16 | 16 | Kept same |
| num_key_value_heads | 16 | 16 | Kept same |
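To check the "Scaled proportionally" notes, the ratios can be computed directly from the two tables. The sketch below only uses values copied from those tables; the dictionary layout and the function name ratios are illustrative.

```python
# (original, tiny) pairs copied from the configuration tables.
text_sizes = {
    "hidden_size": (2816, 2048),
    "intermediate_size": (2112, 1536),
    "moe_intermediate_size": (704, 512),
}
vision_sizes = {
    "hidden_size": (1152, 1024),
    "intermediate_size": (4304, 3840),
}


def ratios(table):
    """Return tiny/original for each parameter in the table."""
    return {name: tiny / original for name, (original, tiny) in table.items()}


for label, table in (("text", text_sizes), ("vision", vision_sizes)):
    for name, ratio in ratios(table).items():
        print(f"{label}.{name}: kept {ratio:.1%} (reduced by {1 - ratio:.1%})")
```

The three text sizes share the same ratio (8/11), while the vision intermediate_size ratio is close to, but not exactly equal to, the vision hidden_size ratio.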
Checkpoint Structure
The model is saved as a single safetensors file (model.safetensors), which is appropriate for its size. The original model uses a sharded checkpoint with 2 files.
The tensor structure maintains full compatibility with the original Gemma4 architecture:
- Language model with MoE layers (experts, router, etc.)
- Vision tower with encoder layers
- Vision embedding projection
- Shared embedding tokens
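One way to inspect this structure without loading any weights is to read the safetensors header, which is a JSON object preceded by an 8-byte little-endian length. The sketch below uses only the standard library. The function names and the choice to group tensor names by their first two dotted components are illustrative, and the actual tensor names in this checkpoint are not listed in this card.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))


def params_by_prefix(header, depth=2):
    """Sum element counts per tensor-name prefix (first `depth` dotted parts)."""
    counts = Counter()
    for name, info in header.items():
        if name == "__metadata__":  # optional metadata entry, not a tensor
            continue
        num_elements = 1
        for dim in info["shape"]:
            num_elements *= dim
        counts[".".join(name.split(".")[:depth])] += num_elements
    return counts


def summarize(path, depth=2):
    """Print parameter counts per prefix, largest first."""
    header = read_safetensors_header(path)
    for prefix, n in params_by_prefix(header, depth).most_common():
        print(f"{prefix}: {n / 1e6:.1f}M")
```

Calling summarize("model.safetensors") on a downloaded copy would print one line per component group, which makes it easy to see how the parameters split between the language model and the vision tower.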
Usage
```python
from transformers import Gemma4ForConditionalGeneration, AutoTokenizer
import torch

model = Gemma4ForConditionalGeneration.from_pretrained(
    "inference-optimization/gemma-4-1.3B-0.3B-tiny",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("inference-optimization/gemma-4-1.3B-0.3B-tiny")

# Text-only generation
inputs = tokenizer("According to all known laws of aviation, ", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Validation
The model was validated with the following results:
```text
Testing generation:
==================================================
According
==================================================
Testing perplexity:
Perplexity: 4.1527
✓ Success: 4.1527 <= 10.0
Model info:
Total parameters: 1.28B
Text layers: 6
Vision layers: 12
```
The model was fine-tuned on a small copypasta dataset and achieved a perplexity of 1.92 on the training data, demonstrating that it can successfully learn patterns.
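The validation script itself is not shown here. As a reminder of what the number means, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below implements that standard definition together with a threshold check like the `<= 10.0` line in the log. The function names and the example losses are made up for illustration.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def passes_threshold(ppl, threshold=10.0):
    """Mirror the pass/fail check shown in the validation log."""
    return ppl <= threshold


# Made-up per-token losses, for illustration only.
example_nlls = [1.2, 1.5, 1.6, 1.4]
ppl = perplexity(example_nlls)
print(f"Perplexity: {ppl:.4f}, passes: {passes_threshold(ppl)}")
```

Under this definition, a lower value on the training data (1.92) than in the validation run (4.15) simply means the model assigns higher average probability to the text it was fine-tuned on.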
Creation Process
This model was created using the llm-compressor create-tiny-model skill:
- Config Inspection: Analyzed the original model's configuration to understand layer types, attention patterns, and MoE structure
- Model Creation: Created a tiny version maintaining the same architectural patterns (sliding/full attention mix, MoE structure)
- Fine-tuning: Fine-tuned on copypasta dataset to validate learning capability (reached perplexity of 1.92)
- Validation: Confirmed the model loads correctly and can generate text (perplexity: 4.15)
Notes
- This is a vision-language model (Gemma4ForConditionalGeneration) with both text and vision components
- The model uses Mixture of Experts (MoE) architecture with 16 experts and top-4 routing
- The text layers follow a repeating pattern of 5 sliding window attention layers and 1 full attention layer; the tiny model keeps exactly one such cycle (6 layers)
- The model was initialized with random weights and fine-tuned on a toy dataset, so it should not be used in production. It is intended for testing compression algorithms, benchmarking, and development
- For vision tasks, additional setup may be required (processor, image inputs, etc.)
Model provider
- inference-optimization
Model tree
- Base: google/gemma-4-26B-A4B
- Fine-tuned: this model
Modalities
- Input: Text, Image
- Output: Text