Model Details
- Base Model: meta-llama/Llama-4-Scout-17B-16E-Instruct
- Architecture: llama4 (multimodal vision-language with MoE)
- Total Parameters: 1.72B
- Activated Parameters: ~0.43B (1 expert activated per token out of 4)
Configuration Changes
The following parameters were reduced from the original model:
| Parameter | Original | Tiny |
|---|---|---|
| Text Model | | |
| num_hidden_layers | 48 | 8 |
| num_local_experts | 16 | 4 |
| num_experts_per_tok | 1 | 1 |
| hidden_size | 5120 | 2048 |
| intermediate_size | 8192 | 3072 |
| intermediate_size_mlp | 16384 | 6144 |
| num_attention_heads | 40 | 16 |
| num_key_value_heads | 8 | 4 |
| layer_types | 48 layers (chunked/full pattern) | 8 layers (maintains 3:1 pattern) |
| Vision Model | | |
| num_hidden_layers | 34 | 6 |
| hidden_size | 1408 | 768 |
| intermediate_size | 5632 | 3072 |
| num_attention_heads | 16 | 12 |
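
As a rough sanity check on the reduced dimensions, the sketch below estimates the text decoder's parameter count from the Tiny column. It assumes several details the table does not state: a head dimension of hidden_size / num_attention_heads, gated feed-forward blocks with three weight matrices, every layer being an MoE layer with one shared expert, a vocabulary of 202,048 tokens, and untied input and output embeddings. Norm weights and the vision tower are left out, so this is an approximation rather than an exact count.

```python
# Back-of-the-envelope parameter estimate for the tiny text decoder.
# Values marked "assumed" are not given in the configuration table.
HIDDEN_SIZE = 2048
NUM_LAYERS = 8
NUM_EXPERTS = 4
EXPERTS_PER_TOKEN = 1
INTERMEDIATE_SIZE = 3072              # per-expert feed-forward width
NUM_HEADS = 16
NUM_KV_HEADS = 4
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS   # assumed: 128
VOCAB_SIZE = 202_048                  # assumed vocabulary size


def attention_params() -> int:
    """Q, K, V and output projections of one grouped-query attention block."""
    q = HIDDEN_SIZE * NUM_HEADS * HEAD_DIM
    kv = 2 * HIDDEN_SIZE * NUM_KV_HEADS * HEAD_DIM
    out = NUM_HEADS * HEAD_DIM * HIDDEN_SIZE
    return q + kv + out


def expert_params() -> int:
    """One gated feed-forward block (gate, up, down projections); assumed layout."""
    return 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE


def layer_params(routed_experts: int) -> int:
    """One decoder layer: attention, router, routed experts, one shared expert."""
    router = HIDDEN_SIZE * NUM_EXPERTS
    return attention_params() + (routed_experts + 1) * expert_params() + router


def decoder_params(routed_experts: int) -> int:
    """All decoder layers, counting `routed_experts` routed experts per layer."""
    return NUM_LAYERS * layer_params(routed_experts)


def embedding_params() -> int:
    """Input embedding plus a separate output head (assumed untied)."""
    return 2 * VOCAB_SIZE * HIDDEN_SIZE


if __name__ == "__main__":
    stored = decoder_params(NUM_EXPERTS)
    active = decoder_params(EXPERTS_PER_TOKEN)
    print(f"decoder layers, all experts: {stored / 1e9:.2f}B")
    print(f"decoder layers, per token:   {active / 1e9:.2f}B")
    print(f"embeddings + output head:    {embedding_params() / 1e9:.2f}B")
```

Under these assumptions the decoder layers plus embeddings come to roughly 1.67B parameters and the per-token decoder path to roughly 0.39B. The remaining gap to the figures under Model Details would have to come from the parts not modeled here, such as the vision tower.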
Architecture Preservation
The tiny model maintains the original Llama-4-Scout architecture patterns:
- MoE Structure: Retained mixture-of-experts with shared expert
- Attention Pattern: Maintains the chunked_attention/full_attention pattern (every 4th layer is full_attention)
- No-RoPE Layers: Preserved the pattern where 3 out of every 4 layers apply RoPE and every 4th layer (the full_attention one) uses no rotary position encoding
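
The attention pattern above can be illustrated with a short sketch that labels each layer. The helper name `layer_types` and the choice to count layers from 1 (so that indices 3 and 7 are the full-attention layers in an 8-layer model) are assumptions for illustration, not taken from the model's code.

```python
def layer_types(num_layers: int, full_every: int = 4) -> list[str]:
    """Label each layer, making every `full_every`-th one full attention."""
    return [
        "full_attention" if (i + 1) % full_every == 0 else "chunked_attention"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    # Print the pattern for the 8-layer tiny model.
    for index, kind in enumerate(layer_types(8)):
        print(index, kind)
```

The same helper applied to 48 layers keeps the 3:1 ratio, which is the property the tiny config is meant to preserve.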
Checkpoint Structure
The model is saved as a single safetensors file following the original checkpoint structure:
language_model.model.layers.{X}.feed_forward.experts.*
language_model.model.layers.{X}.feed_forward.shared_expert.*
vision_model.model.layers.{X}.*
This structure is compatible with transformers' Llama4ForConditionalGeneration.
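
One way to check this layout is to list the tensor names in the file and count the distinct layer indices per component. The following is a minimal sketch: the helper names are hypothetical, and the file name `model.safetensors` in the usage note is an assumption about how the checkpoint is stored.

```python
import re

# Matches names such as "language_model.model.layers.3.feed_forward.experts.*".
LAYER_KEY = re.compile(r"^(language_model|vision_model)\.model\.layers\.(\d+)\.")


def summarize_layers(keys):
    """Return {component: number of distinct layer indices} for tensor names."""
    seen = {}
    for key in keys:
        match = LAYER_KEY.match(key)
        if match:
            seen.setdefault(match.group(1), set()).add(int(match.group(2)))
    return {name: len(indices) for name, indices in seen.items()}


def read_keys(path):
    """List tensor names in a safetensors file without loading the tensors."""
    # Imported lazily so summarize_layers works without safetensors installed.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as handle:
        return list(handle.keys())
```

For example, `summarize_layers(read_keys("model.safetensors"))` should report 8 language-model layers and 6 vision layers if the names follow the pattern above.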
Usage
from transformers import AutoProcessor, Llama4ForConditionalGeneration
model_id = "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct"
# Load the tiny checkpoint and its processor (tokenizer plus image processor).
model = Llama4ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)
# Text-only smoke test: tokenize a prompt, generate a few tokens, and decode them.
text = "Hello, world!"
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
# The weights are random, so the decoded text is expected to be nonsensical.
print(processor.tokenizer.decode(outputs[0]))
Creation Process
This model was created using the llm-compressor create-tiny-model skill:
- Config Modification: Reduced layers, experts, and hidden dimensions while preserving architectural patterns
- Weight Initialization: Randomly initialized weights using the model's init_weights() method
- Fine-tuning Attempt: Tried text-only fine-tuning on a small corpus. This proved ineffective because of the multimodal architecture, so the model should be treated as untrained; the model structure itself is valid
- Validation: Verified model loads correctly and can perform inference
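
The validation step can be complemented by checking the saved config against the Tiny column of the configuration table. The sketch below is not the skill's own validation logic. It assumes the config is available as a nested dictionary with `text_config` and `vision_config` sections; the helper name `config_mismatches` is hypothetical.

```python
# Expected values copied from the "Tiny" column of the configuration table.
EXPECTED = {
    "text_config": {
        "num_hidden_layers": 8,
        "num_local_experts": 4,
        "num_experts_per_tok": 1,
        "hidden_size": 2048,
        "intermediate_size": 3072,
        "intermediate_size_mlp": 6144,
        "num_attention_heads": 16,
        "num_key_value_heads": 4,
    },
    "vision_config": {
        "num_hidden_layers": 6,
        "hidden_size": 768,
        "intermediate_size": 3072,
        "num_attention_heads": 12,
    },
}


def config_mismatches(config: dict) -> list[str]:
    """Return one message per field that differs from EXPECTED."""
    problems = []
    for section, fields in EXPECTED.items():
        actual = config.get(section, {})
        for name, want in fields.items():
            got = actual.get(name)
            if got != want:
                problems.append(f"{section}.{name}: expected {want}, got {got}")
    return problems
```

The dictionary could come from the checkpoint's `config.json` loaded with `json.load`, or from `AutoConfig.from_pretrained(...).to_dict()`; an empty result means every listed field matches the table.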
Notes
Important: This is a tiny model with randomly initialized weights intended for testing and development purposes only. It is not trained and will not produce meaningful outputs. The vision tower is completely untrained.
Use Cases
- Testing model loading and inference pipelines
- Validating quantization and compression workflows
- Debugging multimodal model handling
- CI/CD pipeline testing with realistic model sizes
- Memory profiling and optimization experiments
Limitations
- Randomly initialized weights (not trained)
- Will generate nonsensical outputs
- Vision capabilities are non-functional
- Not suitable for any production use or evaluation benchmarks
Technical Warnings
When loading this model, you may see the warning:
[transformers] `rope_parameters`'s high_freq_factor field must be greater than low_freq_factor
This is a known issue with the Llama-4 config and can be safely ignored.