inference-optimization

Llama-4-Scout-1.7B-0.4B-Instruct

Model Details

Base Model: meta-llama/Llama-4-Scout-17B-16E-Instruct
Architecture: llama4 (multimodal vision-language with MoE)
Total Parameters: 1.72B
Activated Parameters: ~0.43B (1 expert activated per token out of 4)
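As a rough cross-check on these figures, the sketch below counts only the weight matrices in the text decoder layers, using the tiny-model values from the Configuration Changes table. It rests on assumptions not stated in this card: head_dim equals hidden_size / num_attention_heads, each expert is a gated MLP with three projection matrices, the shared expert is the same size as a routed expert, and every layer is an MoE layer. Embeddings, norms, and the vision tower are left out, so the results should land below the totals quoted above.

```python
# Back-of-the-envelope weight count for the tiny text decoder layers only.
# All structural assumptions are listed in the paragraph above.
HIDDEN = 2048          # hidden_size (tiny)
INTERMEDIATE = 3072    # intermediate_size (tiny)
NUM_LAYERS = 8         # num_hidden_layers (tiny)
NUM_HEADS = 16         # num_attention_heads (tiny)
NUM_KV_HEADS = 4       # num_key_value_heads (tiny)
NUM_EXPERTS = 4        # num_local_experts (tiny)
EXPERTS_PER_TOK = 1    # num_experts_per_tok


def attention_params() -> int:
    head_dim = HIDDEN // NUM_HEADS  # assumption: not set independently
    q_proj = HIDDEN * NUM_HEADS * head_dim
    kv_proj = 2 * HIDDEN * NUM_KV_HEADS * head_dim
    o_proj = NUM_HEADS * head_dim * HIDDEN
    return q_proj + kv_proj + o_proj


def expert_params() -> int:
    # Gated MLP: gate, up, and down projections.
    return 3 * HIDDEN * INTERMEDIATE


def decoder_layer_params(routed_experts: int) -> int:
    router = HIDDEN * NUM_EXPERTS
    # "+ 1" accounts for the always-on shared expert.
    return attention_params() + (routed_experts + 1) * expert_params() + router


total_text_layers = NUM_LAYERS * decoder_layer_params(NUM_EXPERTS)
active_text_layers = NUM_LAYERS * decoder_layer_params(EXPERTS_PER_TOK)

print(f"text decoder layers, all experts: {total_text_layers / 1e9:.3f}B")
print(f"text decoder layers, per token:   {active_text_layers / 1e9:.3f}B")
```

The gap between the two numbers comes entirely from the three routed experts per layer that a given token does not visit.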

Configuration Changes

The following parameters were reduced from the original model (num_experts_per_tok is unchanged and listed for reference):

Text Model

| Parameter | Original | Tiny |
| --- | --- | --- |
| num_hidden_layers | 48 | 8 |
| num_local_experts | 16 | 4 |
| num_experts_per_tok | 1 | 1 |
| hidden_size | 5120 | 2048 |
| intermediate_size | 8192 | 3072 |
| intermediate_size_mlp | 16384 | 6144 |
| num_attention_heads | 40 | 16 |
| num_key_value_heads | 8 | 4 |
| layer_types | 48 layers (chunked/full pattern) | 8 layers (maintains 3:1 pattern) |

Vision Model

| Parameter | Original | Tiny |
| --- | --- | --- |
| num_hidden_layers | 34 | 6 |
| hidden_size | 1408 | 768 |
| intermediate_size | 5632 | 3072 |
| num_attention_heads | 16 | 12 |
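One quick sanity check on a shrunken config is that the hidden size still divides evenly across the attention heads and that the query heads still group evenly over the key/value heads. The sketch below derives both from the values in the table. The `derived_shapes` helper and the dictionary layout are illustrative, and deriving the head dimension from hidden_size / num_attention_heads is an assumption (a config may set it explicitly).

```python
# Values copied from the Configuration Changes table.
CONFIGS = {
    "text/original": {"hidden_size": 5120, "num_attention_heads": 40, "num_key_value_heads": 8},
    "text/tiny": {"hidden_size": 2048, "num_attention_heads": 16, "num_key_value_heads": 4},
    "vision/original": {"hidden_size": 1408, "num_attention_heads": 16},
    "vision/tiny": {"hidden_size": 768, "num_attention_heads": 12},
}


def derived_shapes(cfg: dict) -> dict:
    """Return the implied head size and, if applicable, the GQA group size."""
    heads = cfg["num_attention_heads"]
    if cfg["hidden_size"] % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    shapes = {"head_dim": cfg["hidden_size"] // heads}

    kv_heads = cfg.get("num_key_value_heads")
    if kv_heads is not None:
        if heads % kv_heads != 0:
            raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
        shapes["queries_per_kv_head"] = heads // kv_heads
    return shapes


for name, cfg in CONFIGS.items():
    print(name, derived_shapes(cfg))
```

Under that assumption the text model keeps the same implied head size after shrinking, while the number of query heads sharing each key/value head drops from 5 to 4.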

Architecture Preservation

The tiny model maintains the original Llama-4-Scout architecture patterns:

MoE Structure: Retained mixture-of-experts with shared expert
Attention Pattern: Maintains the chunked_attention/full_attention pattern (every 4th layer is full_attention)
No-RoPE Layers: Preserved the pattern where 3 out of every 4 layers apply RoPE and every 4th layer (the full_attention layer) uses no positional encoding
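The attention pattern described above can be written out as a per-layer list. This is a minimal sketch: `build_layer_types` is a hypothetical helper, not a function from transformers or llm-compressor, and it only encodes the "every 4th layer is full_attention" rule stated in this section.

```python
def build_layer_types(num_layers: int, full_every: int = 4) -> list:
    """Every `full_every`-th layer (1-indexed) is full attention; the rest are chunked."""
    return [
        "full_attention" if (i + 1) % full_every == 0 else "chunked_attention"
        for i in range(num_layers)
    ]


tiny_layer_types = build_layer_types(8)
original_layer_types = build_layer_types(48)

for idx, kind in enumerate(tiny_layer_types):
    print(idx, kind)
```

Because 8 and 48 are both multiples of 4, the tiny model ends on a full_attention layer just as the original does, which is what keeps the 3:1 ratio intact.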

Checkpoint Structure

The model is saved as a single safetensors file following the original checkpoint structure:

language_model.model.layers.{X}.feed_forward.experts.*
language_model.model.layers.{X}.feed_forward.shared_expert.*
vision_model.model.layers.{X}.*

This structure is compatible with transformers' Llama4ForConditionalGeneration.
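When checking a checkpoint against this layout, it can help to bucket tensor names by the three prefixes above. The sketch below works on plain strings, so it runs without the checkpoint. The sample key suffixes are made up for illustration; the real names could be read from the safetensors file (for example via `safe_open(...).keys()` from the safetensors package).

```python
import re
from collections import Counter

# One pattern per prefix listed in the checkpoint structure; {X} is the layer index.
PATTERNS = {
    "routed_experts": re.compile(r"^language_model\.model\.layers\.(\d+)\.feed_forward\.experts\."),
    "shared_expert": re.compile(r"^language_model\.model\.layers\.(\d+)\.feed_forward\.shared_expert\."),
    "vision_layer": re.compile(r"^vision_model\.model\.layers\.(\d+)\."),
}


def classify(key: str):
    """Return (group, layer_index), or ("other", None) if no prefix matches."""
    for group, pattern in PATTERNS.items():
        match = pattern.match(key)
        if match:
            return group, int(match.group(1))
    return "other", None


def summarize(keys) -> Counter:
    """Count how many tensor names fall into each group."""
    return Counter(classify(key)[0] for key in keys)


# Hypothetical key names, for illustration only.
sample_keys = [
    "language_model.model.layers.0.feed_forward.experts.example_weight",
    "language_model.model.layers.0.feed_forward.shared_expert.example_weight",
    "language_model.model.layers.7.feed_forward.experts.example_weight",
    "vision_model.model.layers.5.example_weight",
    "language_model.model.embed_tokens.weight",
]
print(summarize(sample_keys))
```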

Usage

```python
from transformers import Llama4ForConditionalGeneration, AutoProcessor

MODEL_ID = "inference-optimization/Llama-4-Scout-1.7B-0.4B-Instruct"

# device_map="auto" places the weights on the available device(s); it requires accelerate.
model = Llama4ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Text-only input: call the processor's tokenizer directly
text = "Hello, world!"
inputs = processor.tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
# The weights are random, so the decoded continuation will not be meaningful
print(processor.tokenizer.decode(outputs[0]))
```

Creation Process

This model was created using the llm-compressor create-tiny-model skill:

Config Modification: Reduced layers, experts, and hidden dimensions while preserving architectural patterns
Weight Initialization: Randomly initialized weights using the model's init_weights() method
Fine-tuning Attempt: Attempted text-only fine-tuning on a small corpus (note: the multimodal architecture made standard text-only fine-tuning ineffective, but the model structure is valid)
Validation: Verified model loads correctly and can perform inference
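The config modification step can be pictured as applying a set of overrides to the base config while leaving everything else alone. The sketch below is not the create-tiny-model skill's implementation, which this card does not show. It uses plain dictionaries, a hypothetical `shrink_config` helper, and a stand-in base config that holds only the original values from the Configuration Changes table; the nested `text_config` / `vision_config` layout is an assumption about the config file.

```python
import copy

# Overrides taken from the "Tiny" column of the Configuration Changes table.
TINY_OVERRIDES = {
    "text_config": {
        "num_hidden_layers": 8, "num_local_experts": 4, "hidden_size": 2048,
        "intermediate_size": 3072, "intermediate_size_mlp": 6144,
        "num_attention_heads": 16, "num_key_value_heads": 4,
    },
    "vision_config": {
        "num_hidden_layers": 6, "hidden_size": 768,
        "intermediate_size": 3072, "num_attention_heads": 12,
    },
}


def shrink_config(base: dict, overrides: dict) -> dict:
    """Return a reduced copy of `base`; the input dictionary is left untouched."""
    tiny = copy.deepcopy(base)
    for section, values in overrides.items():
        tiny.setdefault(section, {}).update(values)

    # Per-layer lists must match the new depth. Truncating keeps the repeating
    # pattern intact as long as the new depth is a multiple of the pattern length.
    text = tiny["text_config"]
    if "layer_types" in text:
        text["layer_types"] = text["layer_types"][: text["num_hidden_layers"]]
    return tiny


# Stand-in for the base config: only fields mentioned in this card.
base_config = {
    "text_config": {
        "num_hidden_layers": 48, "num_local_experts": 16, "num_experts_per_tok": 1,
        "hidden_size": 5120, "intermediate_size": 8192, "intermediate_size_mlp": 16384,
        "num_attention_heads": 40, "num_key_value_heads": 8,
        "layer_types": (["chunked_attention"] * 3 + ["full_attention"]) * 12,
    },
    "vision_config": {
        "num_hidden_layers": 34, "hidden_size": 1408,
        "intermediate_size": 5632, "num_attention_heads": 16,
    },
}

tiny_config = shrink_config(base_config, TINY_OVERRIDES)
print(tiny_config["text_config"]["layer_types"])
```

Fields that are not overridden, such as num_experts_per_tok, carry over unchanged, which is how the routing behaviour stays the same while the dimensions shrink.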

Notes

Important: This is a tiny model with randomly initialized weights intended for testing and development purposes only. It is not trained and will not produce meaningful outputs. The vision tower is completely untrained.

Use Cases

Testing model loading and inference pipelines
Validating quantization and compression workflows
Debugging multimodal model handling
CI/CD pipeline testing with realistic model sizes
Memory profiling and optimization experiments
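For memory profiling, a useful baseline is the space the weights alone should occupy at a given precision. The sketch below multiplies the 1.72B total parameter count from Model Details by the bytes per parameter. It ignores activations, the KV cache, and framework overhead, and the `weight_memory_gib` helper is illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

TOTAL_PARAMS = 1.72e9  # "Total Parameters" from Model Details


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Weights-only footprint in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(TOTAL_PARAMS, dtype):.2f} GiB")
```

Measured usage that sits far above these figures usually points at something other than the weights, which is the kind of signal a profiling run on a small model is meant to surface.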

Limitations

Randomly initialized weights (not trained)
Will generate nonsensical outputs
Vision capabilities are non-functional
Not suitable for any production use or evaluation benchmarks

Technical Warnings

When loading this model, you may see the warning:

```
[transformers] `rope_parameters`'s high_freq_factor field must be greater than low_freq_factor
```

This is a known issue with the Llama-4 config and can be safely ignored.
