README

License: apache-2.0

At a glance

| Property | Value |
| --- | --- |
| Base model | PrimeIntellect/INTELLECT-3 |
| Format | BF16 |
| Total params | 57B |
| Active / token | |
| Experts / layer | 64 |
| Layers | 46 |
| Hidden size | 4096 |
| Context | 131,072 |
| On-disk size | 114 GB |

A 50% expert-pruned version of PrimeIntellect/INTELLECT-3, produced with Cerebras REAP (Router-weighted Expert Activation Pruning).
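As a rough illustration of what router-weighted pruning means, the sketch below scores each expert by the average of its router gate value times the norm of its output, taken over the tokens routed to it, and keeps the highest-scoring half. This follows the general idea in the cited REAP paper as understood here; it is not the Cerebras implementation, the function names `reap_saliency` and `experts_to_keep` are hypothetical, and the toy numbers are made up.

```python
def reap_saliency(gates, out_norms):
    """Per-expert saliency: mean of gate * ||expert output|| over routed tokens.

    gates[t][j]     router gate value for token t and expert j (0.0 if not routed)
    out_norms[t][j] L2 norm of expert j's output for token t
    """
    n_experts = len(gates[0])
    saliency = []
    for j in range(n_experts):
        routed = [t for t in range(len(gates)) if gates[t][j] > 0.0]
        total = sum(gates[t][j] * out_norms[t][j] for t in routed)
        # An expert that never fires gets saliency 0 and is pruned first.
        saliency.append(total / len(routed) if routed else 0.0)
    return saliency


def experts_to_keep(saliency, compression_ratio):
    """Indices of the highest-saliency experts that survive pruning, ascending."""
    n_keep = round(len(saliency) * (1.0 - compression_ratio))
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:n_keep])


# Toy layer: 4 tokens, 4 experts, top-2 routing (values are made up).
gates = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.9, 0.0, 0.0, 0.1],
]
out_norms = [[1.0, 1.0, 2.0, 1.0] for _ in gates]

saliency = reap_saliency(gates, out_norms)
kept = experts_to_keep(saliency, compression_ratio=0.50)
print(kept)  # [0, 2]
```

Expert 2 is routed to only once but produces a large output, so it outranks the more frequently used expert 1. Weighting by both gate value and output norm is what separates this kind of score from a plain usage count.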

Model Details

| Property | Value |
| --- | --- |
| Base Model | PrimeIntellect/INTELLECT-3 (106B MoE) |
| Architecture | GLM-4 MoE (glm4_moe) |
| Compression | 50% (64 experts pruned per layer) |
| Remaining Experts | 64 per layer |
| Parameters | ~57B |
| Format | BF16 SafeTensors |
| Size | 107 GB |
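The card quotes two sizes, 114 GB on disk and 107 GB. A quick back-of-the-envelope check, assuming 2 bytes per BF16 parameter and the rounded 57B total, suggests the two may simply be decimal and binary units for roughly the same checkpoint. That reading is an assumption, not something the card states.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each weight in 16 bits


def weight_size(total_params):
    """Return (decimal GB, binary GiB) for a BF16 checkpoint of this many params."""
    n_bytes = total_params * BYTES_PER_BF16_PARAM
    return n_bytes / 1e9, n_bytes / 2**30


gb, gib = weight_size(57e9)
print(f"{gb:.0f} GB, {gib:.0f} GiB")  # 114 GB, 106 GiB
```

Because 57B is itself rounded, a slightly larger true parameter count would land near 107 GiB. Either way, plan for a little over 110 GB of accelerator memory for the weights alone, before KV cache and activations.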

REAP Configuration

yaml

dataset: 0xSero/glm47-reap-calibration-v2
samples: 1360  # 700 + 330 + 330, drawn from:
#   evol-codealpaca-v1: 700 (code generation)
#   xlam-function-calling-60k: 330 (function calling)
#   SWE-smith-trajectories: 330 (agentic multi-turn)
distance_measure: angular
seed: 42
model_max_length: 2048
compression_ratio: 0.50
prune_method: reap
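The numbers in this configuration can be cross-checked against the rest of the card. The sketch below copies the values by hand (the variable names are illustrative, not keys read by any tool) and derives the share of each calibration source and the pre-pruning expert count implied by the compression ratio.

```python
# Values copied from the REAP configuration and the Model Details table.
calibration_mix = {
    "evol-codealpaca-v1": 700,
    "xlam-function-calling-60k": 330,
    "SWE-smith-trajectories": 330,
}
samples = 1360
compression_ratio = 0.50
experts_remaining = 64

# The per-source counts should add up to the total sample count.
assert sum(calibration_mix.values()) == samples

# Share of each source in the calibration set.
shares = {name: count / samples for name, count in calibration_mix.items()}

# Experts per layer before pruning, implied by the ratio and the remaining count.
experts_before = round(experts_remaining / (1 - compression_ratio))

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"experts per layer before pruning: {experts_before}")
```

Roughly half of the calibration samples are code generation, so expert saliency is measured mostly on coding and agentic traffic. Experts that matter mainly for other domains are more likely to be among those pruned.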

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "0xSero/INTELLECT-3-57B"

# Load the BF16 checkpoint and spread its layers across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
# Apply the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# The returned sequence contains the prompt followed by the new tokens.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
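For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the usage snippet prints the prompt together with the answer. A minimal sketch of keeping only the completion follows; `completion_ids` is a hypothetical helper, not part of transformers.

```python
def completion_ids(output_ids, prompt_len):
    """Drop the first prompt_len token ids, leaving only newly generated ones."""
    return output_ids[prompt_len:]


# With the objects from the usage snippet this would be called as (not run here):
#   new_ids = completion_ids(outputs[0], inputs.shape[-1])
#   print(tokenizer.decode(new_ids, skip_special_tokens=True))
```

The helper only slices, so it works on Python lists and on 1-D tensors alike.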

Related Models

| Model | Compression | Format | Size |
| --- | --- | --- | --- |
| INTELLECT-3-REAP-50 | 50% | BF16 | 107 GB |
| INTELLECT-3-REAP-50-W4A16 | 50% | W4A16 GPTQ | ~30 GB (coming soon) |

License & citation

License inherited from the base model.

bibtex

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider: 0xSero
Model tree: base PrimeIntellect/INTELLECT-3, fine-tuned: this model
Modalities: text input, text output
