README

License: MIT

At a glance

Base model: zai-org/GLM-4.7-Flash
Format: BF16
Total params: 30B
Active / token:
Experts / layer: 64
Layers: 47
Hidden size: 2048
Context: 202,752 tokens
On-disk size: 120 GB
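
As a rough sanity check on the "Total params" and "Format" entries above, raw weight storage can be estimated as parameter count times bytes per parameter. The sketch below is illustrative only: the helper name `weight_size_gb` and the `BYTES_PER_PARAM` table are assumptions introduced here, not part of this model's tooling. It is a weights-only estimate that treats "30B" as exactly 30e9 parameters, so it will not necessarily match the listed on-disk size, which depends on how the checkpoint is actually stored.

```python
# Bytes per parameter for a few common storage formats.
# These are general facts about the formats, not values taken from this model card.
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2, "INT8": 1}


def weight_size_gb(num_params: float, fmt: str = "BF16") -> float:
    """Estimate raw weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper for illustration; it ignores tokenizer files,
    metadata, and any non-weight tensors stored alongside the checkpoint.
    """
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


if __name__ == "__main__":
    total_params = 30e9  # assumption: "30B" read as exactly 30e9 parameters
    print(f"BF16 weights: ~{weight_size_gb(total_params, 'BF16'):.0f} GB")
```

The same helper can be called with a different `fmt` key to compare storage formats when planning GPU memory or disk space.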

Which variant should I pick?

| Variant | Format | Link |
| --- | --- | --- |
| GLM-4.7-Flash (this) | BF16 | link |
| GLM-4.7-Flash-DPO | DPO | link |
| GLM-4.7-Flash-SFT | SFT | link |
| GLM-4.7-Flash-Tools | Tools | link |

License & citation

License inherited from the base model.

BibTeX:

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider: 0xSero

Model tree

Base: zai-org/GLM-4.7-Flash
Fine-tuned: this model

Modalities

Input: Text
Output: Text
