Pricing built to scale with your growth
Fast, reliable, and affordable inference at any scale. Get started instantly with self-serve, or contact us for enterprise deployments.
Serverless endpoints
Run the fastest frontier model inference with a simple API call.
Dedicated endpoints
Run dedicated inference with unmatched speed and reliability at scale.
Container
Run inference with full control and performance in your environment.
Serverless API Pricing
Get instant access to the fastest frontier model inference with a simple API call.
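To give a sense of the developer experience, here is a minimal sketch of a serverless chat completion call in Python. The base URL and model identifier are assumptions for illustration only; check the FriendliAI docs for the exact values.

```python
# Minimal sketch of a serverless chat completion call.
# Assumptions: an OpenAI-compatible endpoint at the base URL below and
# a model ID matching the pricing table; verify both against the docs.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],              # your Friendli token
    base_url="https://api.friendli.ai/serverless/v1",  # assumed base URL
)

response = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct",  # hypothetical model ID from the table below
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```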
Text and vision
Pay per token or per second of GPU time, depending on the model
| Model | $ / 1M tokens |
| --- | --- |
| EXAONE-4.0.1-32B | $0.6 input, $1 output |
| Llama-3.1-8B-Instruct | $0.1 |
| Llama-3.3-70B-Instruct | $0.6 |
| Qwen3-235B-A22B-Instruct-2507 | $0.2 input, $0.8 output |

| Model | $ / second |
| --- | --- |
| Mistral-Small-3.1-24B-Instruct-2503 | $0.002 |
| Magistral-Small-2506 | $0.002 |
| Llama-4-Scout-17B-16E-Instruct | $0.002 |
| gemma-3-27b-it | $0.002 |
| Devstral-Small-2505 | $0.002 |
| Qwen3-32B | $0.002 |
| Qwen3-30B-A3B | $0.002 |
| A.X-3.1 | $0.002 |
| HyperCLOVAX-SEED-Think-14B | $0.002 |
| A.X-4.0 | $0.002 |
| Llama-4-Maverick-17B-128E-Instruct | $0.004 |
| DeepSeek-R1-0528 | $0.004 |
| Qwen3-235B-A22B-Thinking-2507 | $0.004 |
| GLM-4.6 | $0.004 |
Discounts for prompt caching are available for enterprise deployments. Contact us to learn more.
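As a worked example of per-token billing, the sketch below estimates the cost of a single request at the Qwen3-235B-A22B-Instruct-2507 rates listed above; the token counts are hypothetical.

```python
# Back-of-the-envelope cost for per-token billing, using the
# Qwen3-235B-A22B-Instruct-2507 rates from the table above:
# $0.2 per 1M input tokens, $0.8 per 1M output tokens.
INPUT_PRICE_PER_M = 0.2   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.8  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 2,000-token prompt with an 800-token completion:
print(f"${request_cost(2_000, 800):.6f}")  # $0.001040
```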
Dedicated Endpoints Pricing
Run dedicated inference with unmatched speed and reliability at scale.
Basic
Get started with:
- Pay-as-you-go
- On-demand GPUs
- Support for custom, fine-tuned, and open-source models
- Automatic traffic-based scaling
- Real-time performance, usage, and log visibility
- Zero-downtime model updates
- Multi-LoRA support
- SOC2 compliance
- Email and in-app chat support
Enterprise
Everything in Basic, plus:
- Reserved GPUs
- Priority access to high-demand GPU types
- Hands-on engineering expertise
- Dedicated Slack support
- VPC and on-prem deployment options
- Enterprise-grade security and compliance
- Custom global region deployment
- 99.99% availability SLAs
- Discounts on monthly reserved GPUs
On-demand deployment
Only pay for the compute you use, billed down to the second, with no extra charges for start-up time.
| GPU Type | $ / hour (billed per second) |
| --- | --- |
| A100 80GB GPU | $2.9 |
| H100 80GB GPU | $3.9 |
| H200 141GB GPU | $4.5 |
| B200 192GB GPU | $8.9 |
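To make per-second billing concrete, here is a small sketch computing the charge for a 90-minute deployment at the H100 hourly rate above; the run time is hypothetical.

```python
# Per-second billing: a deployment that runs for 90 minutes on an
# H100 80GB ($3.9/hour from the table above) is charged for exactly
# 5,400 seconds rather than two full hours.
HOURLY_RATE = 3.9       # USD per hour, H100 80GB
seconds_used = 90 * 60  # 5,400 seconds

cost = HOURLY_RATE * seconds_used / 3600
print(f"${cost:.2f}")  # $5.85 (vs. $7.80 if billed in whole hours)
```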
Results vary by use case, but we often observe 2-3x higher throughput and lower latency on FriendliAI compared to open-source inference engines.
Container Pricing
Run inference with full control and performance in your environment.