Pricing built to scale with your growth
Fast, reliable, and affordable inference at any scale. Get started instantly with self-serve, or contact us for enterprise deployments.
Serverless endpoints
Run the fastest frontier model inference with a simple API call.
Dedicated endpoints
Run dedicated inference with unmatched speed and reliability at scale.
Container
Run inference with full control and performance in your environment.
Serverless API Pricing
Get instant access to the fastest frontier model inference with a simple API call.
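To give a sense of the developer experience, here is a minimal sketch of a serverless chat completion call in Python. The base URL and model identifier are assumptions for illustration only; check the FriendliAI docs for the exact values.

```python
# Minimal sketch of a serverless chat completion call.
# Assumptions: an OpenAI-compatible endpoint at the base URL below and
# a model ID matching the pricing table; verify both against the docs.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],              # your Friendli token
    base_url="https://api.friendli.ai/serverless/v1",  # assumed base URL
)

response = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct",  # hypothetical model ID from the table below
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```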
Text and vision
Pay per token or per second of GPU time, depending on the model
| Model | $ / 1M tokens |
| --- | --- |
| EXAONE-4.0.1-32B | $0.6 input, $1 output |
| Llama-3.1-8B-Instruct | $0.1 |
| Llama-3.3-70B-Instruct | $0.6 |
| Qwen3-235B-A22B-Instruct-2507 | $0.2 input, $0.8 output |

| Model | $ / second |
| --- | --- |
| Mistral-Small-3.1-24B-Instruct-2503 | $0.002 |
| Magistral-Small-2506 | $0.002 |
| Llama-4-Scout-17B-16E-Instruct | $0.002 |
| gemma-3-27b-it | $0.002 |
| Devstral-Small-2505 | $0.002 |
| Qwen3-32B | $0.002 |
| Qwen3-30B-A3B | $0.002 |
| A.X-3.1 | $0.002 |
| HyperCLOVAX-SEED-Think-14B | $0.002 |
| A.X-4.0 | $0.002 |
| Llama-4-Maverick-17B-128E-Instruct | $0.004 |
| DeepSeek-R1-0528 | $0.004 |
| Qwen3-235B-A22B-Thinking-2507 | $0.004 |
| GLM-4.6 | $0.004 |
Discounts for prompt caching are available for enterprise deployments. Contact us to learn more.
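As a worked example of per-token billing, the sketch below estimates the cost of a single request at the Qwen3-235B-A22B-Instruct-2507 rates listed above; the token counts are hypothetical.

```python
# Back-of-the-envelope cost for per-token billing, using the
# Qwen3-235B-A22B-Instruct-2507 rates from the table above:
# $0.2 per 1M input tokens, $0.8 per 1M output tokens.
INPUT_PRICE_PER_M = 0.2   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.8  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the rates above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 2,000-token prompt with an 800-token completion:
print(f"${request_cost(2_000, 800):.6f}")  # $0.001040
```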
Dedicated Endpoints Pricing
Run dedicated inference with unmatched speed and reliability at scale.
Basic
Get started with:
- Pay-as-you-go
- On-demand GPUs
- Support for custom, fine-tuned, and open-source models
- Automatic traffic-based scaling
- Real-time performance, usage, and log visibility
- Zero-downtime model updates
- Multi-LoRA support
- SOC2 compliance
- Email and in-app chat support
Enterprise
Everything in Basic, plus:
- Reserved GPUs
- Priority access to high-demand GPU types
- Hands-on engineering expertise
- Dedicated Slack support
- VPC and on-prem deployment options
- Enterprise-grade security and compliance
- Custom global region deployment
- 99.99% availability SLAs
- Discounts on monthly reserved GPUs
On-demand deployment
Only pay for the compute you use, billed down to the second, with no extra charges for start-up time.
| GPU Type | $ / hour (billed per second) |
| --- | --- |
| A100 80GB GPU | $2.9 |
| H100 80GB GPU | $3.9 |
| H200 141GB GPU | $4.5 |
| B200 192GB GPU | $8.9 |
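To make per-second billing concrete, here is a small sketch computing the charge for a 90-minute deployment at the H100 hourly rate above; the run time is hypothetical.

```python
# Per-second billing: a deployment that runs for 90 minutes on an
# H100 80GB ($3.9/hour from the table above) is charged for exactly
# 5,400 seconds rather than two full hours.
HOURLY_RATE = 3.9       # USD per hour, H100 80GB
seconds_used = 90 * 60  # 5,400 seconds

cost = HOURLY_RATE * seconds_used / 3600
print(f"${cost:.2f}")  # $5.85 (vs. $7.80 if billed in whole hours)
```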
Results vary by use case, but we often observe 2-3x higher throughput and lower latency on FriendliAI compared to open-source inference engines.
Container Pricing
Run inference with full control and performance in your environment.