Dedicated inference at scale

Run inference with unmatched speed and reliability at scale

LG AI Research

“EXAONE models run incredibly fast on FriendliAI’s inference platform, and users are highly satisfied with the performance. With FriendliAI’s support, customers have been able to shorten the time required to test and evaluate EXAONE by several weeks. This has enabled them to integrate EXAONE into their services more quickly, accelerating adoption and driving real business impact.”

Clayton Park, AI Business Team Lead, LG AI Research

Benefits

Production-scale performance and reliability

Dedicated Endpoints let you deploy and run models quickly, reliably, and cost-efficiently at scale.

Maximize inference speed

Unlock low latency and high throughput with an inference stack optimized end to end by our proprietary technology.

Run inference reliably

Ensure 99.99% uptime with our geo-distributed, multi-cloud infrastructure, engineered for reliability at scale.

Scale smarter, spend less

Slash costs with our purpose-built inference stack and scale seamlessly to handle fluctuating traffic.

Deploy the way you need

Serverless

The simplest way to run inference (see the example call after these plans)

  • Start instantly—no configuration needed
  • Use free built-in tools
  • Pay per token or GPU time

On Demand

Dedicated GPU instances

  • Get guaranteed performance
  • Run custom and 500K+ open-source models
  • Pay for GPU time

Enterprise Reserved

Reserved GPU instances with discounts

  • Reserve GPUs for 1+ months
  • Access exclusive features
  • Discounted upfront payment
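
To make the Serverless option concrete, here is a minimal sketch of a chat-completion call, assuming Friendli's OpenAI-compatible API. The base URL and model id below are assumptions; check the current FriendliAI docs for the values valid for your account.

```python
# A minimal sketch of calling a Friendli serverless endpoint.
# Friendli exposes an OpenAI-compatible API, so the standard `openai`
# client works with a swapped base URL. The base URL and model id are
# assumptions; substitute the documented values for your account.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed base URL
    api_key=os.environ["FRIENDLI_TOKEN"],              # personal access token
)

resp = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize per-second GPU billing."}],
)
print(resp.choices[0].message.content)
```

With pay-per-token billing, a call like this is metered on the tokens it consumes; the same client code works against a dedicated endpoint by switching the base URL and model.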

Features

The complete inference solution

Speed, reliability, scaling, deployment, and enterprise support. Everything you need to run inference at scale.

Blazing-fast inference

Deliver unmatched speed and throughput with a stack built on custom kernels, caching, quantization, speculative decoding, and routing.

Always-on reliability

Guarantee uptime through a resilient multi-cloud architecture with automated failover and recovery.

Effortless autoscaling

Scale inference dynamically across GPUs, instantly right-sizing capacity to match demand (see the sketch after this list).

Powerful model tooling

Track performance, usage, and logs in real time, and perform live model updates without disruption.

Simple, optimized deployment

Deploy your models easily, with optimizations like quantization and speculative decoding ready out of the box.

Enterprise-grade support

Get dedicated engineering, compliance, and VPC support in our SOC 2–compliant environment.
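
To make the autoscaling idea concrete, here is a toy right-sizing calculation: pick the smallest replica count that covers observed demand, clamped between a floor and a ceiling. All numbers and names are illustrative assumptions, not FriendliAI's actual scheduler.

```python
import math

# Toy sketch of capacity right-sizing: choose the smallest replica count
# that covers observed demand, clamped to [min_r, max_r]. Illustrative only.
def replicas_needed(req_per_s: float, per_replica_rps: float,
                    min_r: int = 1, max_r: int = 8) -> int:
    """Smallest replica count that covers demand, within configured bounds."""
    return max(min_r, min(max_r, math.ceil(req_per_s / per_replica_rps)))

for demand in (3, 40, 400):  # requests per second at three traffic levels
    print(demand, "->", replicas_needed(demand, per_replica_rps=25))
# 3 -> 1, 40 -> 2, 400 -> 8 (clamped at the configured maximum)
```

The floor keeps a replica warm to avoid cold starts; the ceiling caps spend under peak traffic.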

Read our docs

Access the model you want

Access the world’s largest collection of 503,141 models through seamless Hugging Face integration. From text generation to computer vision, launch any model with a single click.
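
If you prefer code over the one-click flow, a launch request might look roughly like the sketch below. The route, payload fields, and model choice are hypothetical stand-ins to show the general shape, not FriendliAI's documented API; the web console's one-click launch is the documented path.

```python
# Hypothetical sketch of launching a Hugging Face model on a dedicated
# endpoint programmatically. The URL and payload fields are illustrative
# assumptions, not FriendliAI's actual API.
import os
import requests

resp = requests.post(
    "https://api.friendli.ai/dedicated/v1/endpoints",  # hypothetical route
    headers={"Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"},
    json={
        "hf_model_id": "openai/whisper-large-v3",  # any Hugging Face repo id
        "gpu_type": "H100",                        # see the pricing table below
        "num_gpus": 1,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```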

Find your model
  • google/medgemma-1.5-4b-it (LLM)
  • microsoft/OptiMind-SFT (LLM)
  • Qwen/Qwen3-VL-Embedding-8B (Multimodal)
  • Qwen/Qwen3-VL-2B-Instruct (Computer Vision)
  • mistralai/Voxtral-Mini-3B-2507 (Speech & Audio)
  • nanonets/Nanonets-OCR2-3B (Video)
  • distil-whisper/distil-large-v3.5 (Speech & Audio)
  • Svetozar1993/MultilingualSTT (Speech & Audio)
  • openai/whisper-large-v3 (Speech & Audio)
  • openai/whisper-base (Speech & Audio)
  • zai-org/GLM-4.7-Flash (LLM)
  • google/translategemma-27b-it (LLM)
  • numind/NuMarkdown-8B-Thinking (Multimodal)
  • lightonai/LightOnOCR-1B-1025 (Computer Vision)
  • deepseek-ai/DeepSeek-OCR (Computer Vision)
  • Qwen/Qwen2.5-VL-3B-Instruct (Video)
  • distil-whisper/distil-large-v3 (Speech & Audio)
  • Qwen/Qwen3-VL-Reranker-8B (Computer Vision)
  • Qwen/Qwen2.5-VL-7B-Instruct (Video)
  • ByteDance/Dolphin-v2 (Video)
  • nanonets/Nanonets-OCR-s (Video)
  • ekwek/Soprano-1.1-80M (LLM)
  • ByteDance-Seed/UI-TARS-1.5-7B (Video)
  • openbmb/MiniCPM-V-4_5 (Video)
  • hip94/52Hz-small-fr-v2 (Speech & Audio)
  • kotoba-tech/kotoba-whisper-v2.2 (Speech & Audio)
  • IDEA-Research/Rex-Omni (Video)
  • google/translategemma-12b-it (LLM)
  • bytedance-research/UI-TARS-7B-DPO (Multimodal)
  • Qwen/Qwen3-VL-30B-A3B-Instruct (Multimodal)
  • Qwen/Qwen3-VL-Embedding-2B (Multimodal)
  • openbmb/MiniCPM-o-2_6 (Speech & Audio)
  • Qwen/Qwen3-VL-8B-Instruct (Computer Vision)
  • OpenGVLab/VideoChat-R1_5 (Video)
  • Alibaba-Apsara/DASD-4B-Thinking (LLM)
  • zai-org/GLM-4.7 (LLM)
  • google/translategemma-4b-it (LLM)
  • Qwen/Qwen3-VL-Reranker-2B (Multimodal)
  • microsoft/Phi-4-multimodal-instruct (Speech & Audio)
  • lightonai/LightOnOCR-2-1B (Multimodal)
  • nvidia/Orchestrator-8B (LLM)
  • Qwen/Qwen2-Audio-7B-Instruct (Speech & Audio)
  • ibm-granite/granite-docling-258M (Computer Vision)
  • vvangfaye/SocioReasoner-3B (Video)
  • Qwen/Qwen2.5-VL-72B-Instruct (Video)
  • zai-org/GLM-4.6 (LLM)
  • meta-llama/Llama-4-Scout-17B-16E-Instruct (Multimodal)
  • nvidia/Cosmos-Reason1-7B (Video)
  • laion/BUD-E-Whisper (Speech & Audio)
  • ByteDance-Seed/Stable-DiffCoder-8B-Instruct (LLM)
  • openai/whisper-large (Speech & Audio)
  • mistralai/Voxtral-Small-24B-2507 (Speech & Audio)
  • LGAI-EXAONE/K-EXAONE-236B-A23B (LLM)
  • allenai/olmOCR-2-7B-1025-FP8 (Video)
  • openai/whisper-large-v3-turbo (Speech & Audio)
  • openbmb/MiniCPM-V-2_6 (Video)
  • lightonai/LightOnOCR-2-1B-bbox (Computer Vision)

Have a custom or fine-tuned model?

We’ll help you deploy it just as easily. Contact us to deploy your model.

Contact us

Pricing

Pay per GPU second for faster speeds, higher rate limits, and lower costs at scale.

GPU                       VRAM / GPU    $ / hour (billed per second)
On-demand NVIDIA B200     192 GB        $8.90
On-demand NVIDIA H200     141 GB        $4.50
On-demand NVIDIA H100     80 GB         $3.90
On-demand NVIDIA A100     80 GB         $2.90
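
To sanity-check what per-second billing means in practice, here is a small back-of-envelope calculation using the on-demand rates above. The helper function is ours, not part of any SDK.

```python
# Back-of-envelope cost for per-second GPU billing.
# Rates taken from the pricing table above (USD per GPU-hour).
RATES = {"B200": 8.90, "H200": 4.50, "H100": 3.90, "A100": 2.90}

def cost(gpu: str, seconds: float, num_gpus: int = 1) -> float:
    """Cost in USD for running `num_gpus` GPUs for `seconds` seconds."""
    return RATES[gpu] / 3600 * seconds * num_gpus

# A 90-minute evaluation run on a single H100:
print(f"${cost('H100', 90 * 60):.2f}")  # -> $5.85
```

Because billing is per second, a 90-minute run costs exactly 1.5 hours of the hourly rate; you never pay for a partially used hour.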

Enterprise reserved: contact us for pricing.

Explore FriendliAI today