Dedicated inference at scale

Run inference with unmatched speed and reliability at scale

LG AI Research

“EXAONE models run incredibly fast on FriendliAI’s inference platform, and users are highly satisfied with the performance. With FriendliAI’s support, customers have been able to shorten the time required to test and evaluate EXAONE by several weeks. This has enabled them to integrate EXAONE into their services more quickly, accelerating adoption and driving real business impact.”

Clayton Park, AI Business Team Lead, LG AI Research

Benefits

Production-scale performance and reliability

Dedicated Endpoints let you deploy and run models quickly, reliably, and cost-efficiently at scale.

Maximize inference speed

Unlock low latency and high throughput with an inference stack optimized end to end by our proprietary technology.

Run inference reliably

Ensure 99.99% uptime with our geo-distributed, multi-cloud infrastructure, engineered for reliability at scale.

Scale smarter, spend less

Slash costs with our purpose-built inference stack and scale seamlessly to handle fluctuating traffic.

Deploy the way you need

Serverless

The simplest way to run inference (see the example call after these plans)

  • Start instantly—no configuration needed
  • Use free built-in tools
  • Pay per token or GPU time

On Demand

Dedicated GPU instances

  • Get guaranteed performance
  • Run custom and 500K+ open-source models
  • Pay for GPU time

Enterprise Reserved

Reserved GPU instances with discounts

  • Reserve GPUs for 1+ months
  • Access exclusive features
  • Discounted upfront payment
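
To make the Serverless option concrete, here is a minimal sketch of a chat-completion call, assuming Friendli's OpenAI-compatible API. The base URL and model id below are assumptions; check the current FriendliAI docs for the values valid for your account.

```python
# A minimal sketch of calling a Friendli serverless endpoint.
# Friendli exposes an OpenAI-compatible API, so the standard `openai`
# client works with a swapped base URL. The base URL and model id are
# assumptions; substitute the documented values for your account.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed base URL
    api_key=os.environ["FRIENDLI_TOKEN"],              # personal access token
)

resp = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # hypothetical model id
    messages=[{"role": "user", "content": "Summarize per-second GPU billing."}],
)
print(resp.choices[0].message.content)
```

With pay-per-token billing, a call like this is metered on the tokens it consumes; the same client code works against a dedicated endpoint by switching the base URL and model.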

Features

The complete inference solution

Speed, reliability, scaling, deployment, and enterprise support. Everything you need to run inference at scale.

Blazing-fast inference

Deliver unmatched speed and throughput with a stack built on custom kernels, caching, quantization, speculative decoding, and routing.

Always-on reliability

Guarantee uptime through a resilient multi-cloud architecture with automated failover and recovery.

Effortless autoscaling

Scale inference dynamically across GPUs, instantly right-sizing capacity to match demand (see the sketch after this list).

Powerful model tooling

Track performance, usage, and logs in real time, and perform live model updates without disruption.

Simple, optimized deployment

Deploy your models easily, with optimizations like quantization and speculative decoding ready out of the box.

Enterprise-grade support

Get dedicated engineering, compliance, and VPC support in our SOC 2–compliant environment.
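
To make the autoscaling idea concrete, here is a toy right-sizing calculation: pick the smallest replica count that covers observed demand, clamped between a floor and a ceiling. All numbers and names are illustrative assumptions, not FriendliAI's actual scheduler.

```python
import math

# Toy sketch of capacity right-sizing: choose the smallest replica count
# that covers observed demand, clamped to [min_r, max_r]. Illustrative only.
def replicas_needed(req_per_s: float, per_replica_rps: float,
                    min_r: int = 1, max_r: int = 8) -> int:
    """Smallest replica count that covers demand, within configured bounds."""
    return max(min_r, min(max_r, math.ceil(req_per_s / per_replica_rps)))

for demand in (3, 40, 400):  # requests per second at three traffic levels
    print(demand, "->", replicas_needed(demand, per_replica_rps=25))
# 3 -> 1, 40 -> 2, 400 -> 8 (clamped at the configured maximum)
```

The floor keeps a replica warm to avoid cold starts; the ceiling caps spend under peak traffic.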

Read our docs

Access the model you want

Access the world’s largest collection of 503,141 models through seamless Hugging Face integration. From text generation to computer vision, launch any model with a single click.
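
If you prefer code over the one-click flow, a launch request might look roughly like the sketch below. The route, payload fields, and model choice are hypothetical stand-ins to show the general shape, not FriendliAI's documented API; the web console's one-click launch is the documented path.

```python
# Hypothetical sketch of launching a Hugging Face model on a dedicated
# endpoint programmatically. The URL and payload fields are illustrative
# assumptions, not FriendliAI's actual API.
import os
import requests

resp = requests.post(
    "https://api.friendli.ai/dedicated/v1/endpoints",  # hypothetical route
    headers={"Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"},
    json={
        "hf_model_id": "openai/whisper-large-v3",  # any Hugging Face repo id
        "gpu_type": "H100",                        # see the pricing table below
        "num_gpus": 1,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```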

Find your model
  • google/medgemma-1.5-4b-it (LLM)
  • microsoft/OptiMind-SFT (LLM)
  • Qwen/Qwen3-VL-Embedding-8B (Multimodal)
  • Qwen/Qwen3-VL-2B-Instruct (Computer Vision)
  • mistralai/Voxtral-Mini-3B-2507 (Speech & Audio)
  • nanonets/Nanonets-OCR2-3B (Video)
  • distil-whisper/distil-large-v3.5 (Speech & Audio)
  • Svetozar1993/MultilingualSTT (Speech & Audio)
  • openai/whisper-large-v3 (Speech & Audio)
  • openai/whisper-base (Speech & Audio)
  • zai-org/GLM-4.7-Flash (LLM)
  • google/translategemma-27b-it (LLM)
  • numind/NuMarkdown-8B-Thinking (Multimodal)
  • lightonai/LightOnOCR-1B-1025 (Computer Vision)
  • deepseek-ai/DeepSeek-OCR (Computer Vision)
  • Qwen/Qwen2.5-VL-3B-Instruct (Video)
  • distil-whisper/distil-large-v3 (Speech & Audio)
  • Qwen/Qwen3-VL-Reranker-8B (Computer Vision)
  • Qwen/Qwen2.5-VL-7B-Instruct (Video)
  • ByteDance/Dolphin-v2 (Video)
  • nanonets/Nanonets-OCR-s (Video)
  • ekwek/Soprano-1.1-80M (LLM)
  • ByteDance-Seed/UI-TARS-1.5-7B (Video)
  • openbmb/MiniCPM-V-4_5 (Video)
  • hip94/52Hz-small-fr-v2 (Speech & Audio)
  • kotoba-tech/kotoba-whisper-v2.2 (Speech & Audio)
  • IDEA-Research/Rex-Omni (Video)
  • google/translategemma-12b-it (LLM)
  • bytedance-research/UI-TARS-7B-DPO (Multimodal)
  • Qwen/Qwen3-VL-30B-A3B-Instruct (Multimodal)
  • Qwen/Qwen3-VL-Embedding-2B (Multimodal)
  • openbmb/MiniCPM-o-2_6 (Speech & Audio)
  • Qwen/Qwen3-VL-8B-Instruct (Computer Vision)
  • OpenGVLab/VideoChat-R1_5 (Video)
  • Alibaba-Apsara/DASD-4B-Thinking (LLM)
  • zai-org/GLM-4.7 (LLM)
  • google/translategemma-4b-it (LLM)
  • Qwen/Qwen3-VL-Reranker-2B (Multimodal)
  • microsoft/Phi-4-multimodal-instruct (Speech & Audio)
  • lightonai/LightOnOCR-2-1B (Multimodal)
  • nvidia/Orchestrator-8B (LLM)
  • Qwen/Qwen2-Audio-7B-Instruct (Speech & Audio)
  • ibm-granite/granite-docling-258M (Computer Vision)
  • vvangfaye/SocioReasoner-3B (Video)
  • Qwen/Qwen2.5-VL-72B-Instruct (Video)
  • zai-org/GLM-4.6 (LLM)
  • meta-llama/Llama-4-Scout-17B-16E-Instruct (Multimodal)
  • nvidia/Cosmos-Reason1-7B (Video)
  • laion/BUD-E-Whisper (Speech & Audio)
  • ByteDance-Seed/Stable-DiffCoder-8B-Instruct (LLM)
  • openai/whisper-large (Speech & Audio)
  • mistralai/Voxtral-Small-24B-2507 (Speech & Audio)
  • LGAI-EXAONE/K-EXAONE-236B-A23B (LLM)
  • allenai/olmOCR-2-7B-1025-FP8 (Video)
  • openai/whisper-large-v3-turbo (Speech & Audio)
  • openbmb/MiniCPM-V-2_6 (Video)
  • lightonai/LightOnOCR-2-1B-bbox (Computer Vision)

Have a custom or fine-tuned model?

We’ll help you deploy it just as easily. Contact us to deploy your model.

Contact us

Pricing

Pay per GPU second for faster speeds, higher rate limits, and lower costs at scale.

GPU                       VRAM / GPU    $ / hour (billed per second)
On-demand NVIDIA B200     192 GB        $8.90
On-demand NVIDIA H200     141 GB        $4.50
On-demand NVIDIA H100     80 GB         $3.90
On-demand NVIDIA A100     80 GB         $2.90
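
To sanity-check what per-second billing means in practice, here is a small back-of-envelope calculation using the on-demand rates above. The helper function is ours, not part of any SDK.

```python
# Back-of-envelope cost for per-second GPU billing.
# Rates taken from the pricing table above (USD per GPU-hour).
RATES = {"B200": 8.90, "H200": 4.50, "H100": 3.90, "A100": 2.90}

def cost(gpu: str, seconds: float, num_gpus: int = 1) -> float:
    """Cost in USD for running `num_gpus` GPUs for `seconds` seconds."""
    return RATES[gpu] / 3600 * seconds * num_gpus

# A 90-minute evaluation run on a single H100:
print(f"${cost('H100', 90 * 60):.2f}")  # -> $5.85
```

Because billing is per second, a 90-minute run costs exactly 1.5 hours of the hourly rate; you never pay for a partially used hour.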

Enterprise reserved: contact us for pricing.

Explore FriendliAI today