Serverless Endpoints are often the more economical option, offering access to a wide range of models. Depending on the model, you pay either for the tokens processed or for the compute time of your request.

Tier-Based API Rate Limits

Tiers are based on lifetime spending and update automatically: as your usage grows, your tier increases. You can also move up instantly by purchasing additional credits.
| Tier | Qualification | RPM (paid models) | RPM (free models) | Output Token Length | Usage Limits |
| --- | --- | --- | --- | --- | --- |
| Tier 0 | Signed up | Adaptive Rate Limits* | Adaptive Rate Limits* | 4K | Limited to the free credit issued at sign-up |
| Tier 1 | Valid payment method added | 100 | 60 | 16K | $50 / month |
| Tier 2 | Total historical spend of $50+ | 1,000 | 1,000 | 16K | $500 / month |
| Tier 3 | Total historical spend of $500+ | 5,000 | 5,000 | 32K | $5,000 / month |
| Tier 4 | Total historical spend of $5,000+ | 10,000 | 10,000 | 64K | $50,000 / month |
| Tier 5 | Contact [email protected] | Custom | Custom | Custom | Custom |
*Adaptive Rate Limits: Rate limits are applied dynamically based on overall platform conditions.
‘Output Token Length’ is how much the model can write in response. It’s different from ‘Context Length’, which is the sum of the input and output tokens.
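
To stay within your tier’s RPM limit, it helps to back off and retry when a request is rejected for rate limiting. Below is a minimal sketch, assuming an OpenAI-compatible chat completions endpoint at api.friendli.ai, bearer-token auth via a FRIENDLI_TOKEN environment variable, and HTTP 429 as the rate-limit response; none of these specifics are confirmed by this page, so check the API reference.

```python
import os
import time

import requests

# Assumed values for illustration; the exact URL, model code, and auth
# scheme should be verified against the Friendli API reference.
URL = "https://api.friendli.ai/serverless/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"}

def chat(prompt: str, max_retries: int = 5) -> dict:
    """Send a chat request, backing off exponentially on HTTP 429 (rate limited)."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    }
    for attempt in range(max_retries):
        resp = requests.post(URL, headers=HEADERS, json=payload, timeout=60)
        if resp.status_code == 429:   # over your tier's RPM limit
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
            continue
        resp.raise_for_status()       # surface any other HTTP error
        return resp.json()
    raise RuntimeError("still rate limited after retries")
```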

Billing Methods

Friendli Serverless Endpoints use one of two billing methods, Token-Based or Time-Based, depending on the model.

Token-Based Billing

In a token-based billing model, charges are determined by the number of tokens processed, where each “token” represents an individual unit processed by the model.
| Model Code | Price (per 1M tokens) |
| --- | --- |
| MiniMaxAI/MiniMax-M2.1 | Input $0.3 · Output $1.2 |
| zai-org/GLM-4.7 | Input $0.6 · Output $2.2 |
| LGAI-EXAONE/EXAONE-4.0.1-32B | Input $0.6 · Output $1 |
| meta-llama/Llama-3.3-70B-Instruct | $0.6 |
| meta-llama/Llama-3.1-8B-Instruct | $0.1 |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | Input $0.2 · Output $0.8 |
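
As a quick sanity check on the arithmetic, here is a minimal sketch of a token-based cost calculation using the Qwen3-235B prices from the table above; the token counts are made-up example values.

```python
# Per-1M-token prices from the table above, as (input, output) pairs.
PRICES = {
    "Qwen/Qwen3-235B-A22B-Instruct-2507": (0.2, 0.8),
}

def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the charge in dollars for a single request."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# 2,000 input + 500 output tokens: 2,000 * $0.2/1M + 500 * $0.8/1M = $0.0008
print(token_cost("Qwen/Qwen3-235B-A22B-Instruct-2507", 2_000, 500))  # 0.0008
```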

Time-Based Billing

In a time-based billing model, charges are determined by the compute time required to run your inference request, measured in milliseconds. Non-compute latencies, such as network delays or queueing time, are excluded, so you are charged only for the actual model execution time.
A serverless endpoint model is either in a Warm status, where it is ready to handle requests instantly, or in a Cold status, where it is inactive and requires time to start up. When a model in Cold status receives a request, it goes through a “warm-up” process that typically takes 7-30 seconds, depending on the model’s size. Requests are queued during this period, but the warm-up delay is not included in your billable compute time.
| Model Code | Price per Second |
| --- | --- |
| zai-org/GLM-4.6 | $0.004 |
| deepseek-ai/DeepSeek-V3.1 | $0.004 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | $0.004 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | $0.002 |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | $0.004 |
| Qwen/Qwen3-30B-A3B | $0.002 |
| Qwen/Qwen3-32B | $0.002 |
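
For the same kind of sanity check, here is a sketch of the time-based cost arithmetic using the GLM-4.6 price from the table above; the 2,500 ms compute time is a made-up example value.

```python
# Per-second prices from the table above.
PRICE_PER_SECOND = {
    "zai-org/GLM-4.6": 0.004,
    "Qwen/Qwen3-30B-A3B": 0.002,
}

def time_cost(model: str, compute_ms: float) -> float:
    """Return the charge in dollars for `compute_ms` of billable compute time.

    Only model execution time counts: network latency, queueing, and
    cold-start warm-up are excluded, per the policy above.
    """
    return PRICE_PER_SECOND[model] * compute_ms / 1000

# A request that executes for 2,500 ms on GLM-4.6: 2.5 s * $0.004/s = $0.01
print(time_cost("zai-org/GLM-4.6", 2_500))  # 0.01
```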

Free Models

The following models are available free of charge for a limited time.
| Model Code | Free Until |
| --- | --- |
| LGAI-EXAONE/K-EXAONE-236B-A23B | February 12th |

FAQs

Your usage tier, which determines your rate limits, increases automatically based on your payment history. Need a faster upgrade? Reach out anytime at [email protected]; we’re happy to help!
You’ll receive an alert when you approach your monthly cap. To raise it, contact [email protected]; we can help you (1) pay early to reset your monthly cap, or (2) upgrade your plan to increase the cap and unlock more features.
For more questions, contact [email protected].