Rate Limits
Understand the rate limits for Friendli Serverless Endpoints, including Requests per Minute (RPM) and Tokens per Minute (TPM), to ensure efficient usage of resources and balanced performance when interacting with AI models.
When interacting with Friendli Serverless Endpoints, it’s important to be aware of the rate limits imposed on requests. These limits are in place to regulate the number of requests made within a specified timeframe, ensuring a balanced and efficient use of resources. The rate limits are quantified using two metrics:
- RPM (Requests per Minute): This measures the maximum number of requests allowed per minute.
- TPM (Tokens per Minute): TPM represents the maximum estimated tokens processed per minute, providing insight into the computational load.
RPM applies to all types of generation models, while TPM applies only to text generation models. Rate limit information is included in the response headers as follows:
- In all responses:
  - `X-RateLimit-Limit-Requests`
  - `X-RateLimit-Remaining-Requests`
  - `X-RateLimit-Reset-Requests`
- In text generation responses:
  - `X-RateLimit-Limit-Tokens`
  - `X-RateLimit-Remaining-Tokens`
  - `X-RateLimit-Reset-Tokens`
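As a minimal sketch, these headers can be read from any HTTP client's response object. The header names below come from the list above; the `parse_rate_limits` helper and its return shape are our own illustration, and the `Reset` values are kept as raw strings since their exact format may vary.

```python
def parse_rate_limits(headers):
    """Extract rate-limit info from response headers (case-insensitive).

    Token-related fields are None for non-text-generation responses,
    which omit the X-RateLimit-*-Tokens headers.
    """
    lowered = {k.lower(): v for k, v in headers.items()}

    def get_int(name):
        value = lowered.get(name.lower())
        return int(value) if value is not None else None

    return {
        "limit_requests": get_int("X-RateLimit-Limit-Requests"),
        "remaining_requests": get_int("X-RateLimit-Remaining-Requests"),
        "reset_requests": lowered.get("x-ratelimit-reset-requests"),
        "limit_tokens": get_int("X-RateLimit-Limit-Tokens"),
        "remaining_tokens": get_int("X-RateLimit-Remaining-Tokens"),
        "reset_tokens": lowered.get("x-ratelimit-reset-tokens"),
    }
```

A client could call this after each response to decide whether to slow down before the quota is exhausted.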
The specific rate limits depend on your subscription plan; higher-tier plans have fewer restrictions. The following table shows the rate limits for each plan:
| Plan | RPM | TPM |
|---|---|---|
| Trial | 5K | 50K |
| Basic | 10K | 100K |
| Enterprise | No limit | No limit |
The metrics are measured per team across all models.