When calling Friendli Serverless Endpoints, keep in mind that requests are rate limited. Rate limits cap the number of requests accepted within a given time window so that resources are shared fairly and used efficiently. They are expressed using two metrics:

  • RPM (Requests per Minute): The maximum number of requests allowed per minute.
  • TPM (Tokens per Minute): The maximum estimated number of tokens processed per minute, which reflects the computational load of the requests.

RPM applies to all types of generation models, while TPM applies only to text generation models. Rate limit information is returned in the following response headers:

  • In all responses
    • X-RateLimit-Limit-Requests
    • X-RateLimit-Remaining-Requests
    • X-RateLimit-Reset-Requests
  • In text generation responses
    • X-RateLimit-Limit-Tokens
    • X-RateLimit-Remaining-Tokens
    • X-RateLimit-Reset-Tokens
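
As an illustration of how a client might read these headers, here is a minimal Python sketch. It takes any mapping of response headers (for example, the headers object of an HTTP client response) and groups the values by metric. The function name `parse_rate_limit_headers` is hypothetical, and the sketch assumes the limit and remaining values are plain integers. The format of the reset values is not specified here, so they are returned as raw strings.

```python
from typing import Mapping, Optional

# Header names as listed above, grouped by the metric they describe.
RATE_LIMIT_HEADERS = {
    "requests": (
        "X-RateLimit-Limit-Requests",
        "X-RateLimit-Remaining-Requests",
        "X-RateLimit-Reset-Requests",
    ),
    "tokens": (
        "X-RateLimit-Limit-Tokens",
        "X-RateLimit-Remaining-Tokens",
        "X-RateLimit-Reset-Tokens",
    ),
}


def _to_int(value: Optional[str]) -> Optional[int]:
    """Return value as an int, or None if it is missing or not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def parse_rate_limit_headers(headers: Mapping[str, str]) -> dict:
    """Collect rate limit headers into a dict keyed by metric.

    Assumption: limit and remaining values are plain integers.
    The reset value is kept as a raw string because its format
    is not documented in this section.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    result = {}
    for metric, (limit_h, remaining_h, reset_h) in RATE_LIMIT_HEADERS.items():
        if limit_h.lower() not in lowered:
            # The token headers are only present in text generation responses.
            continue
        result[metric] = {
            "limit": _to_int(lowered.get(limit_h.lower())),
            "remaining": _to_int(lowered.get(remaining_h.lower())),
            "reset": lowered.get(reset_h.lower()),
        }
    return result
```

A response without the token headers yields only the `"requests"` entry, which matches the case of a non-text generation model.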

The rate limits that apply depend on the user's subscription plan, with higher-tier plans having higher limits. The following table lists the rate limits for each plan:

| Plan       | RPM      | TPM      |
|------------|----------|----------|
| Trial      | 5K       | 50K      |
| Basic      | 10K      | 100K     |
| Enterprise | No limit | No limit |

Both metrics are measured per team and across all models, so usage by every team member on any model counts toward the same limits.
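
To stay under a plan's limits, a client can track its own usage before sending requests. The following Python sketch keeps a 60-second sliding window of requests and estimated tokens. The class `SlidingWindowBudget` and its methods are hypothetical, and the sliding-window behavior is an assumption: how the service itself resets its counters is not described here. The example also reads "5K" and "50K" as 5,000 and 50,000.

```python
import time
from collections import deque


class SlidingWindowBudget:
    """Client-side tracker for RPM and (optionally) TPM budgets.

    Assumption: usage is counted over a sliding 60-second window.
    The server may count differently, so treat this as a
    conservative local guard, not a mirror of server state.
    """

    def __init__(self, rpm, tpm=None, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm  # None means no token limit is tracked
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, estimated_tokens) per request
        self.tokens_used = 0

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_used -= tokens

    def try_acquire(self, estimated_tokens=0):
        """Record a request if it fits in both budgets; return True on success."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tpm is not None and self.tokens_used + estimated_tokens > self.tpm:
            return False
        self.events.append((now, estimated_tokens))
        self.tokens_used += estimated_tokens
        return True


# Example: a budget matching the Trial plan row of the table above.
trial_budget = SlidingWindowBudget(rpm=5_000, tpm=50_000)
if trial_budget.try_acquire(estimated_tokens=1_200):
    pass  # send the request here
```

Because the limits are shared by the whole team, a tracker like this only sees one client's share of the traffic. When several clients run at once, the rate limit headers in each response remain the authoritative source for the remaining budget.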