What are Endpoints?

Endpoints are the actual deployments of your models on a dedicated GPU resource. They provide a stable and efficient interface to serve your models in real-world applications, ensuring high availability and optimized performance. With endpoints, you can manage model versions, scale resources, and seamlessly integrate your model into production environments.

Key Capabilities of Endpoints:

  • Efficient Model Serving: Deploy models on powerful GPU instances optimized for your use case.
  • Flexibility with Multi-LoRA Models: Serve multiple fine-tuned adapters alongside base models.
  • Autoscaling: Automatically adjust resources to handle varying workloads, ensuring optimal performance and cost efficiency.
  • Monitoring and Management: Check endpoint health, adjust configurations, and view logs directly from the platform.
  • Interactive Testing: Use the integrated playground to test your models before integrating them into applications.
  • API Integration: Access your models via robust OpenAI-compatible APIs, enabling easy integration into any system (see the example after this list).
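
Because the API is OpenAI-compatible, any OpenAI client can talk to your endpoint. The sketch below uses the OpenAI Python SDK; the token, base URL, and model identifier are placeholders to replace with the values shown on your endpoint’s details page.

```python
# A minimal sketch using the OpenAI Python SDK against an OpenAI-compatible endpoint.
# The token, base_url, and model values are placeholders; use the endpoint URL and
# identifier shown on your endpoint's details page.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_FRIENDLI_TOKEN",              # placeholder personal access token
    base_url="https://<your-endpoint-url>/v1",  # placeholder endpoint address
)

response = client.chat.completions.create(
    model="<your-endpoint-or-model-id>",        # placeholder identifier
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```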

Creating Endpoints

You can create an endpoint by specifying its name, the model to deploy, and the instance configuration, i.e., your desired GPU type and count.

Selecting Instance

Instance selection depends on your model size and workload.
If the selected GPU type or count cannot accommodate the model, the instance cannot be selected. If the configuration is likely to keep you from fully leveraging the Friendli Inference Engine, you’ll see a TIGHT MEMORY warning. In that case, we recommend enabling Online Quantization or increasing the GPU count. A rough way to gauge the fit is sketched below.
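
As a back-of-the-envelope illustration (not the platform’s actual sizing logic), you can estimate whether a model’s weights fit in the combined VRAM of the selected GPUs; the parameter count and precisions below are example values.

```python
# Back-of-the-envelope estimate of the VRAM needed for model weights alone
# (excludes KV cache, activations, and runtime overhead). Illustrative only,
# not the platform's actual sizing logic.
def weight_memory_gib(num_params_billion: float, bytes_per_param: float) -> float:
    return num_params_billion * 1e9 * bytes_per_param / 2**30

for label, bytes_per_param in [("FP16/BF16", 2.0), ("8-bit quantized", 1.0)]:
    need = weight_memory_gib(70, bytes_per_param)  # e.g., a 70B-parameter model
    print(f"{label}: ~{need:.0f} GiB of weights")
# FP16/BF16: ~130 GiB -> needs multiple 80 GiB GPUs
# 8-bit quantized: ~65 GiB -> can fit on one 80 GiB GPU, leaving room for the KV cache
```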

Online Quantization

Skip the hassle of preparing a quantized model. When you enable Online Quantization, your model is automatically quantized at runtime using Friendli’s proprietary method, preserving quality while improving speed and cost efficiency.
This lets you select lower-VRAM GPU instances without a loss in performance.
Some models (e.g., those already quantized) may not be compatible with Online Quantization.
In certain cases, specific GPU instance types may not be available when this option is enabled.

Intelligent Autoscaling

Our autoscaling system automatically adjusts computational resources based on your traffic patterns, helping you optimize both performance and costs.

How Autoscaling Works

  • Minimum Replicas:
    • When set to 0, the endpoint enters a sleeping status during periods of inactivity, helping to minimize costs.
    • When set to a value greater than 0, the endpoint maintains at least that number of active replicas at all times.
  • Maximum Replicas: Defines the upper limit of replicas that can be created to handle increased traffic load.
  • Cooldown Period: The time delay before scaling down an active replica. This ensures the system doesn’t prematurely reduce capacity during temporary drops in traffic. A simplified sketch of this policy follows the list.
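
The sketch below mirrors these minimum/maximum/cooldown semantics for intuition only; the actual autoscaler uses Friendli’s own traffic signals and logic, and every name and threshold here is illustrative.

```python
# Simplified illustration of the replica-count policy described above.
# The real autoscaler uses its own traffic signals; this only mirrors the
# min/max/cooldown semantics for intuition.
import time
from dataclasses import dataclass, field

@dataclass
class AutoscalerSketch:
    min_replicas: int = 0
    max_replicas: int = 4
    cooldown_seconds: int = 300
    replicas: int = field(init=False, default=0)
    _scale_down_since: float = field(init=False, default=0.0)

    def __post_init__(self):
        self.replicas = self.min_replicas  # min_replicas == 0 means the endpoint can sleep when idle

    def _desired(self, in_flight: int, per_replica_capacity: int) -> int:
        needed = -(-in_flight // per_replica_capacity)  # ceiling division
        return max(self.min_replicas, min(self.max_replicas, needed))

    def step(self, in_flight: int, per_replica_capacity: int = 8) -> int:
        target = self._desired(in_flight, per_replica_capacity)
        if target > self.replicas:
            self.replicas = target              # scale up right away for traffic spikes
            self._scale_down_since = 0.0
        elif target < self.replicas:
            now = time.monotonic()
            if self._scale_down_since == 0.0:
                self._scale_down_since = now    # start the cooldown timer
            elif now - self._scale_down_since >= self.cooldown_seconds:
                self.replicas = target          # scale down only after the cooldown
                self._scale_down_since = 0.0
        else:
            self._scale_down_since = 0.0        # demand matches capacity; reset the timer
        return self.replicas

scaler = AutoscalerSketch(min_replicas=0, max_replicas=4, cooldown_seconds=300)
scaler.step(in_flight=40)  # traffic spike -> scales up toward the 4-replica cap
```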

Benefits of Autoscaling

  • Cost Optimization: Pay only for the resources you need by automatically scaling to zero during idle periods
  • Performance Management: Handle traffic spikes efficiently by automatically adding replicas
  • Resource Efficiency: Maintain optimal resource utilization across varying workload patterns

Serving Multi-LoRA Models

You can serve Multi-LoRA models using Friendli Dedicated Endpoints. For an overview of Multi-LoRA models, refer to our document on serving Multi-LoRA models with Friendli Container. In Friendli Dedicated Endpoints, Multi-LoRA models are supported only on the Enterprise plan. For pricing and availability, contact sales.
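
As a purely hypothetical sketch, the snippet below assumes each adapter is selected by passing its name in the model field of an OpenAI-compatible request; the routing convention and all identifiers are assumptions, so check the Friendli Container Multi-LoRA document for the exact usage.

```python
# Hypothetical sketch: querying two fine-tuned LoRA adapters served on one endpoint.
# The adapter-in-`model`-field routing convention and all identifiers are assumptions;
# see the Friendli Container Multi-LoRA document for the actual convention.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_FRIENDLI_TOKEN",              # placeholder token
    base_url="https://<your-endpoint-url>/v1",  # placeholder endpoint address
)

for adapter in ["<adapter-support>", "<adapter-marketing>"]:  # hypothetical adapter names
    reply = client.chat.completions.create(
        model=adapter,  # assumed: selects a specific fine-tuned adapter
        messages=[{"role": "user", "content": "Draft a one-line product update."}],
    )
    print(adapter, "->", reply.choices[0].message.content)
```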

Checking Endpoint Status

After creating the endpoint, you can view its health status and endpoint URL on the endpoint’s details page.
Billing for a dedicated endpoint starts while it is in the INITIALIZING status. Specifically, charges begin after the Initializing GPU phase, during which the endpoint waits to acquire a GPU. The endpoint then downloads the model and loads it onto the GPU, which usually takes less than a minute.

Model max context length and KV cache size

The model’s max context length is determined by how the model was trained, while the KV cache size depends on available GPU memory and is therefore affected by your instance type and Online Quantization setting. In some workloads, a KV cache smaller than the model’s max context length may still work fine.
However, to fully leverage the performance of the Friendli Inference Engine, we recommend first enabling Online Quantization (as it doesn’t require changing your instance), and then, if needed, selecting a GPU with more VRAM or increasing the number of GPUs.
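
As a rough illustration (a commonly used estimate, not the engine’s exact accounting), the KV cache grows linearly with the number of tokens kept in context; the model dimensions below are example values resembling an 8B grouped-query-attention model.

```python
# Rough KV-cache sizing (a common estimate; the engine's exact accounting may differ).
# Default dimensions are example values resembling an 8B grouped-query-attention model.
def kv_cache_gib(num_tokens: int, num_layers: int = 32, num_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_element: int = 2) -> float:
    # 2x accounts for both the key and value tensors stored per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * num_tokens / 2**30

print(f"{kv_cache_gib(8_192):.1f} GiB for an 8K-token context")     # ~1.0 GiB
print(f"{kv_cache_gib(131_072):.1f} GiB for a 128K-token context")  # ~16.0 GiB
```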

Using Playgrounds

To test the deployed model from the web, we provide a playground where you can interact with the model through a user-friendly chat interface. Simply enter your query, adjust your settings, and generate responses!
You can also send inference queries to your model through our API at the given endpoint address, accessible on the endpoint information tab, as in the sketch below.
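
A minimal sketch of calling the endpoint’s OpenAI-compatible chat completions route directly over HTTP; the URL, token, and model identifier are placeholders to replace with the values from your endpoint page.

```python
# A minimal sketch of calling the endpoint's OpenAI-compatible chat completions route
# directly over HTTP. The URL, token, and model identifier are placeholders; substitute
# the values shown on your endpoint's details page.
import requests

ENDPOINT_URL = "https://<your-endpoint-url>"  # placeholder endpoint address
TOKEN = "YOUR_FRIENDLI_TOKEN"                 # placeholder personal access token

resp = requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "<your-endpoint-or-model-id>",  # placeholder
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```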