What are Endpoints?
Endpoints are the actual deployments of your models on a dedicated GPU resource. They provide a stable and efficient interface to serve your models in real-world applications, ensuring high availability and optimized performance. With endpoints, you can manage model versions, scale resources, and seamlessly integrate your model into production environments.

Key Capabilities of Endpoints:
- Efficient Model Serving: Deploy models on powerful GPU instances optimized for your use case.
- Flexibility with Multi-LoRA Models: Serve multiple fine-tuned adapters alongside base models.
- Autoscaling: Automatically adjust resources to handle varying workloads, ensuring optimal performance and cost efficiency.
- Monitoring and Management: Check endpoint health, adjust configurations, and view logs directly from the platform.
- Interactive Testing: Use the integrated playground to test your models before integrating them into applications.
- API Integration: Access your models via robust OpenAI-compatible APIs, enabling easy integration into any system.
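For example, the sketch below shows a minimal call to a dedicated endpoint through its OpenAI-compatible Chat Completions API; the base URL, endpoint ID, and token are placeholders, so substitute the values shown on your endpoint's details page.

```python
# Minimal sketch: calling a dedicated endpoint via its OpenAI-compatible API.
# The base URL, endpoint ID, and token below are placeholders; use the values
# from your endpoint's details page and your own Friendli token.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",  # placeholder; copy from your endpoint page
    api_key="YOUR_FRIENDLI_TOKEN",
)

response = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",  # the dedicated endpoint you deployed
    messages=[{"role": "user", "content": "Give me a one-sentence summary of dedicated endpoints."}],
)
print(response.choices[0].message.content)
```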
Creating Endpoints
You can create your endpoint by specifying its name, the model, and the instance configuration, i.e., your desired GPU type and count.
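As a purely illustrative sketch, the information you provide at creation time boils down to a few fields; the field and model names below are hypothetical and only mirror the form described above, not an actual API schema.

```python
# Purely illustrative: the pieces of information specified when creating an endpoint.
# Field names and the model identifier are hypothetical, not an actual API schema.
endpoint_config = {
    "name": "my-chat-endpoint",             # endpoint name
    "model": "my-org/my-fine-tuned-model",  # model to deploy (hypothetical)
    "instance": {
        "gpu_type": "NVIDIA A100 80GB",     # desired GPU specification
        "gpu_count": 1,                     # number of GPUs
    },
}
```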
Selecting Instance
Instance selection depends on your model size and workload.
If the selected instance may not have enough GPU memory for your model, it is marked with a TIGHT MEMORY warning. In such cases, we recommend enabling Online Quantization or increasing the GPU count.
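As a rough rule of thumb (a back-of-the-envelope heuristic, not an exact sizing tool), the memory needed just for the model weights scales with parameter count and numeric precision, which also shows why quantization or additional GPUs relieves memory pressure:

```python
# Back-of-the-envelope weight-memory estimate (rough heuristic, not an exact sizer).
# The weights alone need roughly num_params * bytes_per_param; the KV cache and
# activations require additional headroom on top of this.
def weight_memory_gib(num_params_billion: float, bytes_per_param: float) -> float:
    return num_params_billion * 1e9 * bytes_per_param / 2**30

params_b = 70  # e.g. a 70B-parameter model (illustrative)
print(f"FP16/BF16 weights: ~{weight_memory_gib(params_b, 2):.0f} GiB")        # ~130 GiB
print(f"8-bit quantized weights: ~{weight_memory_gib(params_b, 1):.0f} GiB")  # ~65 GiB
# If the estimate exceeds the VRAM of the selected instance, enable Online
# Quantization or increase the GPU count so the weights and KV cache fit.
```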
Available Features

Online Quantization
This feature's description has moved to the Online Quantization page; please refer to that page for more information.

Speculative Decoding
This feature's description has moved to the Speculative Decoding page; please refer to that page for more information.

Serving Multi-LoRA Models
This feature's description has moved to the Multi-LoRA Serving page; please refer to that page for more information.

Custom Chat Templates
Customize chat formatting by uploading your own Jinja templates when creating Dedicated Endpoint instances. This overrides the model's default chat template and gives you full control over how inputs and outputs are displayed.
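For example, here is a minimal sketch of a custom template, assuming Hugging Face-style template variables (messages, add_generation_prompt) and a simple role-tagged format; the actual template you upload should match your model's expected prompt structure.

```python
# Minimal sketch of a custom Jinja chat template (assumes Hugging Face-style
# variables and a simple role-tagged format; adapt it to your model's prompt style).
template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>\n{{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)

# Save to a file and upload it when creating the Dedicated Endpoint instance.
with open("chat_template.jinja", "w") as f:
    f.write(template)
```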
Reasoning Parsing
You can configure the default behavior of an endpoint by setting the parse_reasoning configuration during its creation. This default applies when the corresponding argument is not explicitly provided in incoming requests.
For more details, refer to the Reasoning Parsing with Friendli documentation.
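As a sketch of the per-request override, assuming the request-level argument is also named parse_reasoning and can be passed as an extra body field on the OpenAI-compatible API (see the documentation above for the exact name and placement):

```python
# Hedged sketch: overriding the endpoint's default reasoning-parsing behavior per request.
# The extra_body field name (parse_reasoning) is an assumption; consult the
# Reasoning Parsing with Friendli documentation for the exact argument.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",  # placeholder base URL
    api_key="YOUR_FRIENDLI_TOKEN",
)

response = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    extra_body={"parse_reasoning": True},  # override the endpoint-level default
)
print(response.choices[0].message.content)
```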
Checking Endpoint Status
After creating an endpoint, you can view its health status and endpoint URL on the endpoint's details page.
The cost of using dedicated endpoints accumulates from the INITIALIZING status. Specifically, charges begin after the Initializing GPU phase, where the endpoint waits to acquire the GPU.
The endpoint then downloads and loads the model onto the GPU, which usually takes less than a minute.

Model max context length and KV cache size
Model max context length is determined by how the model was trained, while KV cache size depends on memory and is affected by your instance type and Online Quantization setting. In some workloads, having a KV cache size smaller than the model max context length may still work fine. However, to fully leverage the performance of the Friendli Inference Engine, we recommend first enabling Online Quantization (as it doesn't require changing your instance) and then, if needed, selecting a GPU with more VRAM or increasing the number of GPUs.
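As a rough sketch of why KV cache capacity is memory-bound, the cache for a dense transformer grows linearly with context length; the formula and model shape below are standard approximations and assumed values, not Friendli-specific figures.

```python
# Rough KV-cache size estimate for a dense transformer (standard approximation).
# Per token, each layer stores a key and a value vector of size num_kv_heads * head_dim.
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * context_len / 2**30

# Illustrative 70B-class shape with grouped-query attention (assumed values):
print(f"~{kv_cache_gib(80, 8, 128, 128_000):.0f} GiB of KV cache at 128K context")  # ~39 GiB
# This must fit in the VRAM left over after the model weights, which is why
# Online Quantization, more VRAM, or more GPUs lets you serve longer contexts.
```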
Using Playgrounds
To test the deployed model from the web, we provide a playground where you can interact with the model through a user-friendly chat interface. Simply enter your query, adjust your settings, and generate responses!