Endpoints
Endpoints are the actual deployments of your models on your specified GPU resource.
What are Endpoints?
Endpoints are the actual deployments of your models on a dedicated GPU resource. They provide a stable and efficient interface to serve your models in real-world applications, ensuring high availability and optimized performance. With endpoints, you can manage model versions, scale resources, and seamlessly integrate your model into production environments.
Key Capabilities of Endpoints:
- Efficient Model Serving: Deploy models on powerful GPU instances optimized for your use case.
- Flexibility with Multi-LoRA Models: Serve multiple fine-tuned adapters alongside base models.
- Autoscaling: Automatically adjust resources to handle varying workloads, ensuring optimal performance and cost efficiency.
- Monitoring and Management: Check endpoint health, adjust configurations, and view logs directly from the platform.
- Interactive Testing: Use the integrated playground to test your models before integrating them into applications.
- API Integration: Access your models via robust OpenAI-compatible APIs, enabling easy integration into any system.
Creating Endpoints
You can create your endpoint by specifying the name, the model, and the instance configuration, consisting of your desired GPU specification.
Serving Multi-LoRA Models
You can serve Multi-LoRA models using Friendli Dedicated Endpoints. For an overview of Multi-LoRA models, refer to our document on serving Multi-LoRA models with Friendli Container.
Uploading Model Checkpoints
To serve a pre-trained LLM with fine-tuned adapters, you’ll need to upload both the pre-trained LLM (i.e. base model) checkpoints and the adapter model checkpoints to our service.
Adding LoRA Adapters
After uploading, click the Uploaded model
button. You can view the pre-trained LLM you’ve just uploaded.
Once the base model is selected, the Add LoRA Adapter
button will become clickable.
Click this button to select multiple LoRA adapters to serve alongside the base model. Adding LoRA adapters enables you to perform inference on the pre-trained base model or any of the selected adapters. You can also configure a route name for each adapter model; this route name should be appended to the model field in the inference request body when making requests.
Checking Endpoint Status
After creating the Endpoint, you can view its health status and Endpoint URL on the Endpoint’s details page.
The cost of using dedicated endpoints accumulates from the INITIALIZING
status.
Specifically, charges begin after the Initializing GPU
phase, where the endpoint waits to acquire the GPU.
The endpoint then downloads and loads the model onto the GPU, which usually takes less than a minute.
Using Playgrounds
To test the deployed model via the web, we provide a playground interface where you can interact with the model using a user-friendly chat interface. Simply enter your query, adjust your settings, and generate your responses!
Send inference queries to your model through our API at the given endpoint address, accessible on the endpoint information tab.
Was this page helpful?