Introduction
As the demand for highly specialized AI capabilities surges, deploying multiple customized large language models (LLMs) without additional GPU resources represents a significant leap forward. The Friendli Engine addresses this challenge through Multi-LoRA (Low-Rank Adaptation) serving, which lets you simultaneously serve multiple LLMs optimized for specific tasks without extensive retraining. This advancement opens new avenues for AI efficiency and adaptability, promising to revolutionize the deployment of AI solutions on constrained hardware. This article provides an overview of serving Multi-LoRA models efficiently with the Friendli Engine.
Prerequisite
Install huggingface-cli in your local environment.
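huggingface-cli ships with the huggingface_hub package, so a typical install looks like this:

```bash
# huggingface-cli is bundled with the huggingface_hub package.
pip install -U "huggingface_hub[cli]"

# Optional: log in so you can download gated repos (e.g., meta-llama models).
huggingface-cli login
```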
Downloading adapter checkpoints
Download each adapter model you want to serve to your local storage.
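For example, you can download the adapter used later in this article with huggingface-cli; the target directory ./adapter/model1 is just an illustrative choice:

```bash
# Download the adapter repo from the Hugging Face Hub into local storage.
huggingface-cli download FinGPT/fingpt-forecaster_dow30_llama2-7b_lora \
  --local-dir ./adapter/model1
```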
If an adapter's Hugging Face repo does not contain an adapter_model.safetensors checkpoint file, you have to manually convert adapter_model.bin into adapter_model.safetensors. You can use the official conversion app or a short Python script, as sketched below.
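A minimal conversion sketch, assuming torch and safetensors are installed and the checkpoint is a plain PyTorch state dict:

```bash
python - <<'EOF'
import torch
from safetensors.torch import save_file

# Load the pickled adapter weights from the .bin checkpoint.
state_dict = torch.load("adapter_model.bin", map_location="cpu")

# safetensors requires contiguous tensors, so normalize before saving.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}

# Re-save the weights in safetensors format.
save_file(state_dict, "adapter_model.safetensors")
EOF
```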
Launch Friendli Engine in container
Once you have prepared the adapter model checkpoints, you can serve the Multi-LoRA model with Friendli Container. In addition to the command for running the base model, you have to add the --adapter-model argument.
--adapter-model: Adds an adapter model as a name:path pair. The path can be a local checkpoint directory or a Hugging Face Hub repo name.
For the other [LAUNCH_OPTIONS], see Running Friendli Container: Launch Options.
If you want to launch with multiple adapters, pass --adapter-model a comma-separated string (e.g. --adapter-model "adapter_name_0:/adapter/model1,adapter_name_1:/adapter/model2").
Example: Llama 2 7B Chat + LoRA Adapter
The following example runs meta-llama/Llama-2-7b-chat-hf with the FinGPT/fingpt-forecaster_dow30_llama2-7b_lora adapter model.
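The sketch below follows the basic launch command from Running Friendli Container; the image tag, GPU selection, port, secret handling, and the adapter name fin-forecaster are illustrative placeholders you should adjust to your setup. Only the --adapter-model argument is specific to Multi-LoRA serving.

```bash
# Illustrative Multi-LoRA launch. Everything except --adapter-model follows the
# basic launch command from "Running Friendli Container"; adapt the image tag,
# GPU selection, port, and secret to your environment.
docker run --gpus '"device=0"' -p 8000:8000 --ipc=host \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ./adapter/model1:/adapter/model1 \
  registry.friendli.ai/trial \
    --hf-model-name meta-llama/Llama-2-7b-chat-hf \
    --adapter-model "fin-forecaster:/adapter/model1"
```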
Sending a request to a specific adapter
You can generate an inference result from a specific adapter model by specifying model in the body of an inference request.
For example, assuming you set the --adapter-model launch option to "<adapter-model-name>:<adapter-file-path>", you can send a request to that adapter as follows.
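A minimal sketch, assuming the engine is listening on localhost:8000 and exposes an OpenAI-compatible completions endpoint; fin-forecaster is the illustrative adapter name from the launch example above:

```bash
# The "model" field selects the adapter by the name given in --adapter-model.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fin-forecaster",
    "prompt": "AAPL closed up 2% today. What is the one-week outlook?",
    "max_tokens": 128
  }'
```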
Sending a request to the base model
If you omit the model field in your request, the base model will be used to generate the inference result.
You can send a request to the base model as shown below.
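Under the same assumptions as above (localhost:8000, an OpenAI-compatible completions endpoint), simply leave out the model field:

```bash
# No "model" field, so the request is served by the base model
# (meta-llama/Llama-2-7b-chat-hf in this example).
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Summarize what a LoRA adapter does in one sentence.",
    "max_tokens": 128
  }'
```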