Introduction

Effortlessly fine-tune your Vision Language Model (VLM) with Friendli Dedicated Endpoints, which leverages Parameter-Efficient Fine-Tuning (PEFT) to reduce training costs while preserving quality comparable to full-parameter fine-tuning. Fine-tuning can turn your model into an expert on specific visual tasks and improve its ability to understand and describe images accurately.

In this tutorial, we will cover:

  • How to upload your image-text dataset for VLM fine-tuning.
  • How to fine-tune state-of-the-art VLMs like Qwen2.5-VL-32B-Instruct and gemma-3-27b-it on your dataset.
  • How to deploy your fine-tuned VLM model.

Prerequisites

  1. Head to Friendli Suite and create an account.
  2. Issue a Friendli Token and store it safely.

Step 1. Prepare Your Dataset

Your dataset should be a conversational dataset in JSONL format, where each line represents one conversation as a sequence of messages. Each message should include a "role" (e.g., system, user, or assistant) and "content". For VLM fine-tuning, user content can contain both text and image data; images can be provided as URLs or as Base64-encoded data URIs.

Here’s an example of what it should look like. Note that it’s one line but beautified for readability:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
        },
        {
          "type": "image",
          "image": "data:image/png;base64,<base64-encoded-data>"
        },
        {
          "type": "text",
          "text": "Describe this image in detail."
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The image is a bee."
    }
  ]
}
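
If your images are stored locally rather than hosted at a URL, you can embed them as Base64 data URIs, as in the second image entry above. Here is a minimal sketch for building such a sample; the file names and the assistant answer are placeholders:

import base64
import json

def image_to_data_uri(path: str, mime: str = "image/png") -> str:
    """Encode a local image file as a Base64 data URI."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_to_data_uri("bee.png")},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        },
        {"role": "assistant", "content": "The image is a bee."},
    ]
}

# Each sample occupies exactly one line in the JSONL file.
with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")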

You can access our example datasets ‘FriendliAI/gsm8k’ (for chat) and ‘FriendliAI/sample-vision’ (for chat with images), and explore some of our quantized generative AI models on our Hugging Face page.

Step 2. Upload Your Dataset

Once you have prepared your dataset, upload it to Friendli using the Python SDK:

import os

from friendli.friendli import SyncFriendli
from friendli.models import Sample

TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]

# Read dataset
with open("dataset.jsonl", "rb") as f:
    data = [Sample.model_validate_json(line) for line in f]

with SyncFriendli(
    token=TOKEN,
    x_friendli_team=TEAM_ID,
) as friendli:
    # Create dataset
    with friendli.dataset.create(
        modality=["TEXT", "IMAGE"],
        name="test-create-dataset-sync",
        project_id=PROJECT_ID,
    ) as dataset:
        # Add samples to dataset
        dataset.add_samples(
            samples=data,
            split="train",
        )

To view and edit the datasets you’ve uploaded, visit Friendli Suite > Dataset.
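
If you also have a held-out portion of your data for evaluation (used by the Evaluation Steps hyperparameter in the next step), you can add it to the same dataset under a separate split. A minimal sketch reusing the add_samples call shown above, assuming a validation.jsonl file and that "validation" is the accepted split name:

# Inside the same `with friendli.dataset.create(...) as dataset:` block as above.
with open("validation.jsonl", "rb") as f:
    val_data = [Sample.model_validate_json(line) for line in f]

dataset.add_samples(
    samples=val_data,
    split="validation",  # assumption: "validation" is the split name expected here
)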


Step 3. Fine-tune Your VLM

Go to Friendli Suite > Fine-tuning, and click the ‘New job’ button to create a new job.


In the job creation form, you’ll need to configure the following settings:

  1. Job Name:

    • Enter a name for your fine-tuning job.
    • If not provided, a name will be automatically generated (e.g., accomplished-shark).
  2. Model:

    • Choose your base model from one of these sources:
      • Hugging Face: Select from models available on Hugging Face.
      • Weights & Biases: Use a model from your W&B projects.
      • Uploaded model: Use a model you’ve previously uploaded.
  3. Dataset:

    • Select the dataset to use.
  4. Weights & Biases Integration (Optional):

    • Enable W&B tracking by providing your W&B project name.
    • This will allow you to monitor training metrics in W&B.
  5. Hyperparameters:

    • Learning Rate (required): Initial learning rate for the optimizer (e.g., 0.0001).
    • Batch Size (required): Total batch size used for training (e.g., 16).
    • Total Training Length (required), specified as either:
      • Number of Training Epochs: Total number of training epochs to perform (e.g., 1).
      • Training Steps: Total number of training steps to perform (e.g., 1000).
    • Evaluation Steps (required): Number of training steps between evaluations on the validation set (e.g., 300).
    • LoRA Rank (optional): Rank of the LoRA parameters (e.g., 16).
    • LoRA Alpha (optional): Scaling factor that determines the influence of the low-rank matrices during fine-tuning (e.g., 32).
    • LoRA Dropout (optional): Dropout rate applied to the LoRA path during fine-tuning (e.g., 0.1). See the sketch after this list for how these three LoRA parameters interact.
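
The three LoRA hyperparameters combine into a single update rule: the adapter trains two low-rank matrices A (rank × d_in) and B (d_out × rank) while the base weight W stays frozen, and the layer output becomes y = Wx + (alpha / rank) · BAx, with dropout applied to the adapter's input. A minimal NumPy sketch of this rule; all shapes and values below are illustrative only, not tied to any specific model:

import numpy as np

d_out, d_in = 1024, 1024
rank, alpha, p_drop = 16, 32, 0.1  # LoRA Rank, LoRA Alpha, LoRA Dropout

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen base weight (not trained)
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x: np.ndarray, train: bool = True) -> np.ndarray:
    x_lora = x
    if train and p_drop > 0:
        # Inverted dropout on the adapter branch only; the frozen path sees x unchanged.
        x_lora = x * (rng.random(x.shape) > p_drop) / (1 - p_drop)
    # Because B starts at zero, the adapted layer initially matches the base model exactly.
    return W @ x + (alpha / rank) * (B @ (A @ x_lora))

y = lora_forward(rng.standard_normal(d_in))

A higher rank gives the adapter more capacity (and more trainable parameters), while the alpha / rank factor rescales the update so that changing the rank does not silently change the update's magnitude.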

After configuring these settings, click the ‘Create’ button at the bottom to start your fine-tuning job.

Step 4. Monitor Training Progress

You can now monitor the progress of your fine-tuning job on Friendli Suite.

If you have integrated your Weights & Biases (W&B) account, you can also monitor the training status in your W&B project. Read our FAQ section on using W&B with dedicated fine-tuning to learn more about monitoring your fine-tuning jobs on their platform.

Step 5. Deploy Your Fine-tuned Model

Once the fine-tuning process is complete, you can immediately deploy the model by clicking the ‘Deploy’ button in the top right corner. The name of the fine-tuned LoRA adapter will be the same as your fine-tuning job name.

For more information about deploying a model, refer to Endpoints documentation.
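
As a rough sketch of querying the deployed endpoint, assuming the OpenAI-compatible chat completions interface that Friendli endpoints expose; the base URL and the endpoint ID below are placeholders, so check your endpoint's overview page for the actual values:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_TOKEN"],
    base_url="https://api.friendli.ai/dedicated/v1",  # placeholder; confirm in Friendli Suite
)

response = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",  # placeholder; copy the ID of your deployed endpoint
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
                    },
                },
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
)
print(response.choices[0].message.content)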

Resources

Explore these additional resources to learn more about VLM fine-tuning and optimization: