Uploading datasets
This document explains how to upload datasets. On Friendli, you can upload datasets via the web interface or the SDK.
You can easily upload datasets through the web interface. Files in .jsonl and .parquet formats are supported, and each dataset should be structured as follows:
Conversation
This is the most basic dataset format. The role field can be system, user, or assistant.
{"messages": [{"role": "...", "content": "..."}]}
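A quick local check before uploading can catch malformed records early. The following is a minimal sketch that uses only the Python standard library; `validate_conversation` and `ALLOWED_ROLES` are hypothetical names chosen for illustration and are not part of Friendli.

```python
import json

# Roles named in the Conversation format description.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_conversation(line: str) -> dict:
    """Parse one JSONL line and check that it follows the Conversation format."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            raise ValueError(f"message {i}: must be a JSON object")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unexpected role {message.get('role')!r}")
        if "content" not in message:
            raise ValueError(f"message {i}: missing 'content'")
    return record


# Example: a two-turn conversation serialized as a single JSONL line.
line = json.dumps({"messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]})
record = validate_conversation(line)
```

The function raises `ValueError` on the first problem it finds, so it can be run over every line of a file before upload.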
Alpaca (Beta)
Two types of Alpaca datasets are supported as shown below.
For compatibility with the Conversation format, they are automatically converted according to a template during upload. If you do not want automatic conversion, please convert to the Conversation format before uploading, or use the SDK to upload.
{"instruction": "...", "output": "..."}
{"instruction": "...", "input": "...", "output": "..."}
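If you prefer to convert Alpaca records yourself before uploading, something like the sketch below may be a starting point. The template Friendli applies during automatic conversion is not described here, so the mapping used (instruction and optional input joined by a blank line as the user message, output as the assistant message) is purely an assumption, and `alpaca_to_conversation` is a hypothetical helper name.

```python
def alpaca_to_conversation(record: dict) -> dict:
    """Convert one Alpaca record into the Conversation format.

    ASSUMPTION: this mapping is illustrative only and is not Friendli's
    conversion template.
    """
    prompt = record["instruction"]
    # The "input" field is optional in the two supported Alpaca variants.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]},
        ]
    }


converted = alpaca_to_conversation(
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"}
)
```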
Multi-modal (image)
For multi-modal inputs, the following three formats are supported for compatibility.
Currently, the web interface does not support local path, base64, or PIL.Image objects. For these cases, please use the SDK to upload.
{"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image": "https://example.com/image.jpg"}]}]}
{"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image_url": "https://example.com/image.jpg"}]}]}
{"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}]}]}
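Since the three variants differ only in where the image URL is stored, it may be convenient to normalize them to one shape in your own tooling. The sketch below is one way to do that; `extract_image_url` and `normalize_part` are hypothetical helpers, and picking the first variant as the target shape is an arbitrary choice for illustration.

```python
from typing import Optional


def extract_image_url(part: dict) -> Optional[str]:
    """Return the image URL from any of the three supported shapes, else None."""
    kind = part.get("type")
    if kind == "image":
        # Variants 1 and 2: the URL is stored under "image" or "image_url".
        return part.get("image") or part.get("image_url")
    if kind == "image_url":
        # Variant 3: the URL is nested under image_url.url.
        return part["image_url"]["url"]
    return None


def normalize_part(part: dict) -> dict:
    """Rewrite an image part into the {"type": "image", "image": ...} shape."""
    url = extract_image_url(part)
    # Non-image parts (for example text) are returned unchanged.
    return part if url is None else {"type": "image", "image": url}


url = "https://example.com/image.jpg"
variants = [
    {"type": "image", "image": url},
    {"type": "image", "image_url": url},
    {"type": "image_url", "image_url": {"url": url}},
]
normalized = [normalize_part(p) for p in variants]
```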
How to upload a dataset
First, go to Friendli Suite > Labs > Datasets.
Click the ‘New Dataset’ button to start the upload process.
From the dropdown, select the ‘Upload a file directly’ option.
Click the File Upload Area in the Dataset file section, or drag and drop the file you want to upload. Then click the ‘Upload’ button to start uploading.
Friendli uploads the dataset progressively in the background. Once the upload is complete, you can rename it, add splits, and preview each split.
Prerequisites
- Head to Friendli Suite and create an account.
- Issue a Personal API Key by going to Personal Settings > API Keys.
Make sure to copy the key and store it securely, as you won’t be able to see it again after refreshing the page.
For detailed instructions, see Personal API Keys.
Step 1. Prepare your dataset
Your dataset should be a conversational dataset in .jsonl or .parquet format, where each line represents a sequence of messages. Each message in the conversation should include a "role" (e.g., system, user, or assistant) and "content". For VLM fine-tuning, user content can contain both text and image data. Image data can be given as a URL or as Base64.
Here’s an example of what it should look like. Note that it’s one line in the actual file; it is expanded here for readability:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
        },
        {
          "type": "image",
          "image": "data:image/png;base64,<base64-encoded-data>"
        },
        {
          "type": "text",
          "text": "Describe this image in detail."
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The image is a bee."
    }
  ]
}
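To produce such a line programmatically, a sketch along these lines may help. It uses only the Python standard library; `image_data_url` and `to_jsonl_line` are hypothetical helper names, and the image bytes are a stand-in rather than a real PNG.

```python
import base64
import json


def image_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a Base64 data URL like the one in the example."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def to_jsonl_line(sample: dict) -> str:
    """Serialize one sample as a single line; json.dumps escapes embedded newlines."""
    return json.dumps(sample)


sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                # Placeholder bytes; in practice, read them from an image file.
                {"type": "image", "image": image_data_url(b"abc")},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        },
        {"role": "assistant", "content": "The image is a bee."},
    ]
}
line = to_jsonl_line(sample)
```

Writing `line` followed by a newline for every sample yields a .jsonl file with one conversation per line.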
Step 2. Upload your dataset
Once you have prepared your dataset, you can upload it to Friendli using the Python SDK.
Install the Python SDK
First, install the Friendli Python SDK:
# Using pip
pip install friendli
# Using poetry
poetry add friendli
Upload your dataset
Use the following code to create a dataset and upload your samples:
import os

from friendli.friendli import SyncFriendli
from friendli.models import Sample

TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["API_KEY"]

# Read dataset file and parse each line as a Sample
with open("dataset.jsonl", "rb") as f:
    data = [Sample.model_validate_json(line) for line in f]

with SyncFriendli(
    token=TOKEN,
    x_friendli_team=TEAM_ID,
) as friendli:
    # Create a new dataset with TEXT and IMAGE modalities
    with friendli.dataset.create(
        modality=["TEXT", "IMAGE"],
        name="my-vlm-dataset",  # name of the dataset
        project_id=PROJECT_ID,
    ) as dataset:
        # Upload samples to the dataset
        # Each line from your dataset file becomes a separate sample
        dataset.upload_samples(
            samples=data,
            split="train",  # name of the split to upload to
        )
How it works
The Friendli Python SDK doesn’t upload your entire dataset file at once. Instead, it processes your dataset in the following steps:
- Reads your dataset file line by line: Each line is parsed as a Sample object containing a conversation with messages.
- Creates a dataset: A new dataset is created in your Friendli project with the specified modalities (TEXT and IMAGE).
- Uploads each conversation as a separate sample: Rather than uploading the entire file, each conversation (line in the dataset file) becomes an individual sample in the dataset.
- Organizes by splits: Samples are organized into splits like “train”, “validation”, or “test” for different purposes.
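The line-by-line reading and split organization described above can be sketched locally without the SDK. This is an illustration only: `iter_records` and `assign_splits` are hypothetical helpers, plain dicts stand in for the SDK's Sample objects, and the rule of sending every tenth record to "validation" is an arbitrary assumption.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, one line at a time."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def assign_splits(records: Iterable[dict], validation_every: int = 10) -> Dict[str, List[dict]]:
    """Route every Nth record to "validation" and the rest to "train"."""
    splits: Dict[str, List[dict]] = {"train": [], "validation": []}
    for index, record in enumerate(records):
        is_validation = index % validation_every == validation_every - 1
        splits["validation" if is_validation else "train"].append(record)
    return splits


# Twenty tiny conversations standing in for the lines of a dataset file.
lines = [
    json.dumps({"messages": [{"role": "user", "content": f"question {i}"}]})
    for i in range(20)
]
splits = assign_splits(iter_records(lines))
```

An open file object can be passed to `iter_records` in place of the list, since iterating over a text file yields its lines.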
Environment variables
Make sure to set the required environment variables:
export API_KEY="your-api-key"
export FRIENDLI_TEAM_ID="your-team-id"
export FRIENDLI_PROJECT_ID="your-project-id"
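Because the upload script reads these variables with `os.environ[...]`, a missing one surfaces as a `KeyError`. A small pre-flight check such as the sketch below can give a clearer message; `REQUIRED_VARS` and `missing_variables` are hypothetical names introduced for illustration.

```python
import os

# The three variables read by the upload script.
REQUIRED_VARS = ("API_KEY", "FRIENDLI_TEAM_ID", "FRIENDLI_PROJECT_ID")


def missing_variables(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```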
You can find your Team ID and Project ID in the URL of Friendli Suite, formatted as https://friendli.ai/<teamId>/<projectId>/....
View your dataset
To view and edit the datasets you’ve uploaded, visit Friendli Suite > Datasets.