Dataset Specifications and Upload Guide
Learn how to upload datasets for fine-tuning models on Friendli.
Uploading Datasets
This document explains how to upload datasets for fine-tuning. On Friendli, you can upload datasets via the web interface or the SDK.
You can easily upload datasets through the web interface. Files in .jsonl
and .parquet
formats are supported, and each dataset should be structured as follows:
Conversation
This is the most basic dataset format. The role
field can be system
, user
, or assistant
.
Alpaca (Beta)
Two types of Alpaca datasets are supported as shown below.
For compatibility with the Conversation format, they are automatically converted according to a template during upload. If you do not want automatic conversion, please convert to the Conversation format before uploading, or use the SDK to upload.
Multi-Modal (Image)
For multi-modal inputs, the following three formats are supported for compatibility.
Currently, the web interface does not support local path
, base64
, or PIL.Image
objects. For these cases, please use the SDK to upload.
How to Upload a Dataset
First, go to the ‘Datasets’ section in the Friendli Suite.
Click the ‘New Dataset’ button to start the upload process.
From the dropdown, select ‘Upload a file directly’ option.
Click the File Upload Area in the Dataset file section, or drag and drop the file you want to upload. Then click the ‘Upload’ button to start uploading.
The dataset will be uploaded progressively in the background. Once the upload is complete, you can rename it, add splits, and preview each split.
You can easily upload datasets through the web interface. Files in .jsonl
and .parquet
formats are supported, and each dataset should be structured as follows:
Conversation
This is the most basic dataset format. The role
field can be system
, user
, or assistant
.
Alpaca (Beta)
Two types of Alpaca datasets are supported as shown below.
For compatibility with the Conversation format, they are automatically converted according to a template during upload. If you do not want automatic conversion, please convert to the Conversation format before uploading, or use the SDK to upload.
Multi-Modal (Image)
For multi-modal inputs, the following three formats are supported for compatibility.
Currently, the web interface does not support local path
, base64
, or PIL.Image
objects. For these cases, please use the SDK to upload.
How to Upload a Dataset
First, go to the ‘Datasets’ section in the Friendli Suite.
Click the ‘New Dataset’ button to start the upload process.
From the dropdown, select ‘Upload a file directly’ option.
Click the File Upload Area in the Dataset file section, or drag and drop the file you want to upload. Then click the ‘Upload’ button to start uploading.
The dataset will be uploaded progressively in the background. Once the upload is complete, you can rename it, add splits, and preview each split.
Prerequisites
- Head to Friendli Suite and create an account.
- Issue a Friendli Token by going to Personal settings > Tokens.
Make sure to copy and store it securely in a safe place as you won’t be able to see it again after refreshing the page.
For detailed instructions, see Personal Access Tokens.
Step 1. Prepare Your Dataset
Your dataset should be a conversational dataset in .jsonl
or .parquet
format, where each line represents a sequence of messages. Each message in the conversation should include a "role"
(e.g., system
, user
, or assistant
) and "content"
. For VLM fine-tuning, user content can contain both text and image data (Note that for image data, we support URL and Base64).
Here’s an example of what it should look like. Note that it’s one line but beautified for readability:
You can access our example dataset ‘FriendliAI/gsm8k’ (for Chat), ‘FriendliAI/sample-vision’ (for Chat with image) and explore some of our quantized generative AI models on our Hugging Face page.
Step 2. Upload Your Dataset
Once you have prepared your dataset, you can upload it to Friendli using the Python SDK.
Install the Python SDK
First, install the Friendli Python SDK:
Upload Your Dataset
Use the following code to create a dataset and upload your samples:
How It Works
Friendli Python SDK doesn’t upload your entire dataset file at once. Instead, it processes your dataset more efficiently:
-
Reads your dataset file line by line: Each line is parsed as a
Sample
object containing a conversation with messages. -
Creates a dataset: A new dataset is created in your Friendli project with the specified modalities (
TEXT
andIMAGE
). -
Uploads each conversation as a separate sample: Rather than uploading the entire file, each conversation (line in the dataset file) becomes an individual sample in the dataset.
-
Organizes by splits: Samples are organized into splits like “train”, “validation”, or “test” for different purposes during fine-tuning.
Environment Variables
Make sure to set the required environment variables:
You can find your Team ID and Project ID in the URL of Friendli Suite, formatted as https://friendli.ai/<teamId>/<projectId>/...
.
View Your Dataset
To view and edit the datasets you’ve uploaded, visit Friendli Suite > Dataset.
Next Steps
Now that you have uploaded your dataset, you can proceed to fine-tune your model.