POST /serverless/v1/audio/transcriptions
Audio transcriptions
curl --request POST \
  --url https://api.friendli.ai/serverless/v1/audio/transcriptions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form file='@example-file' \
  --form model=openai/whisper-large-v3
{
  "text": "Hello, how are you?",
  "usage": {
    "type": "tokens",
    "input_tokens": 20,
    "output_tokens": 10,
    "total_tokens": 30,
    "input_audio_length_ms": 18000,
    "processed_audio_length_ms": 24000,
    "input_token_details": {
      "audio_tokens": 10,
      "text_tokens": 10
    }
  }
}
Given an audio file, the model transcribes it into text. See available models at this pricing table.

To make a successful request, you must provide a Friendli Token (e.g. flp_XXX) in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this token, and visit here to generate one.

When streaming mode is used (i.e., the stream option is set to true), the response has MIME type text/event-stream; otherwise, the content type is application/json. You can view the schema of the streamed sequence of chunk objects in streaming mode here.
You can explore examples on the Friendli Serverless Endpoints playground and adjust settings with just a few clicks.
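In non-streaming mode, the application/json body can be consumed directly. A minimal sketch in Python, using the example response shown above as the payload (field names follow the response schema documented below):

```python
import json

# Sample non-streaming response body, copied from the example above.
raw = """
{
  "text": "Hello, how are you?",
  "usage": {
    "type": "tokens",
    "input_tokens": 20,
    "output_tokens": 10,
    "total_tokens": 30,
    "input_audio_length_ms": 18000,
    "processed_audio_length_ms": 24000,
    "input_token_details": {"audio_tokens": 10, "text_tokens": 10}
  }
}
"""

resp = json.loads(raw)
print(resp["text"])  # the transcribed text

usage = resp["usage"]
# total_tokens is the sum of input and output tokens.
assert usage["total_tokens"] == usage["input_tokens"] + usage["output_tokens"]
print(usage["input_audio_length_ms"] / 1000, "seconds of input audio")
```

Note that input_audio_length_ms reports the length of the audio you sent, while processed_audio_length_ms reports how much audio the server actually processed; the two may differ.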

Authorizations

Authorization
string
header
required

When using Friendli Suite API for inference requests, you need to provide a Friendli Token for authentication and authorization purposes.

For more detailed information, please refer here.

Headers

X-Friendli-Team
string | null

ID of the team to run requests as (optional).

Body

multipart/form-data
model
string
required

Code of the model to use. See the available model list.

Example:

"openai/whisper-large-v3"

file
file
required

The audio file object (not the file name) to transcribe, in a standard audio format such as mp3, wav, flac, or ogg.

chunking_strategy

Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose chunk boundaries. A server_vad object can be provided to tune the VAD parameters manually. If unset, the audio is transcribed as a single block.

Allowed value: "auto"
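For illustration, the non-file form fields of a request that lets the server pick chunk boundaries could be assembled like this (a sketch; only the documented "auto" value is used, and the language hint is optional):

```python
# Form fields for a transcription request using server-side chunking.
# These accompany the audio file in the multipart/form-data body.
form = {
    "model": "openai/whisper-large-v3",
    "chunking_strategy": "auto",  # documented allowed value: "auto"
    "language": "en",             # optional ISO-639-1 hint (see below)
}

for key, value in form.items():
    print(f"--form {key}={value}")
```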
language
string | null

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.

stream
boolean | null

Whether to stream the transcription result. When set to true, the transcription result will be streamed as server-sent events once generated.
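With stream set to true, the body arrives as server-sent events, i.e. lines of the form `data: {...}`. A minimal parsing sketch follows; the chunk payloads here are hypothetical (the actual chunk schema is linked above), and the `text` field name and the `[DONE]` sentinel are assumptions, not confirmed by this page:

```python
import json

# Hypothetical streamed body; real chunk objects follow the schema
# linked in the description above.
stream_body = (
    'data: {"text": "Hello, "}\n\n'
    'data: {"text": "how are you?"}\n\n'
    "data: [DONE]\n\n"
)

parts = []
for line in stream_body.splitlines():
    if not line.startswith("data: "):
        continue  # skip blank separator lines between events
    payload = line[len("data: "):]
    if payload == "[DONE]":  # common SSE termination sentinel (assumption)
        break
    chunk = json.loads(payload)
    parts.append(chunk["text"])

transcript = "".join(parts)
print(transcript)
```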

temperature
number | null

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

Response

Successfully transcribed the audio file.

text
string
required

The transcribed text.

usage
AudioTranscriptionUsage · object
required