Dedicated chat completions

Given a list of messages forming a conversation, the model generates a response. To request successfully, it is mandatory to enter a Friendli Token (e.g. flp_XXX) value in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this variable and visit here to generate your token. When streaming mode is used (i.e., stream option is set to true), the response is in MIME type text/event-stream. Otherwise, the content type is application/json. You can view the schema of the streamed sequence of chunk objects in streaming mode here.

Authorizations

Authorization

string

header

required

When using Friendli Suite API for inference requests, you need to provide a Friendli Token for authentication and authorization purposes.

For more detailed information, please refer here.

Headers

X-Friendli-Team

string | null

ID of team to run requests as (optional parameter).

Body

application/json

model

string

required

ID of target endpoint. If you want to send request to specific adapter, use the format "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE". Otherwise, you can just use "YOUR_ENDPOINT_ID" alone.

messages

(System · object | User · object | Assistant · object | Tool · object)[]

required

A list of messages comprising the conversation so far.

System
User
Assistant
Tool

Show child attributes

messages.role

string

required

The role of the messages author.

Allowed value: "system"

messages.content

string

required

The content of system message.

messages.name

string | null

The name for the participant to distinguish between participants with the same role.

chat_template_kwargs

Chat Template Kwargs · object

Additional keyword arguments supplied to the template renderer. These parameters will be available for use within the chat template.

eos_token

integer[] | null

A list of endpoint sentence tokens.

frequency_penalty

number | null

Number between -2.0 and 2.0. Positive values penalizes tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.

logit_bias

Logit Bias · object

Accepts a JSON object that maps tokens to an associated bias value. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model.

logprobs

boolean | null

Whether to return log probabilities of the output tokens or not.

max_tokens

integer | null

The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.

Example:

200

min_p

number | null

A scaling factor used to determine the minimum token probability threshold. This threshold is calculated as min_p multiplied by the probability of the most likely token. Tokens with probabilities below this scaled threshold are excluded from sampling. Values range from 0.0 (inclusive) to 1.0 (inclusive). Higher values result in stricter filtering, while lower values allow for greater diversity. The default value of 0.0 disables filtering, allowing all tokens to be considered for sampling.

integer | null

The number of independently generated results for the prompt. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.

parallel_tool_calls

boolean | null

Whether to enable parallel function calling.

presence_penalty

number | null

Number between -2.0 and 2.0. Positive values penalizes tokens that have been sampled at least once in the existing text.

repetition_penalty

number | null

Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be positive value (1.0 means no penalty). See keskar et al., 2019 for more details. This is similar to Hugging Face's repetition_penalty argument.

seed

Seed to control random procedure. If nothing is given, random seed is used for sampling, and return the seed along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.

stop

string[] | null

When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Defaults to empty list.

stream

boolean | null

default:false

Whether to stream generation result. When set true, each token will be sent as server-sent events once generated.

stream_options

StreamOptions · object

Options related to stream. It can only be used when stream: true.

Show child attributes

stream_options.include_usage

boolean | null

When set to true, the number of tokens used will be included at the end of the stream result in the form of "usage": {"completion_tokens": number, "prompt_tokens": number, "total_tokens": number}.

parse_reasoning

boolean | null

Parses model reasoning into reasoning_content while keeping the answer in content. Default value may vary between endpoints.

For more detailed information, please refer here.

include_reasoning

boolean | null

When parse_reasoning=true, include parsed reasoning (reasoning_content). Defaults to true.

For more detailed information, please refer here.

temperature

number | null

Sampling temperature. Smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. Defaults to 1.0. This is similar to Hugging Face's temperature argument.

tool_choice

Determines the tool calling behavior of the model. When set to none, the model will bypass tool execution and generate a response directly. In auto mode (the default), the model dynamically decides whether to call a tool or respond with a message. Alternatively, setting required ensures that the model invokes at least one tool before responding to the user. You can also specify a particular tool by {"type": "function", "function": {"name": "my_function"}}.

Show child attributes

tool_choice.type

string

required

The type of the tool. Currently, only function is supported.

Allowed value: "function"

tool_choice.function

ChatCompleteBodyToolChoiceFunction · object

required

Show child attributes

tool_choice.function.name

string

required

The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

top_k

integer | null

Limits sampling to the top k tokens with the highest probabilities. Values range from 0 (no filtering) to the model's vocabulary size (inclusive). The default value of 0 applies no filtering, allowing all tokens.

top_logprobs

integer | null

The number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.

top_p

number | null

Keeps only the smallest set of tokens whose cumulative probabilities reach top_p or higher. Values range from 0.0 (exclusive) to 1.0 (inclusive). The default value of 1.0 includes all tokens, allowing maximum diversity.

xtc_threshold

number | null

A probability threshold used to identify “top choice” tokens for exclusion in XTC (Exclude Top Choices) sampling. Tokens with probabilities at or above this threshold are considered viable candidates, and all but the least likely viable token are excluded from sampling. This option reduces the dominance of highly probable tokens while preserving some diversity by keeping the least confident “top choice.” Values range from 0.0 (inclusive) to 1.0 (inclusive). Higher values make the filtering more selective by requiring higher probabilities to trigger exclusion, while lower values apply filtering more broadly. The default value of 0.0 disables XTC filtering entirely.

xtc_probability

number | null

The probability that XTC (Exclude Top Choices) filtering will be applied for each sampling decision. When XTC is triggered, high-probability tokens above the xtc_threshold are excluded except for the least likely viable token. This stochastic activation allows for a balance between standard sampling and creativity-boosting exclusion filtering. Values range from 0.0 (inclusive) to 1.0 (inclusive), where 0.0 means XTC is never applied, 1.0 means XTC is always applied when viable tokens exist, and intermediate values provide probabilistic activation. The default value of 0.0 disables XTC filtering.

tools

Tool · object[] | null

A list of tools the model may call. Use this to provide a list of functions the model may generate JSON inputs for.

When tools is specified, min_tokens and response_format fields are unsupported.

Show child attributes

tools.type

string

required

The type of the tool. Currently, only function is supported.

Allowed value: "function"

tools.function

Function · object

required

Show child attributes

tools.function.name

string

required

The name of the function to be called. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.

tools.function.parameters

Parameters · object

required

The parameters the functions accepts, described as a JSON Schema object. To represent a function with no parameters, use the value {"type": "object", "properties": {}}.

tools.function.description

string | null

A description of what the function does, used by the model to choose when and how to call the function.

min_tokens

integer | null

The minimum number of tokens to generate. Default value is 0. This is similar to Hugging Face's min_new_tokens argument.

This field is unsupported when tools or response_format is specified.

response_format

Json Schema · object

The enforced format of the model's output.

Note that the content of the output message may be truncated if it exceeds the max_tokens. You can check this by verifying that the finish_reason of the output message is length.

For more detailed information, please refer here.

Important You must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., You are an API generating a valid JSON as output.). Otherwise, the model may result in an unending stream of whitespace or other characters.

This field is unsupported when tools is specified. When response_format is specified, min_tokens field is unsupported.

Json Schema
Json Object
Regex
Text

Show child attributes

response_format.type

string

required

The type of the response format: json_schema

Allowed value: "json_schema"

response_format.json_schema

ResponseFormatJsonSchemaSchema · object

required

Show child attributes

response_format.json_schema.schema

Schema · object

required

The schema for the response format, described as a JSON Schema object.

Response

Successfully generated a chat response.

string

required

A unique ID of the chat completion.

choices

ChatChoice · object[]

required

Show child attributes

choices.index

integer

required

The index of the choice in the list of generated choices.

choices.message

ChatChoiceMessage · object

required

Show child attributes

choices.message.role

string

required

Role of the generated message author, in this case assistant.

choices.message.content

string | null

The contents of the assistant message.

choices.message.tool_calls

ToolCallResult · object[] | null

Show child attributes

choices.message.tool_calls.type

string

required

The type of the tool.

Allowed value: "function"

choices.message.tool_calls.id

string

required

The ID of the tool call.

choices.message.tool_calls.function

FunctionResult · object

required

Show child attributes

choices.message.tool_calls.function.arguments

string

required

The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON.

choices.message.tool_calls.function.name

string

required

The name of the function to call.

choices.finish_reason

enum<string>

required

Termination condition of the generation. stop means the API returned the full chat completions generated by the model without running into any limits. length means the generation exceeded max_tokens or the conversation exceeded the max context length. tool_calls means the API has generated tool calls.

Available options:

stop,

length,

tool_calls

choices.logprobs

ChatLogprobs · object

Show child attributes

choices.logprobs.content

ChatLogprobsContent · object[] | null

A list of message content tokens with log probability information.

Show child attributes

choices.logprobs.content.token

string

required

The token.

choices.logprobs.content.logprob

number

required

The log probability of this token.

choices.logprobs.content.top_logprobs

ChatLogprobsContentTopLogprob · object[]

required

List of the most likely tokens and their log probability, at this token position.

Show child attributes

choices.logprobs.content.top_logprobs.token

string

required

The token.

choices.logprobs.content.top_logprobs.logprob

number

required

The log probability of this token.

choices.logprobs.content.top_logprobs.bytes

integer[] | null

A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be null if there is no bytes representation for the token.

choices.logprobs.content.bytes

integer[] | null

usage

ChatUsage · object

required

Show child attributes

usage.prompt_tokens

integer

required

Number of tokens in the prompt.

usage.completion_tokens

integer

required

Number of tokens in the generated chat completions.

usage.total_tokens

integer

required

Total number of tokens used in the request (prompt_tokens + completion_tokens).

object

string

required

The object type, which is always set to chat.completion.

Allowed value: "chat.completion"

created

integer

required

The Unix timestamp (in seconds) for when the generation completed.

model

string | null

The model to generate the completion. For dedicated endpoints, it returns the endpoint id.

API Reference

Dedicated

Serverless

Container

Dataset & File

Friendli SDK

Dedicated chat completions

Authorizations

Headers

Body

Response