POST /dedicated/v1/chat/completions

See available models at this pricing table.

To successfully run an inference request, you must provide a Friendli Token (e.g., flp_XXX) in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this token, and visit here to generate one.

When streaming mode is used (i.e., stream option is set to true), the response is in MIME type text/event-stream. Otherwise, the content type is application/json. You can view the schema of the streamed sequence of chunk objects in streaming mode here.
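
As a minimal sketch of the above, the following Python snippet sends a non-streaming request with the requests library. The host https://api.friendli.ai is an assumption; the token and endpoint ID are placeholders.

```python
import requests

# Placeholder values: substitute your own Friendli Token and endpoint ID.
FRIENDLI_TOKEN = "flp_XXX"
ENDPOINT_ID = "YOUR_ENDPOINT_ID"

resp = requests.post(
    "https://api.friendli.ai/dedicated/v1/chat/completions",  # assumed host; path as documented above
    headers={
        "Authorization": f"Bearer {FRIENDLI_TOKEN}",
        # "X-Friendli-Team": "YOUR_TEAM_ID",  # optional, see Headers below
    },
    json={
        "model": ENDPOINT_ID,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)

# With stream omitted (or set to false), the response body is application/json.
print(resp.json())
```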

Authorizations

Authorization
string
header
required

When using the Friendli Suite API for inference requests, you need to provide a Friendli Token for authentication and authorization purposes.

For more detailed information, please refer here.

Headers

X-Friendli-Team
string | null

ID of the team to run requests as (optional parameter).

Body

application/json
messages
object[]
required

A list of messages comprising the conversation so far.

model
string
required

ID of the target endpoint. To send a request to a specific adapter, use the format "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE". Otherwise, use "YOUR_ENDPOINT_ID" alone.
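
For example (the identifiers below are placeholders), the model field can target either the endpoint itself or one of its adapters:

```python
messages = [{"role": "user", "content": "Hello!"}]

# Target the endpoint directly.
payload = {"model": "YOUR_ENDPOINT_ID", "messages": messages}

# Target a specific adapter served by that endpoint.
payload_adapter = {"model": "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", "messages": messages}
```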

eos_token
integer[] | null

A list of end-of-sentence (EOS) token IDs.

frequency_penalty
number | null

Number between -2.0 and 2.0. Positive values penalize tokens that have already been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.

logit_bias
object | null

Accepts a JSON object that maps tokens to an associated bias value. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model.
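
A sketch of the mapping (the token IDs and bias values below are purely illustrative; actual IDs depend on the model's tokenizer):

```python
payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [{"role": "user", "content": "Pick a color."}],
    # Keys are token IDs (as strings); values are bias terms added to the logits before sampling.
    # Negative values discourage a token, positive values encourage it.
    "logit_bias": {
        "1234": -10.0,  # illustrative token ID to suppress
        "5678": 5.0,    # illustrative token ID to promote
    },
}
```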

logprobs
boolean | null

Whether to return log probabilities of the output tokens or not.

max_tokens
integer | null

The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.

min_p
number | null

A scaling factor used to determine the minimum token probability threshold. This threshold is calculated as min_p multiplied by the probability of the most likely token. Tokens with probabilities below this scaled threshold are excluded from sampling. Values range from 0.0 (inclusive) to 1.0 (inclusive). Higher values result in stricter filtering, while lower values allow for greater diversity. The default value of 0.0 disables filtering, allowing all tokens to be considered for sampling.
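
To make the threshold computation concrete, here is a small worked sketch over a made-up next-token distribution:

```python
# Illustrative next-token probabilities (not from a real model).
probs = {"the": 0.45, "a": 0.25, "his": 0.15, "zebra": 0.01}

min_p = 0.1                   # request parameter
p_max = max(probs.values())   # probability of the most likely token: 0.45
threshold = min_p * p_max     # 0.1 * 0.45 = 0.045

# Tokens whose probability falls below the scaled threshold are excluded from sampling.
kept = {token: p for token, p in probs.items() if p >= threshold}
print(kept)  # {'the': 0.45, 'a': 0.25, 'his': 0.15}
```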

min_tokens
integer | null

The minimum number of tokens to generate. Default value is 0. This is similar to Hugging Face's min_new_tokens argument.

This field is unsupported when tools or response_format is specified.

n
integer | null

The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.

parallel_tool_calls
boolean | null

Whether to enable parallel function calling.

presence_penalty
number | null

Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text.

repetition_penalty
number | null

Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be a positive value (1.0 means no penalty). See Keskar et al., 2019 for more details. This is similar to Hugging Face's repetition_penalty argument.

response_format
object | null

The enforced format of the model's output.

Note that the content of the output message may be truncated if it exceeds max_tokens. You can check this by verifying that the finish_reason of the output message is length.

For more detailed information, please refer here.

Important You must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., You are an API generating a valid JSON as output.). Otherwise, the model may produce an unending stream of whitespace or other characters.

This field is unsupported when tools is specified. When response_format is specified, min_tokens field is unsupported.
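
A sketch of the pattern described above. The exact fields of response_format follow the schema referenced above; the {"type": "json_object"} value below is an assumption used only for illustration:

```python
payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [
        # Explicitly instruct the model to emit the desired format, as noted above.
        {"role": "system", "content": "You are an API generating a valid JSON as output."},
        {"role": "user", "content": "Describe today's weather as JSON."},
    ],
    # Assumed, illustrative value; consult the response_format schema for the exact fields.
    "response_format": {"type": "json_object"},
    "max_tokens": 512,
}
# After the call, verify that choices[0].finish_reason is not "length";
# otherwise the JSON output may have been truncated by max_tokens.
```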

seed

Seed to control the random procedure. If no seed is given, a random seed is used for sampling, and the seed is returned along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.
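
For example (values are illustrative), either a single seed or one seed per generation can be supplied:

```python
messages = [{"role": "user", "content": "Hello!"}]

# One generation with a fixed seed.
payload = {"model": "YOUR_ENDPOINT_ID", "messages": messages, "seed": 42}

# Three independent generations, each controlled by its own seed.
payload_multi = {"model": "YOUR_ENDPOINT_ID", "messages": messages, "n": 3, "seed": [1, 2, 3]}
```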

stop
string[] | null

When one of the stop phrases appears in the generation result, the API stops generation. The stop phrases are excluded from the result. Defaults to an empty list.
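
For instance (the phrases below are illustrative):

```python
payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [{"role": "user", "content": "List three fruits."}],
    # Generation halts when either phrase would appear; the phrase itself is excluded from the result.
    "stop": ["\n\n", "4."],
}
```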

stream
boolean | null
default:
false

Whether to stream the generation result. When set to true, each token is sent as a server-sent event as soon as it is generated.

stream_options
object | null

Options related to streaming. Can only be used when stream is set to true.
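
A minimal streaming sketch, assuming the same host as in the earlier example and that chunks arrive as data: lines in the text/event-stream body (see the chunk-object schema referenced above for the exact fields):

```python
import json
import requests

resp = requests.post(
    "https://api.friendli.ai/dedicated/v1/chat/completions",  # assumed host
    headers={"Authorization": "Bearer flp_XXX"},
    json={
        "model": "YOUR_ENDPOINT_ID",
        "messages": [{"role": "user", "content": "Tell me a short story."}],
        "stream": True,
    },
    stream=True,  # let requests expose the SSE body incrementally
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    data = line[len(b"data: "):]
    if data == b"[DONE]":  # common SSE terminator; assumed here
        break
    chunk = json.loads(data)
    # Each chunk follows the streamed chunk-object schema referenced above.
    print(chunk)
```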

temperature
number | null

Sampling temperature. A smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. Defaults to 1.0. This is similar to Hugging Face's temperature argument.

timeout_microseconds
integer | null

Request timeout in microseconds. When the timeout is exceeded, the request fails with the HTTP 429 Too Many Requests status code. Default behavior is no timeout.

tool_choice

Determines the tool calling behavior of the model. When set to none, the model will bypass tool execution and generate a response directly. In auto mode (the default), the model dynamically decides whether to call a tool or respond with a message. Alternatively, setting required ensures that the model invokes at least one tool before responding to the user. You can also specify a particular tool by {"type": "function", "function": {"name": "my_function"}}.
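
For example, the four behaviors described above map to the following values (the function name is hypothetical):

```python
# Let the model decide whether to call a tool (default).
tool_choice = "auto"

# Bypass tool execution and answer directly.
tool_choice = "none"

# Require at least one tool call before responding.
tool_choice = "required"

# Force a call to one specific function.
tool_choice = {"type": "function", "function": {"name": "get_weather"}}
```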

tools
object[] | null

A list of tools the model may call. Currently, only functions are supported as a tool. A maximum of 128 functions is supported. Use this to provide a list of functions the model may generate JSON inputs for.

When tools is specified, min_tokens and response_format fields are unsupported.
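
A sketch of a single function tool, assuming the OpenAI-style function-tool shape (the function name and its parameters are hypothetical):

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function name
            "description": "Get the current weather for a given city.",
            "parameters": {  # JSON Schema describing the arguments the model may generate
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    # Note: min_tokens and response_format cannot be combined with tools.
}
```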

top_k
integer | null

Limits sampling to the top k tokens with the highest probabilities. Values range from 0 (no filtering) to the model's vocabulary size (inclusive). The default value of 0 applies no filtering, allowing all tokens.

top_logprobs
integer | null

The number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
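
For example, to receive the five most likely alternatives at each position (both fields must be set together, as noted above):

```python
payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [{"role": "user", "content": "Hello!"}],
    "logprobs": True,   # must be true for top_logprobs to take effect
    "top_logprobs": 5,  # number of alternatives returned per token position
}
```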

top_p
number | null

Keeps only the smallest set of tokens whose cumulative probabilities reach top_p or higher. Values range from 0.0 (exclusive) to 1.0 (inclusive). The default value of 1.0 includes all tokens, allowing maximum diversity.

Response

200
application/json
Successfully generated a chat response.
choices
object[]
required
created
integer
required

The Unix timestamp (in seconds) for when the generation completed.

id
string
required

A unique ID of the chat completion.

object
string
required

The object type, which is always set to chat.completion.

Allowed value: "chat.completion"
usage
object
required