POST /dedicated/v1/completions

Completions
curl --request POST \
  --url https://api.friendli.ai/dedicated/v1/completions \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "(endpoint-id)",
  "bad_word_tokens": [
    {
      "tokens": [
        123
      ]
    }
  ],
  "bad_words": [
    "<string>"
  ],
  "embedding_to_replace": [
    123
  ],
  "encoder_no_repeat_ngram": 123,
  "encoder_repetition_penalty": 123,
  "eos_token": [
    123
  ],
  "forced_output_tokens": [
    123
  ],
  "frequency_penalty": 123,
  "logprobs": 123,
  "max_tokens": 200,
  "max_total_tokens": 123,
  "min_p": 123,
  "min_tokens": 123,
  "min_total_tokens": 123,
  "n": 123,
  "no_repeat_ngram": 123,
  "presence_penalty": 123,
  "repetition_penalty": 123,
  "response_format": {
    "type": "<string>",
    "json_schema": {
      "schema": {}
    }
  },
  "seed": [
    123
  ],
  "stop": [
    "<string>"
  ],
  "stop_tokens": [
    {
      "tokens": [
        123
      ]
    }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "temperature": 123,
  "token_index_to_replace": [
    123
  ],
  "top_k": 1,
  "top_p": 123,
  "xtc_threshold": 123,
  "xtc_probability": 123,
  "prompt": "Say this is a test!"
}'
{
  "id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
  "model": "(endpoint-id)",
  "object": "text_completion",
  "choices": [
    {
      "index": 0,
      "seed": 42,
      "text": "This is indeed a test",
      "tokens": [
        128000,
        2028,
        374,
        13118,
        264,
        1296
      ],
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 6,
    "total_tokens": 13
  }
}
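As a reading aid, the non-streaming response above can be parsed with nothing but the standard library: the generated text lives under `choices[i].text`, the token IDs under `choices[i].tokens`, and the `usage` block is self-consistent (`total_tokens` equals `prompt_tokens` plus `completion_tokens`). A minimal sketch using the example body verbatim:

```python
import json

# Example non-streaming response body, as shown above.
raw = '''{
  "id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
  "model": "(endpoint-id)",
  "object": "text_completion",
  "choices": [
    {
      "index": 0,
      "seed": 42,
      "text": "This is indeed a test",
      "tokens": [128000, 2028, 374, 13118, 264, 1296],
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 7, "completion_tokens": 6, "total_tokens": 13}
}'''

resp = json.loads(raw)

# Generated text is under choices[i].text; token IDs under choices[i].tokens.
text = resp["choices"][0]["text"]
usage = resp["usage"]

assert resp["object"] == "text_completion"
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(text)  # This is indeed a test
```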
Generate text based on the given text prompt. To make a successful request, you must supply a Friendli Token (e.g. flp_XXX) in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this token, and visit here to generate yours. When streaming mode is used (i.e., the stream option is set to true), the response has MIME type text/event-stream; otherwise, the content type is application/json. You can view the schema of the streamed sequence of chunk objects in streaming mode here.
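The curl request above can also be issued from Python with only the standard library. The sketch below is not official client code: the endpoint ID and token are placeholders you must replace, and the `data: {...}` / `data: [DONE]` event framing used by the stream parser is an assumption based on the usual OpenAI-style text/event-stream convention, not confirmed by this page.

```python
import json
import urllib.request

API_URL = "https://api.friendli.ai/dedicated/v1/completions"

def build_request(token: str, endpoint_id: str, prompt: str,
                  stream: bool = False) -> urllib.request.Request:
    """Build the POST request. `token` is your Friendli Token (flp_XXX);
    the dedicated endpoint's ID goes in the `model` field."""
    payload = {
        "model": endpoint_id,
        "prompt": prompt,
        "max_tokens": 200,
        # stream=True switches the response to text/event-stream.
        "stream": stream,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def complete(token: str, endpoint_id: str, prompt: str) -> str:
    """Non-streaming call: one application/json body comes back."""
    with urllib.request.urlopen(build_request(token, endpoint_id, prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]

def iter_sse_chunks(lines):
    """Parse text/event-stream lines (stream=True) into chunk objects,
    assuming OpenAI-style `data: {...}` framing ended by `data: [DONE]`."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        yield json.loads(payload)
```

With a real token and endpoint ID, `complete(token, endpoint_id, "Say this is a test!")` mirrors the curl example; for streaming, pass `stream=True` and feed the response's text lines to `iter_sse_chunks`.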

Authorizations

Authorization
string
header
required

Headers

X-Friendli-Team
string | null

Body

application/json
  • CompletionsDedicatedBodyWithPrompt
  • CompletionsDedicatedBodyWithTokens
model
string
required
Examples: "(endpoint-id)"
prompt
required
Examples: "Say this is a test!"
bad_word_tokens
TokenSequence · object[] | null
bad_words
string[] | null
embedding_to_replace
number[] | null
encoder_no_repeat_ngram
integer | null
encoder_repetition_penalty
number | null
eos_token
integer[] | null
forced_output_tokens
integer[] | null
frequency_penalty
number | null
logprobs
integer | null
max_tokens
integer | null
Examples: 200
max_total_tokens
integer | null
min_p
number | null
min_tokens
integer | null
min_total_tokens
integer | null
n
integer | null
no_repeat_ngram
integer | null
presence_penalty
number | null
repetition_penalty
number | null
response_format
object | null
  • Json Schema
  • Json Object
  • Regex
  • Text
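The response_format variants above constrain the shape of the model's output. As an illustrative sketch only (the exact `type` string value, `"json_schema"` here, is an assumption inferred from the variant names, and the schema contents are invented for the example), a body using the Json Schema variant might be built like this:

```python
import json

# Illustrative body: constrain output to a JSON object with one required
# "answer" string field. The "json_schema" type string is an assumption
# inferred from the "Json Schema" variant name above.
body = {
    "model": "(endpoint-id)",
    "prompt": "Say this is a test!",
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "type": "object",
                "properties": {"answer": {"type": "string"}},
                "required": ["answer"],
            }
        },
    },
}
print(json.dumps(body, indent=2))
```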
seed
integer[] | null
stop
string[] | null
stop_tokens
TokenSequence · object[] | null
stream
boolean | null
default: false
stream_options
object | null
temperature
number | null
token_index_to_replace
integer[] | null
top_k
integer | null
Examples: 1
top_p
number | null
xtc_threshold
number | null
xtc_probability
number | null

Response

id
string
required
object
string
required
Allowed value: "text_completion"
usage
object
required
choices
CompletionsChoice · object[]
required
model
string | null