Completions
Generate text based on the given text prompt.
See available models at this pricing table.
To run an inference request, you must provide a Friendli Token (e.g. flp_XXX) in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this variable and visit here to generate your token.
When streaming mode is used (i.e., the stream option is set to true), the response is in MIME type text/event-stream. Otherwise, the content type is application/json.
You can view the schema of the streamed sequence of chunk objects in streaming mode here.
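As a rough sketch of consuming the text/event-stream response described above, the Python helper below turns `data:` lines into JSON objects. The `[DONE]` end-of-stream sentinel and the `text` field in the usage sample are assumptions made for illustration; the actual chunk schema is the one linked above.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield the JSON payload of each `data:` line in a server-sent event stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed sentinel, not confirmed by this page
            break
        yield json.loads(payload)


# Hypothetical stream content, used only to exercise the parser.
sample = ['data: {"text": "Hel"}', "", 'data: {"text": "lo"}', "", "data: [DONE]"]
print("".join(chunk["text"] for chunk in iter_sse_data(sample)))
```

With an HTTP client that exposes the response line by line, the same generator can be fed those lines directly.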
Authorizations
Headers
ID of team to run requests as (optional parameter).
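A minimal sketch of assembling the request headers in Python follows. The bearer scheme comes from the Bearer Token field mentioned above; the team header name `X-Friendli-Team` is an assumption (this page does not show it), so it is exposed as a parameter that can be changed.

```python
from typing import Dict, Optional


def build_headers(
    token: str,
    team_id: Optional[str] = None,
    team_header: str = "X-Friendli-Team",  # assumed header name, verify before use
) -> Dict[str, str]:
    """Build HTTP headers for a completions request."""
    if not token:
        raise ValueError("a Friendli Token is required")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    if team_id is not None:
        headers[team_header] = team_id  # optional: run the request as this team
    return headers
```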
Body
ID of target endpoint. To send a request to a specific adapter, use the "ENDPOINT_ID:ADAPTER_ROUTE" format.
The prompt (i.e., input text) to generate completions for. Either the prompt or the tokens field is required.
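The two rules above (the "ENDPOINT_ID:ADAPTER_ROUTE" format and the requirement that prompt or tokens be present) can be sketched as a small body builder. The `model` key for the endpoint ID is an assumption, since the field name is not shown on this page; `build_body` and `adapter_route` are hypothetical names.

```python
from typing import Any, Dict, List, Optional


def build_body(
    endpoint_id: str,
    prompt: Optional[str] = None,
    tokens: Optional[List[int]] = None,
    adapter_route: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Assemble a completions request body as a plain dict."""
    if prompt is None and tokens is None:
        raise ValueError("either prompt or tokens is required")
    # Target a specific adapter with the ENDPOINT_ID:ADAPTER_ROUTE format.
    target = f"{endpoint_id}:{adapter_route}" if adapter_route else endpoint_id
    body: Dict[str, Any] = {"model": target}  # "model" is an assumed field name
    if prompt is not None:
        body["prompt"] = prompt
    if tokens is not None:
        body["tokens"] = tokens
    body.update(options)  # any further documented options, passed through as-is
    return body
```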
Same as the bad_words field, but receives token sequences instead of text phrases. This is similar to Hugging Face's bad_words_ids argument.
Text phrases that should not be generated.
For a bad word phrase that contains N tokens, if the first N-1 tokens appear at the end of the generated result, the logit for the last token of the phrase is set to -inf.
Before checking whether a bad word is included in the result, the word is converted into tokens.
We recommend using bad_word_tokens because it is clearer.
For example, after tokenization, phrases "clear" and " clear" can result in different token sequences due to the prepended space character.
Defaults to empty list.
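To make the masking rule concrete, here is an illustrative Python sketch of the described behavior on plain lists: when the first N-1 tokens of a phrase match the end of the generated result, the last token's logit becomes -inf. It is not the service's implementation, and the function names are hypothetical.

```python
from typing import List, Sequence, Set


def banned_next_tokens(
    generated: Sequence[int], bad_word_tokens: Sequence[Sequence[int]]
) -> Set[int]:
    """Return token IDs that would complete a bad word phrase at the next step."""
    banned: Set[int] = set()
    for phrase in bad_word_tokens:
        if not phrase:
            continue
        prefix, last = list(phrase[:-1]), phrase[-1]
        if len(prefix) > len(generated):
            continue
        # A single-token phrase has an empty prefix, which always matches.
        tail = list(generated[len(generated) - len(prefix):])
        if tail == prefix:
            banned.add(last)
    return banned


def mask_logits(logits: List[float], banned: Set[int]) -> List[float]:
    """Set the logit of every banned token ID to -inf."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]
```

Because matching happens on token IDs, "clear" and " clear" are separate entries here, which is the ambiguity that bad_word_tokens avoids.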
The beam search type to use: one of DETERMINISTIC, NAIVE_SAMPLING, and STOCHASTIC. DETERMINISTIC means the standard, deterministic beam search, which is similar to Hugging Face's beam_search. Arguments for controlling random sampling such as top_k and top_p are not allowed for this option. NAIVE_SAMPLING is similar to Hugging Face's beam_sample. STOCHASTIC means stochastic beam search (more details in Kool et al. (2019)). This option is ignored if num_beams is not provided. Defaults to DETERMINISTIC.
Whether to stop the beam search when at least num_beams beams are finished with the EOS token. Only allowed for beam search. Defaults to false. This is similar to Hugging Face's early_stopping argument.
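The beam search restrictions stated across this page can be collected into a client-side check, sketched below in Python. The keys `beam_search_type`, `early_stopping`, and `length_penalty` are assumed field names (this page shows only their descriptions), and treating any provided `num_beams` as "beam search enabled" is also an assumption.

```python
from typing import Any, Dict, List

BEAM_TYPES = {"DETERMINISTIC", "NAIVE_SAMPLING", "STOCHASTIC"}


def check_beam_options(options: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with beam-search-related options."""
    errors: List[str] = []
    num_beams = options.get("num_beams")
    if num_beams is None:
        # Without num_beams, beam_search_type is ignored; these two are rejected.
        for key in ("early_stopping", "length_penalty"):
            if key in options:
                errors.append(f"{key} is only allowed with beam search")
        return errors
    if not 1 <= num_beams <= 31:
        errors.append("num_beams must be between 1 and 31 (both inclusive)")
    beam_type = options.get("beam_search_type", "DETERMINISTIC")
    if beam_type not in BEAM_TYPES:
        errors.append(f"unknown beam_search_type: {beam_type}")
    elif beam_type == "DETERMINISTIC":
        for key in ("top_k", "top_p"):
            if key in options:
                errors.append(f"{key} is not allowed with DETERMINISTIC beam search")
    if options.get("n", 1) != 1:
        errors.append("n is not supported when using beam search")
    if options.get("stream"):
        errors.append("stream is not supported when using beam search")
    if options.get("stop"):
        errors.append("stop is incompatible with beam search; use stop_tokens")
    return errors
```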
A list of flattened embedding vectors used for replacing the tokens at the specified indices provided via token_index_to_replace.
If this exceeds 1, no n-gram of that size occurring in the input token sequence can appear in the generated result. 1 means that this mechanism is disabled (i.e., you cannot prevent a 1-gram from being generated repeatedly). Only allowed for encoder-decoder models. Defaults to 1. This is similar to Hugging Face's encoder_no_repeat_ngram_size argument.
Penalizes tokens that have already appeared in the input tokens. Should be greater than or equal to 1.0. 1.0 means no penalty. Only allowed for encoder-decoder models. See Keskar et al., 2019 for more details. This is similar to Hugging Face's encoder_repetition_penalty argument.
A list of endpoint sentence tokens.
Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.
Coefficient for exponential length penalty that is used with beam search. Only allowed for beam search. Defaults to 1.0. This is similar to Hugging Face's length_penalty argument.
Include the log probabilities on the logprobs most likely output tokens, as well as the chosen tokens.
The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.
The maximum number of tokens including both the generated result and the input tokens. Only allowed for decoder-only models. Only one of max_tokens and max_total_tokens may be set. Default value is the model's maximum length. This is similar to Hugging Face's max_length argument.
The minimum number of tokens to generate. Default value is 0. This is similar to Hugging Face's min_new_tokens argument.
This field is unsupported when response_format is specified.
The minimum number of tokens including both the generated result and the input tokens. Only allowed for decoder-only models. Only one of min_tokens and min_total_tokens may be set. This is similar to Hugging Face's min_length argument.
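The mutual-exclusion rules for the length options above lend themselves to a quick client-side check before sending a request. The sketch below is one hypothetical way to do that in Python; it encodes only the three rules stated on this page and does not replace server-side validation.

```python
from typing import Any, Dict, List


def check_length_options(options: Dict[str, Any]) -> List[str]:
    """Return conflicts among the token-length options of a request body."""
    errors: List[str] = []
    if "max_tokens" in options and "max_total_tokens" in options:
        errors.append("only one of max_tokens and max_total_tokens is allowed")
    if "min_tokens" in options and "min_total_tokens" in options:
        errors.append("only one of min_tokens and min_total_tokens is allowed")
    if "min_tokens" in options and "response_format" in options:
        errors.append("min_tokens is unsupported when response_format is specified")
    return errors
```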
The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.
If this exceeds 1, every n-gram of that size can only occur once in the generated result (plus the input tokens for decoder-only models). 1 means that this mechanism is disabled (i.e., you cannot prevent a 1-gram from being generated repeatedly). Defaults to 1. This is similar to Hugging Face's no_repeat_ngram_size argument.
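As an illustration of the n-gram rule above, the following Python sketch computes which next tokens would repeat an n-gram already present in the sequence. It follows this page's convention that a size of 1 disables the mechanism; the function name is hypothetical and the service's actual implementation may differ.

```python
from typing import Sequence, Set


def no_repeat_ngram_banned(tokens: Sequence[int], ngram_size: int) -> Set[int]:
    """Return next-token IDs that would create an n-gram already seen in `tokens`."""
    if ngram_size <= 1 or len(tokens) < ngram_size - 1:
        return set()  # disabled, or not enough context to form a prefix
    # The last n-1 tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned: Set[int] = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned
```

For a decoder-only model, `tokens` would hold the input tokens followed by the tokens generated so far.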
Number of beams for beam search. Numbers between 1 and 31 (both inclusive) are allowed. Default behavior is no beam search. This is similar to Hugging Face's num_beams argument.
Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text.
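The difference between the frequency-based penalty and the presence-based penalty can be sketched on raw logits. The additive formula below (subtract count times the frequency penalty, plus the presence penalty once for any token already seen) is a commonly used formulation and is an assumption here; this page does not state the exact formula the service applies.

```python
from collections import Counter
from typing import List, Sequence


def apply_penalties(
    logits: Sequence[float],
    previous_tokens: Sequence[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> List[float]:
    """Lower the logits of tokens that already appeared (assumed additive form)."""
    for name, value in (("frequency_penalty", frequency_penalty),
                        ("presence_penalty", presence_penalty)):
        if not -2.0 <= value <= 2.0:
            raise ValueError(f"{name} must be between -2.0 and 2.0")
    out = list(logits)
    for token, count in Counter(previous_tokens).items():
        if 0 <= token < len(out):
            # Frequency term scales with the count; presence term applies once.
            out[token] -= count * frequency_penalty + presence_penalty
    return out
```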
Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0 (1.0 means no penalty). See Keskar et al., 2019 for more details. This is similar to Hugging Face's repetition_penalty argument.
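For intuition, here is a small Python sketch of a multiplicative repetition penalty in the style of Hugging Face's implementation: positive logits of seen tokens are divided by the penalty and negative ones are multiplied by it, so both become less likely. Whether the service uses exactly this form is an assumption.

```python
from typing import List, Sequence


def apply_repetition_penalty(
    logits: Sequence[float], seen_tokens: Sequence[int], penalty: float = 1.0
) -> List[float]:
    """Make already-seen tokens less likely; a penalty of 1.0 is a no-op."""
    if penalty < 1.0:
        raise ValueError("repetition_penalty should be greater than or equal to 1.0")
    out = list(logits)
    for token in set(seen_tokens):
        if 0 <= token < len(out):
            # Dividing a positive logit or multiplying a negative one lowers it.
            out[token] = out[token] / penalty if out[token] > 0 else out[token] * penalty
    return out
```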
The enforced format of the model's output.
Note that the content of the output message may be truncated if it exceeds max_tokens.
You can check this by verifying that the finish_reason of the output message is length.
Important
You must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., "You are an API generating a valid JSON as output.").
Otherwise, the model may produce an unending stream of whitespace or other characters.
When response_format is specified, the min_tokens field is unsupported.
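Since a truncated output may not be valid in the enforced format, it can help to check finish_reason before parsing. The sketch below assumes a JSON output format and takes the output text and its finish_reason as plain arguments; how those values are extracted from the response object is not shown on this page, and `parse_json_output` is a hypothetical helper.

```python
import json
from typing import Any


def parse_json_output(text: str, finish_reason: str) -> Any:
    """Parse model output as JSON, refusing output that was cut off at max_tokens."""
    if finish_reason == "length":
        raise ValueError("output was truncated at max_tokens; JSON may be incomplete")
    return json.loads(text)
```

A caller that hits the truncation error would typically retry with a larger max_tokens.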
Seed to control the random procedure. If nothing is given, the API generates a seed randomly, uses it for sampling, and returns the seed along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.
When one of the stop phrases appears in the generation result, the API will stop generation.
The stop phrases are excluded from the result.
This option is incompatible with beam search (specified by num_beams); use stop_tokens for that case instead.
Defaults to empty list.
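The effect of stop phrases on the returned text can be mimicked locally. This Python sketch cuts a string at the earliest occurrence of any stop phrase and drops the phrase itself, matching the behavior described above; it is an illustration with a hypothetical function name, not the server's code.

```python
from typing import Sequence


def truncate_at_stop(text: str, stop: Sequence[str]) -> str:
    """Return `text` up to, and excluding, the earliest stop phrase."""
    cut = len(text)
    for phrase in stop:
        if not phrase:
            continue  # an empty phrase would match everywhere
        index = text.find(phrase)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]
```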
Stop generating further tokens when a generated token matches any of the tokens in the sequence. If beam search is enabled, all of the active beams should contain the stop token to terminate generation.
Whether to stream the generation result. When set to true, each token is sent as a server-sent event once generated. Not supported when using beam search.
Options related to stream.
It can only be used when stream: true.
Sampling temperature. A smaller temperature makes the generation result closer to greedy argmax sampling (i.e., top_k = 1). Defaults to 1.0. This is similar to Hugging Face's temperature argument.
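The claim that a smaller temperature approaches greedy sampling can be checked numerically. The sketch below applies the standard temperature-scaled softmax to a toy logit vector; the logits are made up for illustration.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]
for t in (1.0, 0.5, 0.1):
    # As t shrinks, probability mass concentrates on the highest logit.
    print(t, [round(p, 4) for p in softmax_with_temperature(toy_logits, t)])
```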
Request timeout. If the request times out, the API returns the HTTP 429 Too Many Requests response status code. Default behavior is no timeout.
A list of token indices where to replace the embeddings of input tokens provided via either tokens or prompt.
The number of highest probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means that the API does not apply top-k filtering. This is similar to Hugging Face's top_k argument.
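A minimal Python sketch of top-k filtering, following the conventions stated above (0 disables filtering, and values up to the vocabulary size are accepted), is shown below. Keeping every token that ties with the k-th highest logit is a simplification chosen for this example.

```python
from typing import List, Sequence


def top_k_filter(logits: Sequence[float], top_k: int = 0) -> List[float]:
    """Keep the top_k highest logits and set the rest to -inf; 0 means no filtering."""
    if not 0 <= top_k <= len(logits):
        raise ValueError("top_k must be between 0 and the vocab size")
    if top_k == 0:
        return list(logits)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    # Tokens tied with the k-th logit are all kept (simplification).
    return [x if x >= threshold else float("-inf") for x in logits]
```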