Tool assisted chat completions (Beta)
Given a list of messages forming a conversation, the model generates a response. Additionally, the model can utilize built-in tools for tool calls, enhancing its capability to provide more comprehensive and actionable responses.
See available models at this pricing table.
To successfully run an inference request, it is mandatory to enter a Friendli Token (e.g. flp_XXX) value in the Bearer Token field. Refer to the authentication section on our introduction page to learn how to acquire this variable and visit here to generate your token.
When streaming mode is used (i.e., the stream option is set to true), the response is in MIME type text/event-stream. Otherwise, the content type is application/json.
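As an illustration of handling both content types, the sketch below decodes a response body into a list of JSON objects. It assumes each streamed event arrives as a `data: <json>` line and that the stream ends with a `data: [DONE]` sentinel, which is a common server-sent events convention and is not confirmed by this page. The helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from the `data:` lines of an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)


def decode_response(content_type: str, body: str) -> list:
    """Return a list of JSON objects for either response content type."""
    if content_type.startswith("text/event-stream"):
        return list(iter_sse_chunks(body.splitlines()))
    # Non-streaming responses carry a single application/json document.
    return [json.loads(body)]
```

A real client would usually feed `iter_sse_chunks` from the HTTP library's line iterator rather than a fully buffered body, so that chunks can be handled as they arrive.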
You can view the schema of the streamed sequence of chunk objects in streaming mode here.
This API is currently in Beta. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release.
Authorizations
When using Friendli Endpoints API for inference requests, you need to provide a Friendli Token for authentication and authorization purposes.
For more detailed information, please refer here.
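A minimal sketch of the request headers, assuming the token is sent as a standard HTTP Bearer credential in the `Authorization` header. The function name is hypothetical, and `flp_XXX` is the placeholder token from this page, not a working value.

```python
def build_auth_headers(token: str) -> dict:
    """Build request headers carrying a Friendli Token as a Bearer credential."""
    if not token:
        raise ValueError("a Friendli Token is required")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


headers = build_auth_headers("flp_XXX")  # placeholder token
```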
Headers
ID of team to run requests as (optional parameter).
Body
Code of the model to use. See available model list.
A list of messages comprising the conversation so far.
A list of end-of-sentence (EOS) tokens.
Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim.
The maximum number of tokens to generate. For decoder-only models like GPT, the length of your input tokens plus max_tokens should not exceed the model's maximum length (e.g., 2048 for OpenAI GPT-3). For encoder-decoder models like T5 or BlenderBot, max_tokens should not exceed the model's maximum output length. This is similar to Hugging Face's max_new_tokens argument.
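The decoder-only constraint above is simple arithmetic, sketched here with hypothetical helper names. The prompt token count must come from the model's own tokenizer, which this example does not cover.

```python
def fits_context(prompt_tokens: int, max_tokens: int, model_max_length: int) -> bool:
    """True if the prompt plus the requested generation fits a decoder-only model."""
    return prompt_tokens + max_tokens <= model_max_length


def clamp_max_tokens(prompt_tokens: int, requested: int, model_max_length: int) -> int:
    """Reduce a requested max_tokens so the total stays within the model limit."""
    remaining = model_max_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the model's maximum length")
    return min(requested, remaining)
```

For example, with a 1500-token prompt and a maximum length of 2048, a request for 1000 tokens would be clamped to 548.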
The minimum number of tokens to generate. Default value is 0. This is similar to Hugging Face's min_new_tokens argument.
This field is unsupported when tools are specified.
The number of independently generated results for the prompt. Not supported when using beam search. Defaults to 1. This is similar to Hugging Face's num_return_sequences argument.
Whether to enable parallel function calling.
Number between -2.0 and 2.0. Positive values penalize tokens that have been sampled at least once in the existing text.
Penalizes tokens that have already appeared in the generated result (plus the input tokens for decoder-only models). Should be greater than or equal to 1.0 (1.0 means no penalty). See Keskar et al., 2019 for more details. This is similar to Hugging Face's repetition_penalty argument.
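To make the three penalties concrete, here is a minimal sketch of how such adjustments are commonly applied to logits before sampling. It is an assumption, not the documented behavior of this API: the frequency and presence terms follow the widely used subtractive form, the repetition penalty follows the divide-or-multiply form used in Hugging Face's implementation of Keskar et al., and the parameter names `frequency_penalty` and `presence_penalty` are assumed because this page does not show them.

```python
from collections import Counter


def apply_penalties(
    logits: dict,
    seen_tokens: list,
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> dict:
    """Return a copy of `logits` with penalties applied to already-seen tokens."""
    if repetition_penalty < 1.0:
        raise ValueError("repetition_penalty must be >= 1.0")
    counts = Counter(seen_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Repetition penalty: shrink positive scores, push negative ones lower.
        score = score / repetition_penalty if score > 0 else score * repetition_penalty
        # Frequency penalty grows with how often the token has appeared.
        score -= count * frequency_penalty
        # Presence penalty is a flat cost for having appeared at least once.
        score -= presence_penalty
        adjusted[token] = score
    return adjusted
```

With the defaults (0.0, 0.0, 1.0) the logits come back unchanged, which matches the "no penalty" settings described above.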
The enforced format of the model's output.
Note that the content of the output message may be truncated if it exceeds max_tokens. You can check this by verifying that the finish_reason of the output message is length.
Important
You must explicitly instruct the model to produce the desired output format using a system prompt or user message (e.g., You are an API generating a valid JSON as output.). Otherwise, the model may produce an unending stream of whitespace or other characters.
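The truncation check described above can be sketched as follows. The shape of `choice` (a dict with `finish_reason` and `message.content`) is an assumption based on the common chat-completions layout and is not spelled out on this page. The function name is hypothetical.

```python
import json


def parse_json_output(choice: dict):
    """Parse a JSON-formatted completion, refusing output cut off by max_tokens."""
    if choice.get("finish_reason") == "length":
        raise ValueError("output was truncated at max_tokens; JSON may be incomplete")
    return json.loads(choice["message"]["content"])
```

Checking `finish_reason` first gives a clearer error than letting `json.loads` fail on a half-written document.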
Enable to continue text generation even after an error occurs during a tool call.
Note that enabling this option may use more tokens, as the system generates additional content to handle errors gracefully. However, if the system fails more than 8 times, the generation will stop regardless.
Tip
This is useful in scenarios where you want to maintain text generation flow despite errors, such as when generating long-form content. The user will not be interrupted by tool call issues, ensuring a smoother experience.
Seed to control the random procedure. If no seed is given, a random seed is used for sampling and is returned along with the generated result. When using the n argument, you can pass a list of seed values to control all of the independent generations.
When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Defaults to empty list.
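The stop-phrase semantics can be illustrated client-side: the text ends at the earliest occurrence of any stop phrase, and the phrase itself is dropped. This is only a sketch of the described behavior with a hypothetical helper name. The API applies the rule during generation, so no post-processing like this is needed in practice.

```python
def truncate_at_stop(text: str, stop: list) -> str:
    """Cut `text` at the earliest stop phrase, excluding the phrase itself."""
    cut = len(text)
    for phrase in stop:
        if not phrase:
            continue  # ignore empty phrases, which would match everywhere
        index = text.find(phrase)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]
```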
Whether to stream the generation result. When set to true, each token is sent as a server-sent event as soon as it is generated.
Sampling temperature. Smaller temperature makes the generation result closer to greedy, argmax (i.e., top_k = 1) sampling. Defaults to 1.0. This is similar to Hugging Face's temperature argument.
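The effect of temperature can be seen in a small softmax sketch. It illustrates the standard technique of dividing logits by the temperature before normalizing and makes no claim about how the API implements sampling internally.

```python
import math


def softmax_with_temperature(logits: list, temperature: float = 1.0) -> list:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lowering the temperature concentrates probability on the highest-scoring token, which is why small values approach greedy decoding.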
Request timeout. If the timeout is exceeded, the API returns the HTTP 429 Too Many Requests response status code. Default behavior is no timeout.
Determines the tool calling behavior of the model.
When set to none, the model will bypass tool execution and generate a response directly.
In auto mode (the default), the model dynamically decides whether to call a tool or respond with a message.
Alternatively, setting required ensures that the model invokes at least one tool before responding to the user.
You can also specify a particular tool by {"type": "function", "function": {"name": "my_function"}}.
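The accepted values can be captured in a small helper, sketched below. The function name `make_tool_choice` is hypothetical. The three mode strings and the object form for a specific tool are taken from the description above.

```python
from typing import Optional, Union

VALID_MODES = ("none", "auto", "required")


def make_tool_choice(mode: str = "auto", function_name: Optional[str] = None) -> Union[str, dict]:
    """Return a tool-choice value: a mode string or a specific-function object."""
    if function_name is not None:
        # Force the model to call one particular function.
        return {"type": "function", "function": {"name": function_name}}
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    return mode
```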
A list of tools the model may call. A maximum of 128 functions is supported. Use this to provide a list of functions the model may generate JSON inputs for. For more detailed information about each tool, please refer here.
When tools are specified, the min_tokens field is unsupported.
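A client can check both constraints before sending a request. The sketch below is a hypothetical pre-flight validator. It treats every entry in `tools` as counting toward the 128 limit, which is a simplifying assumption, and the tool object used in the usage lines is a placeholder rather than a real tool type.

```python
MAX_TOOLS = 128  # limit stated in the parameter description


def validate_tool_fields(body: dict) -> None:
    """Raise ValueError if the request body violates the documented tool limits."""
    tools = body.get("tools") or []
    if len(tools) > MAX_TOOLS:
        raise ValueError(f"at most {MAX_TOOLS} tools are supported, got {len(tools)}")
    if tools and "min_tokens" in body:
        raise ValueError("min_tokens is unsupported when tools are specified")


# Usage with a placeholder tool entry (hypothetical type name).
request_body = {"tools": [{"type": "example:tool"}], "max_tokens": 256}
validate_tool_fields(request_body)  # passes silently
```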
The number of highest probability tokens to keep for sampling. Numbers between 0 and the vocab size of the model (both inclusive) are allowed. The default value is 0, which means that the API does not apply top-k filtering. This is similar to Hugging Face's top_k argument.
Tokens comprising the top top_p probability mass are kept for sampling. Numbers between 0.0 (exclusive) and 1.0 (inclusive) are allowed. Defaults to 1.0. This is similar to Hugging Face's top_p argument.
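The two filters can be combined as in the sketch below, which applies top-k first and then nucleus (top-p) filtering on the renormalized distribution. This ordering mirrors common practice and is an assumption here, since the page does not state how the API orders the two steps. The function names are hypothetical.

```python
def _normalize(pairs: list) -> list:
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]


def filter_top_k_top_p(probs: dict, top_k: int = 0, top_p: float = 1.0) -> dict:
    """Keep the top_k most likely tokens, then the smallest set covering top_p mass."""
    if top_k < 0:
        raise ValueError("top_k must be >= 0")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0.0, 1.0]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:  # 0 disables top-k filtering
        ranked = ranked[:top_k]
    ranked = _normalize(ranked)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus now covers the requested probability mass
    return dict(_normalize(kept))
```

With the defaults (`top_k=0`, `top_p=1.0`) every token is kept, matching the "no filtering" defaults described above.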
Response
A server-sent event containing chat completions content.