Agents
Run multi-step, tool-using agents that stay fast, coherent, and cost-efficient across thousands of chained calls.

problem
Chained reasoning compounds latency and cost
Unbounded context growth
Agent memory, tool outputs, and intermediate results accumulate. Most providers degrade or truncate, causing agents to lose coherence mid-task.
Compounded latency
A single task often triggers 6–15 model calls. Modest per-call overhead compounds into seconds of added latency.
No graceful recovery
Agents left running for hours encounter timeouts and dropped generations, and the entire job must restart.
Unpredictable costs
Stateful, tool-using agents consume far more tokens than single-turn requests. Without efficient serving, economics break down as complexity grows.
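The call pattern behind these numbers can be sketched as a minimal agent loop. This is an illustrative stub, not FriendliAI code: the model and tool are faked, and all names are hypothetical. It shows how each step appends to the message history (context growth) and issues another model call (compounding latency).

```python
import time

def stub_model(messages):
    """Stand-in for a model call; each real call adds network + decode latency."""
    time.sleep(0.01)  # pretend per-call overhead
    steps_so_far = sum(1 for m in messages if m["role"] == "assistant")
    if steps_so_far < 5:
        return {"role": "assistant", "tool_call": "search", "content": f"step {steps_so_far}"}
    return {"role": "assistant", "tool_call": None, "content": "final answer"}

def stub_tool(name, arg):
    """Stand-in for a tool; its output is appended to the growing context."""
    return f"{name} result for {arg}"

def run_agent(task, max_steps=15):
    messages = [{"role": "user", "content": task}]
    calls = 0
    while calls < max_steps:
        reply = stub_model(messages)   # one chained model call
        calls += 1
        messages.append(reply)         # context grows every step
        if reply["tool_call"] is None:
            return reply["content"], calls
        messages.append({"role": "tool",
                         "content": stub_tool(reply["tool_call"], reply["content"])})
    return None, calls

answer, calls = run_agent("summarize the quarterly report")
```

Even this toy task needs six chained calls before it finishes, and the `messages` list it carries grows on every iteration, which is exactly where per-call overhead and context size start to compound.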

solution
FriendliAI keeps agents fast, coherent, and cost-efficient
Reliable long-context handling
Memory-efficient KV cache sustains full context fidelity as history grows, eliminating truncation and lost state.
Low-latency token generation across chained calls
Speculative decoding and an optimized pipeline minimize per-call latency.
No dropped generations mid-task
Efficient KV-cache management combined with continuous batching keeps outputs flowing uninterrupted, so long-running agent jobs complete without timeouts or restarts.
Cost-efficient execution
Continuous batching and high GPU utilization keep per-token costs low as task complexity, token volume, and concurrent sessions scale.
Open models are made for agents
FriendliAI supports the leading open models purpose-built for agentic workloads — optimized for multi-step reasoning, tool use, and long-context execution out of the box.
Have a custom or fine-tuned model?
We'll help you deploy it just as easily. Contact us to get started.
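For agent frameworks, a model call is a standard OpenAI-compatible chat-completions request. The sketch below only constructs such a request body; the base URL, model ID, and tool schema are illustrative assumptions, so check the FriendliAI docs for current endpoint paths and model names before sending anything.

```python
import json

# Illustrative values; consult the FriendliAI docs for the current endpoint and model IDs.
BASE_URL = "https://api.friendli.ai/serverless/v1"  # assumed endpoint, verify in docs

request = {
    "model": "meta-llama-3.1-8b-instruct",  # hypothetical model ID
    "messages": [
        {"role": "system", "content": "You are a research agent."},
        {"role": "user", "content": "Find the latest GPU pricing."},
    ],
    # Tool definitions use the standard chat-completions function-calling schema.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "web_search",  # hypothetical tool
                "description": "Search the web for a query.",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
}

body = json.dumps(request)  # this is the payload an agent would POST each step
```

Because the format matches the chat-completions convention, existing agent frameworks that speak that API can typically point at a different base URL without code changes.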
How teams scale with FriendliAI
Learn how leading companies achieve unmatched performance, scalability, and reliability with FriendliAI.
Our custom model API went live in about a day with enterprise-grade monitoring built in.
Scale to trillions of tokens with 50% fewer GPUs, thanks to FriendliAI.
Rock-solid reliability with ultra-low tail latency.
Cutting GPU costs accelerated our path to profitability.
Fluctuating traffic is no longer a concern because autoscaling just works.
Resources
Docs, demos, and guides for building agents.



