- April 20, 2026
- 4 min read
GLM-5.1 on FriendliAI: The Long-Horizon Agentic Engineering Model at Peak Performance
- GLM-5.1 by Z.ai is currently the #1 open-weight model for agentic software engineering and long-horizon task execution.
- The new model exceeds the performance of Claude Opus 4.6 on coding benchmarks like SWE-Bench Pro and CyberGym.
- The model is capable of improving its results, re-evaluating its thinking, and adapting its strategy after running for hours, over hundreds of iterations and thousands of tool calls.
- According to Artificial Analysis and OpenRouter, FriendliAI delivers industry-leading performance for GLM-5.1 across output speed, latency, tool calling, and structured outputs, compared to other serverless model APIs.
- We’re proud to collaborate with Z.ai as a Day 0 launch partner, providing Serverless Endpoints and Dedicated Endpoints for GLM-5.1.

GLM-5.1 is Z.ai’s new open-weight, long-horizon agentic engineering model, exceeding the performance of Claude Opus 4.6 on coding benchmarks like SWE-Bench Pro and CyberGym at a fraction of the cost. GLM-5.1 is equally capable of executing long-horizon tasks and improving the quality of its responses after working for hours across hundreds of iterations and thousands of tool calls.
FriendliAI provides industry-leading performance compared to other serverless model APIs hosting this new frontier model, as measured by Artificial Analysis and OpenRouter. We’re proud to collaborate with Z.ai as a Day 0 launch partner, providing Serverless Endpoints and Dedicated Endpoints for GLM-5.1.
Try GLM-5.1 on FriendliAI now.
What’s Amazing About GLM-5.1
Long-Horizon Task Execution
GLM-5.1 shares a similar Mixture-of-Experts architecture with GLM-5, featuring approximately 744 billion total parameters and 40 billion active parameters. Legacy models apply common techniques to deliver incremental performance improvements, but they often stagnate after the first pass, even when reasoning is enabled. By contrast, GLM-5.1 can re-evaluate its thinking and adapt its strategy through repeated iteration, sustaining optimization over hundreds of rounds and thousands of tool calls within an eight-hour period. The model exercises superior judgment on ambiguous challenges and maintains productivity over extended periods of time.
Agentic Software Engineering
GLM-5.1 can divide problems into subtasks, test variables during experimentation, analyze results, and identify root causes – all crucial for software engineering and agentic coding. Unlike other models, its response quality improves over longer periods of time. GLM-5.1 ranks #1 in software engineering among open-weight models and #3 globally alongside GPT-5.4 and Claude Opus 4.6 across benchmarks, exceeding their performance on SWE-Bench Pro and CyberGym.

Industry-Leading Inference Performance for the Long-Horizon Agentic Engineering Model
FriendliAI serves high-performance inference for open-weight models, including GLM-5.1, on Serverless and Dedicated Endpoints – leading many categories in public leaderboards on Artificial Analysis and OpenRouter.
Output Speed
According to Artificial Analysis, FriendliAI delivers the highest output speed in tokens per second for GLM-5.1, compared to all other inference providers hosting the model.

Time-to-First-Token Latency
FriendliAI ranks #1 among inference providers for the lowest time-to-first-token latency, measured in seconds. As the chart by Artificial Analysis shows, lower latency is better.

End-to-End Response Times
FriendliAI also delivers the lowest end-to-end response times, which include time to process 10,000 input tokens, thinking time (when reasoning is enabled), and time to output 500 tokens. See the chart published by Artificial Analysis below.

Tool Calling and Structured Outputs
According to OpenRouter, FriendliAI is the most highly rated inference provider for tool calling and structured outputs, with the lowest error rates across both categories. Note that FriendliAI is one of the few inference providers that supports tool calling and structured outputs for GLM-5.1.
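As a sketch of what a tool-calling request might look like, the snippet below builds a payload in the common OpenAI-style chat-completions shape. The model ID, the `get_weather` tool, and its schema are our assumptions for illustration, not FriendliAI's documented values; check the Friendli docs for the exact request format.

```python
import json

# Assumed model ID -- verify against the Friendli model catalog.
MODEL_ID = "zai-org/GLM-5.1"

def build_tool_call_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload that exposes a
    single (hypothetical) weather-lookup tool to the model."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool, for illustration only
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

payload = build_tool_call_payload("What's the weather in Seoul?")
print(json.dumps(payload, indent=2))
```

When the model decides to call the tool, the response carries a structured `tool_calls` entry instead of plain text, which your application executes and feeds back as a follow-up message.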


Run GLM-5.1 on FriendliAI
Getting Started
To deploy GLM-5.1 on Serverless Endpoints…
- Create a Friendli account
- Select GLM-5.1 in our model catalog
- Create an API key for your Serverless Endpoints
- Configure your deployment
- Save your Friendli API key
Serverless Endpoints are priced per token: $1.40 per million input tokens, $0.26 per million cached input tokens, and $4.40 per million output tokens. For pricing on Dedicated Endpoints with compute at scale, please request to speak with a Friendli engineer.
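With those per-token rates, a back-of-the-envelope cost estimate is a one-liner. This sketch simply hard-codes the Serverless Endpoint prices quoted above:

```python
# Serverless Endpoint rates quoted above, in USD per million tokens.
INPUT_RATE = 1.40
CACHED_INPUT_RATE = 0.26
OUTPUT_RATE = 4.40

def estimate_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a GLM-5.1 workload on Serverless Endpoints."""
    return (
        input_tokens * INPUT_RATE
        + cached_input_tokens * CACHED_INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000

# e.g. one million input tokens plus one million output tokens:
print(f"${estimate_cost(1_000_000, 0, 1_000_000):.2f}")  # $5.80
```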
Example: Web Application Development
In the following example, GLM-5.1 is tasked with developing a browser-based web application from a natural-language prompt. Here’s how you can try it, too.
Enter the API key
Add the Friendli API key directly in your CLI to keep it secure.
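For example, you can export the key as an environment variable for the current shell session rather than hard-coding it in scripts. `FRIENDLI_TOKEN` is our assumed variable name; check the Friendli docs for the name the SDK actually reads.

```shell
# Keep the key out of your source files by exporting it for this session only.
export FRIENDLI_TOKEN="<your-api-key>"
```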
Sample Request
Run this script to build a fully functional web application as a single HTML file.
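As a minimal sketch of what such a script could look like: the endpoint URL, model ID, and prompt below are our assumptions for illustration, not the exact script used in this example. It builds the request with Python's standard library and only sends it when you uncomment the final lines with a valid key.

```python
import json
import os
import urllib.request

# Assumed endpoint and model ID -- consult the Friendli docs for exact values.
API_URL = "https://api.friendli.ai/serverless/v1/chat/completions"
MODEL_ID = "zai-org/GLM-5.1"

PROMPT = (
    "Build a fully functional to-do web application as a single HTML file, "
    "with all CSS and JavaScript inlined."
)

def build_request(token: str) -> urllib.request.Request:
    """Build (but do not send) the chat-completions request."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": PROMPT}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(os.environ.get("FRIENDLI_TOKEN", "demo-key"))
# To actually send it (requires a valid key):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```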
Sample Response
Here are the opening lines of a successful response followed by a lengthy HTML output.
Direct the Output to HTML
Save the HTML output to a file and open the web application in your browser.
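One way to script that step (a sketch; the extraction heuristic and file name are ours, not part of the original example) is to pull the HTML document out of the response text, write it to disk, and open it with the standard-library `webbrowser` module:

```python
import re
import webbrowser
from pathlib import Path

def extract_html(response_text: str) -> str:
    """Pull the HTML document out of a model response that may include
    prose before the markup begins."""
    match = re.search(r"<!DOCTYPE html.*", response_text, re.IGNORECASE | re.DOTALL)
    if not match:
        raise ValueError("no HTML document found in response")
    return match.group(0)

# Stand-in for a real model response: opening lines, then the HTML.
sample = "Here is your app:\n<!DOCTYPE html>\n<html><body>Hello</body></html>"
html = extract_html(sample)
Path("app.html").write_text(html, encoding="utf-8")
# webbrowser.open("app.html")  # uncomment to launch it in your default browser
```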
This is the resulting output from GLM-5.1, a functional and well-designed web app.

Try GLM-5.1 on FriendliAI
GLM-5.1 is the #1 open-weight model for long-horizon agentic engineering, and FriendliAI leads in performance metrics across throughput, latency, tool calling, and structured outputs on public leaderboards published by OpenRouter and Artificial Analysis. Try it now on our Serverless Endpoints, or contact our team to reserve large-scale capacity for the model with our Dedicated Endpoints.
Written by
FriendliAI Tech & Research
Share
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 540,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you straight to our model deployment page for one-click deployment. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

