• April 20, 2026
  • 4 min read

GLM-5.1 on FriendliAI: The Long-Horizon Agentic Engineering Model at Peak Performance

TL;DR
  • GLM-5.1 by Z.ai is currently the #1 open-weight model for agentic software engineering and long-horizon task execution.
  • It exceeds the performance of Claude Opus 4.6 on coding benchmarks such as SWE-Bench Pro and CyberGym.
  • The model is capable of improving its results, re-evaluating its thinking, and adapting its strategy after running for hours, over hundreds of iterations and thousands of tool calls.
  • According to Artificial Analysis and OpenRouter, FriendliAI delivers industry-leading performance for GLM-5.1 across output speed, latency, tool calling, and structured outputs, compared to other serverless model APIs.
  • We’re proud to collaborate with Z.ai as a Day 0 launch partner, providing Serverless Endpoints and Dedicated Endpoints for GLM-5.1.

GLM-5.1 is Z.ai’s new open-weight, long-horizon agentic engineering model exceeding the performance of Claude Opus 4.6 on coding benchmarks like SWE-Bench Pro and CyberGym at a fraction of the cost. GLM-5.1 is equally capable of executing long-horizon tasks and improving the quality of responses after working for hours, over hundreds of iterations, and thousands of tool calls.

FriendliAI provides industry-leading performance compared to other serverless model APIs hosting this new frontier model, as measured by Artificial Analysis and OpenRouter. We’re proud to collaborate with Z.ai as a Day 0 launch partner, providing Serverless Endpoints and Dedicated Endpoints for GLM-5.1.

Try GLM-5.1 on FriendliAI now.

What’s Amazing About GLM-5.1

Long-Horizon Task Execution

GLM-5.1 uses a Mixture-of-Experts architecture similar to GLM-5's, with approximately 744 billion total parameters and 40 billion active parameters. Legacy models apply common techniques to deliver incremental performance improvements, but they often stagnate after the first pass, even when reasoning is activated. By contrast, GLM-5.1 can re-evaluate its thinking and adapt its strategy through repeated iteration, sustaining optimization over hundreds of rounds and thousands of tool calls within an eight-hour period. The model exercises superior judgment on ambiguous challenges and maintains productivity over extended periods of time.

Agentic Software Engineering

GLM-5.1 can divide problems into subtasks, test variables during experimentation, analyze results, and identify root causes, all of which are crucial for software engineering and agentic coding. Unlike other models, its response quality improves over longer periods of time. GLM-5.1 ranks #1 in software engineering among open-weight models and #3 globally across benchmarks, exceeding GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro and CyberGym.

GLM-5.1’s performance across 3 coding benchmarks, published by Z.ai.

Industry-Leading Inference Performance for the Long-Horizon Agentic Engineering Model

FriendliAI serves high-performance inference for open-weight models, including GLM-5.1, on Serverless and Dedicated Endpoints – leading many categories in public leaderboards on Artificial Analysis and OpenRouter.

Output Speed

According to Artificial Analysis, FriendliAI delivers the greatest number of output tokens per second for GLM-5.1, compared to all other inference providers hosting the model.

Output speed for GLM-5.1 (reasoning), reported by Artificial Analysis on April 20, 2026

Time-to-First-Token Latency

FriendliAI ranks #1 among inference providers for the lowest time-to-first-token latency, measured in seconds. In the chart by Artificial Analysis, lower latency is better.

Time-to-first-token latency for GLM-5.1 (reasoning), reported by Artificial Analysis on April 20, 2026.

End-to-End Response Times

FriendliAI also delivers the lowest end-to-end response times, which include time to process 10,000 input tokens, thinking time (when reasoning is enabled), and time to output 500 tokens. See the chart published by Artificial Analysis below.

End-to-end-response times for GLM-5.1 (reasoning), reported by Artificial Analysis on April 20, 2026.

Tool Calling and Structured Outputs

According to OpenRouter, FriendliAI is the most highly rated inference provider for tool calling and structured outputs, with the lowest error rates across both categories. Note that FriendliAI is one of the few inference providers that supports tool calling and structured outputs for GLM-5.1.

Tool Call Error Rates for GLM-5.1, reported by OpenRouter on April 20, 2026.
Structured Output Error Rates for GLM-5.1, reported by OpenRouter on April 20, 2026.
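Both capabilities are exposed through FriendliAI's OpenAI-compatible API. Below is a minimal sketch of how tool-calling and structured-output request bodies can be assembled, assuming the standard OpenAI Chat Completions conventions (`tools`, `tool_choice`, and the `json_schema` response format); the `get_weather` tool and `weather_report` schema are hypothetical examples for illustration, not part of any FriendliAI documentation.

```python
# Hypothetical weather-lookup tool, defined in the OpenAI Chat Completions
# "tools" schema convention that OpenAI-compatible endpoints accept.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_tool_call_request(prompt: str) -> dict:
    """Assemble the body of a tool-calling chat completion request."""
    return {
        "model": "zai-org/GLM-5.1",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }

def build_structured_output_request(prompt: str) -> dict:
    """Assemble a request that constrains the reply to a JSON schema."""
    return {
        "model": "zai-org/GLM-5.1",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "weather_report",  # hypothetical schema name
                "schema": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "temperature_c": {"type": "number"},
                    },
                    "required": ["city", "temperature_c"],
                },
            },
        },
    }

# To send either request, pass the same fields to the OpenAI client pointed
# at FriendliAI, e.g.:
#   client = OpenAI(api_key=..., base_url="https://api.friendli.ai/serverless/v1")
#   resp = client.chat.completions.create(**build_tool_call_request("Weather in Seoul?"))
#   args = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
```

Whether the model emits a tool call for a given prompt is its own decision under `tool_choice="auto"`; in practice you inspect `message.tool_calls` on the response and fall back to `message.content` when it is empty.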

Run GLM-5.1 on FriendliAI

Getting Started

To deploy GLM-5.1 on Serverless Endpoints:

  1. Create a Friendli account
  2. Select GLM-5.1 in our model catalog
  3. Create an API key for your Serverless Endpoints
  4. Configure your deployment
  5. Save your Friendli API key

Serverless Endpoints are priced per token: $1.40 per million input tokens, $0.26 per million cached input tokens, and $4.40 per million output tokens. For pricing on Dedicated Endpoints with compute at scale, please request to speak with a Friendli engineer.
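As a quick sketch of what those rates mean in practice, the helper below estimates per-request cost from the published prices; the assumptions that reasoning tokens are billed at the output rate and that the cached-input discount applies to repeated prefixes are mine, not Friendli documentation.

```python
# Serverless Endpoints pricing for GLM-5.1, in USD per million tokens (from this post).
INPUT_PER_M = 1.40
CACHED_INPUT_PER_M = 0.26
OUTPUT_PER_M = 4.40

def request_cost(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Estimated USD cost of one request, assuming reasoning tokens bill as output."""
    fresh_input = input_tokens - cached_input_tokens
    return (
        fresh_input * INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    ) / 1_000_000

# The Artificial Analysis workload cited above (10,000 input tokens, 500 output tokens):
print(f"${request_cost(10_000, 500):.4f}")  # → $0.0162
```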

Example: Web Application Development

In the following example, GLM-5.1 is tasked with developing, from a natural language prompt, a web application that runs in the browser. Here’s how you can try it, too.

Enter the API key

Export the Friendli API key as an environment variable in your shell to keep it out of your code.

shell
export FRIENDLI_API_KEY="your-api-key-here"

Sample Request

Run this script to build a fully functional web application as a single HTML file.

python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_API_KEY"],
    base_url="https://api.friendli.ai/serverless/v1",
)

stream = client.chat.completions.create(
    model="zai-org/GLM-5.1",
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert web developer. "
                "Given a natural language prompt, generate a fully functional "
                "single-page web application using HTML, CSS, and JavaScript. "
                "Output only valid, self-contained code with no external dependencies."
            ),
        },
        {
            "role": "user",
            "content": (
                "Build a single-page web application with "
                "a responsive navigation bar, a main content area, "
                "a sidebar, and a dark theme using CSS variables."
            ),
        },
    ],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Sample Response

Here are the opening lines of a successful response; the full response continues with a lengthy HTML output.

html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Nexus Dashboard</title>

Direct the Output to HTML

Redirect the HTML output to a file and open the web application in your browser.

bash
python "YOUR_FILE.PY" > sample-app.html  # Replace with the name of your Python script.
open sample-app.html                     # `open` is macOS-specific; use `start` on Windows or `xdg-open` on Linux.

This is the resulting output from GLM-5.1, a functional and well-designed web app.

Web app developed by GLM-5.1 using a simple natural language prompt.

Try GLM-5.1 on FriendliAI

GLM-5.1 is the #1 open-weight model for long-horizon agentic engineering, and FriendliAI leads in performance metrics across throughput, latency, tool calling, and structured outputs on public leaderboards published by OpenRouter and Artificial Analysis. Try it now on our Serverless Endpoints, or contact our team to reserve large-scale capacity for the model with our Dedicated Endpoints.


Written by

FriendliAI Tech & Research


Share


General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 540,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub gives you a one-click deploy that takes you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.


Explore FriendliAI today