May 19, 2026

Accelerating Inference on Friendli Dedicated Endpoints with Draft-Model Speculative Decoding

Q: What is FriendliAI?

FriendliAI is the Frontier Inference Cloud for Agents, delivering high throughput, low latency, and reliability at scale for agentic workloads. Through vertically optimized inference infrastructure, it delivers 2–5× faster output token speed and a 99.99% uptime SLA for high-volume production traffic.

TL;DR

Draft-model speculative decoding is now available on Friendli Dedicated Endpoints with a single toggle at endpoint creation and no code changes.
A smaller draft model proposes candidate tokens that the target model verifies in a single parallel forward pass, preserving output quality.
FriendliAI trains and automatically pairs draft models with supported targets including Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, DeepSeek-V3.2, MiniMax-M2.5, GLM-5, and GLM-5.1.
Unlike N-gram speculative decoding, the draft model generalizes beyond literal repetition, delivering accurate candidates on diverse, open-ended workloads.
Best suited for agentic pipelines, long-form reasoning and structured outputs, and code completion where sequential decoding dominates latency.

Large language models typically generate text autoregressively, producing one token at a time through repeated decoding steps. As generations become longer, these sequential decoding iterations increasingly dominate inference latency. Speculative decoding mitigates this bottleneck by proposing multiple candidate tokens ahead of time and verifying them in parallel, reducing the number of expensive target-model decoding passes during generation.

Draft-model speculative decoding uses a smaller, faster draft model to propose candidate tokens that the target model verifies in parallel. This capability is now available on Friendli Dedicated Endpoints alongside training-free approaches such as N-gram speculative decoding, and can be enabled with a simple toggle in the user interface. For supported models, it uses pre-trained draft models to increase output speed (i.e., reduce time-per-output-token) without requiring changes to application code.

Enable speculative decoding through a toggle in the user interface

How draft-model speculative decoding works

Standard autoregressive decoding runs one target-model forward pass per generated token, making generation fundamentally sequential. Draft-model speculative decoding breaks this loop: a smaller, faster draft model proposes multiple subsequent tokens, and the target model verifies them in a single parallel forward pass over the candidate sequence.
Each proposed token is accepted or rejected according to the target's own next-token distribution, using a verification rule designed to preserve the target model’s original output distribution. When draft tokens are accepted, the system emits multiple tokens from a single target-model forward pass. When rejected, the remaining candidates are discarded, and decoding resumes from a token sampled from the adjusted residual distribution.
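
The accept-or-reject step described above can be sketched in a few lines of Python. This is a minimal illustration of the standard speculative sampling rule, not FriendliAI's implementation: the function names, the dictionary representation of distributions, and the toy token ids are all assumptions made for the example. Each draft token is accepted with probability min(1, p/q), where p is the target's probability for that token and q is the draft's. On the first rejection, a replacement is sampled from the normalized residual max(0, p - q).

```python
import random


def sample(weights, rng):
    """Sample one token id from a dict mapping token id -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Return the tokens emitted by one verification step (hypothetical helper).

    draft_tokens: k token ids sampled from the draft model.
    q_dists: k draft distributions (dict token id -> probability).
    p_dists: k + 1 target distributions, as produced by one parallel pass.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        # random.choices normalizes the weights, so no explicit division is needed.
        residual = {t: max(0.0, pt - q.get(t, 0.0)) for t, pt in p.items()}
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft token was accepted: the same pass also yields one extra token.
    emitted.append(sample(p_dists[len(draft_tokens)], rng))
    return emitted
```

With k draft tokens, one verification step emits between 1 and k + 1 tokens. Because rejected positions are resampled from the residual, the emitted tokens follow the target model's distribution rather than the draft's, which is why the method can preserve output quality.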

Because the draft model is trained on the target model’s output distribution, it proposes candidates the target is likely to accept, maintaining useful acceptance rates across diverse workloads. FriendliAI trains and automatically pairs draft models with supported target models, so you have no additional training to run and no extra models to manage.

Draft model proposes candidate tokens that the target model verifies in a single parallel forward pass for faster output with similar quality.

This approach works because LLM inference at low batch sizes is typically memory-bandwidth-bound: each target-model forward pass spends much of its time loading weights from GPU memory, with compute units underutilized. Because the draft model is much smaller, generating candidate tokens requires far less memory bandwidth than a target-model pass. The target model then verifies those candidates in a single parallel forward pass at roughly the cost of a standard single-token decoding step, allowing the same memory traffic to produce multiple accepted tokens. This is where the increase in output speed comes from for latency-bound workloads. The size of the gain depends on how predictable the target model’s outputs are: workloads with repetitive or structured token patterns yield high acceptance rates and large speedups, while highly variable, open-ended outputs accept fewer draft tokens and see smaller gains. At higher batch sizes, when the target model becomes compute-saturated, the performance gains naturally narrow.
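
To make the relationship between acceptance rate and speedup concrete, here is a rough back-of-the-envelope model in Python. It rests on simplifying assumptions that are not measurements from Friendli Dedicated Endpoints: every draft token is accepted independently with the same probability `alpha`, a verification pass costs the same as one ordinary decoding step, and each draft step costs a fixed fraction `draft_cost_ratio` of a target pass. The function and parameter names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    This is the geometric sum 1 + alpha + alpha^2 + ... + alpha^k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One speculative step costs k draft steps plus one target pass, measured in
    units of a single target pass, and emits expected_tokens_per_pass tokens.
    """
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, k) / step_cost


if __name__ == "__main__":
    # Illustrative inputs only; real acceptance rates depend on the workload.
    for alpha in (0.3, 0.6, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.05), 2))
```

Under this model the estimate drops below 1 when acceptance is very low, because the draft steps still cost time. It also ignores the compute-saturated regime at higher batch sizes, where verification is no longer nearly free.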

Draft-model speculative decoding is particularly well suited for:

Agentic pipelines where LLM calls chain together and latency compounds across steps
Long-form generation — including reasoning chains, structured outputs (JSON, code, markdown), and summarization — where repetitive token patterns increase acceptance rates and per-token savings accumulate over long outputs
Code completion and IDE assistants, where single-user sessions, tight latency budgets, and highly repetitive token patterns combine to deliver some of the strongest gains

Currently, the following models support draft-model speculative decoding on Friendli Dedicated Endpoints: Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, DeepSeek-V3.2, MiniMax-M2.5, GLM-5, and GLM-5.1.

Draft-model vs. N-gram speculative decoding: choosing the right method

While draft-model speculative decoding uses a trained network to predict candidate tokens, N-gram speculative decoding leverages recurring token patterns in the target model’s own output. The system scans the prompt and generated output so far for repeated N-token sequences (where N denotes the prefix match length). When a match is found, the tokens that appeared after that sequence are used as speculative candidates without requiring an additional model. The target model verifies those candidates in a single parallel forward pass, accepting the matching prefix and discarding the remaining tokens. On Friendli Dedicated Endpoints, you select one speculative decoding strategy based on the characteristics of your workload.
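
The lookup and verification steps described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the production algorithm: the function names, the `num_candidates` parameter, and the choice to prefer the longest and most recent match are all hypothetical, and verification is shown for greedy decoding only, where a candidate is accepted if it equals the token the target model would have chosen itself.

```python
def propose_ngram_candidates(tokens, max_ngram=3, num_candidates=4):
    """Propose speculative tokens by matching the most recent n-gram.

    tokens: prompt plus generated token ids so far.
    Tries the longest suffix first (up to max_ngram tokens), looks for an
    earlier occurrence of it, and returns the tokens that followed it.
    """
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_candidates]
    return []  # No match: fall back to ordinary one-token decoding.


def accept_matching_prefix(candidates, target_choices):
    """Keep candidates up to the first position where the target disagrees.

    target_choices: the target model's own token choice at each candidate
    position, as obtained from the single parallel verification pass.
    """
    accepted = []
    for cand, choice in zip(candidates, target_choices):
        if cand != choice:
            break
        accepted.append(cand)
    return accepted
```

For example, if the context ends with a sequence that appeared earlier, `propose_ngram_candidates` returns whatever followed the earlier occurrence, and `accept_matching_prefix` keeps only the part the target model agrees with. When nothing in the context repeats, the proposer returns an empty list, which is why match rates fall as outputs become more diverse.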

N-gram speculative decoding performs best on highly structured and repetitive outputs where repeated token sequences are common and pattern matching introduces negligible overhead. As output diversity increases, N-gram match rates decline and the performance gains gradually diminish. On poorly matched workloads, frequent rejections of speculated tokens can even add slight latency compared to standard autoregressive decoding.

Draft-model speculative decoding generalizes beyond literal repetition because the predictor is learned from the target model’s output distribution, so it can propose accurate candidates even when the exact token sequence has not appeared before. N-gram speculative decoding, however, is compatible with any model on Dedicated Endpoints and requires no additional model. You can configure a maximum N-gram size between 1 and 10 during endpoint creation, with 3 recommended as a practical starting point.

	Draft-model speculative decoding	N-gram speculative decoding
Mechanism	A draft model proposes candidate tokens, which the target model verifies in parallel	Previously observed repeating token patterns are reused to predict continuations
Optimal use cases	Agentic pipelines, code completion and IDE assistants, and long-form generation (reasoning chains, structured outputs, summarization)	Structured and repetitive outputs such as code generation and templated reports
Impact on inference performance	Increases output speed (i.e., reduces TPOT) in memory-bandwidth-bound regimes (low-batch serving)	Delivers output speedups on repetitive workloads with strong pattern recurrence
Compatible models	Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, GLM-5, GLM-5.1, DeepSeek-V3.2, MiniMax-M2.5	550k+ models available on Dedicated Endpoints
Utilizes draft model	Yes — provided and managed by FriendliAI	No

To learn more about N-gram speculative decoding, review our blog.

Accelerate inference for your deployment with FriendliAI

Autoregressive decoding is the foundation of LLM generation, but its one-token-per-pass execution model creates a latency floor that grows with output length. Speculative decoding lifts that constraint by amortizing each target-model forward pass across multiple accepted tokens, enabling faster token generation while preserving output quality. On Friendli Dedicated Endpoints, you can choose the speculative decoding strategy that best fits your workload: a draft model approach for diverse, open-ended generation, or N-gram speculative decoding for structured and repetitive outputs. Enable either approach with a simple toggle during endpoint creation and start serving faster today.

Ready to try it? Spin up a Dedicated Endpoint with speculative decoding enabled, or read the documentation for configuration details.

Written by

FriendliAI Tech & Research

General FAQ

How does FriendliAI reduce inference costs?

FriendliAI reduces inference costs through higher GPU utilization and optimized inference performance. FriendliAI's patented continuous batching technique, along with quantization, speculative decoding, KV cache offloading, multi-LoRA serving, and autoscaling, helps you serve more tokens with fewer GPUs, lowering your infrastructure costs without sacrificing performance.

Why should I choose FriendliAI over other inference providers?

FriendliAI is built for production AI agents, combining speed, reliability, and efficiency at scale. It delivers low-latency streaming, reliable long-context inference, and robust tool calling without compromising stability. According to independent OpenRouter benchmarks, FriendliAI consistently ranks among the top providers for throughput, latency, and reliability across leading open-weight models. See why customers choose FriendliAI

Which open-weight models does FriendliAI support?

Run today’s frontier open-weight models—including GLM, MiniMax, Kimi, DeepSeek, Qwen, Gemma, and more—with a simple API call. FriendliAI Model API gives you instant access to the latest models with optimized inference performance for production workloads. Explore models and pricing

How do I get started?

Getting started takes just a few minutes. [1] Sign up for FriendliAI, [2] Generate your API key, and [3] Make your first inference request with frontier open-weight models.

Still have questions?

If you want a customized solution for the key issue that is slowing your growth, email support@friendli.ai or click Talk to an engineer. Our engineers (not a bot) will reply within one business day.