• May 19, 2026
  • 5 min read

Accelerating Inference on Friendli Dedicated Endpoints with Draft-Model Speculative Decoding

TL;DR
  • Draft-model speculative decoding is now available on Friendli Dedicated Endpoints with a single toggle at endpoint creation and no code changes.
  • A smaller draft model proposes candidate tokens that the target model verifies in a single parallel forward pass, preserving output quality.
  • FriendliAI trains and automatically pairs draft models with supported targets including Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, DeepSeek-V3.2, MiniMax-M2.5, GLM-5, and GLM-5.1.
  • Unlike N-gram speculative decoding, the draft model generalizes beyond literal repetition, delivering accurate candidates on diverse, open-ended workloads.
  • Best suited for agentic pipelines, long-form reasoning and structured outputs, and code completion where sequential decoding dominates latency.

Large language models typically generate text autoregressively, producing one token at a time through repeated decoding steps. As generations become longer, these sequential decoding iterations increasingly dominate inference latency. Speculative decoding mitigates this bottleneck by proposing multiple candidate tokens ahead of time and verifying them in parallel, reducing the number of expensive target-model decoding passes during generation.

Draft-model speculative decoding uses a smaller, faster draft model to propose candidate tokens that the target model verifies in parallel. This capability is now available on Friendli Dedicated Endpoints alongside training-free approaches such as N-gram speculative decoding, and can be enabled with a simple toggle in the user interface. For supported models, it uses pre-trained draft models to increase output speed (i.e., reduce time per output token) without requiring application code modifications.

[Figure: Enable speculative decoding through a toggle in the user interface.]

How draft-model speculative decoding works

Standard autoregressive decoding runs one target-model forward pass per generated token, making generation fundamentally sequential. Draft-model speculative decoding breaks this loop: a smaller, faster draft model proposes multiple subsequent tokens, and the target model verifies them in a single parallel forward pass over the candidate sequence.
Each proposed token is accepted or rejected according to the target's own next-token distribution, using a verification rule designed to preserve the target model’s original output distribution. When draft tokens are accepted, the system emits multiple tokens from a single target-model forward pass. When rejected, the remaining candidates are discarded, and decoding resumes from a token sampled from the adjusted residual distribution.
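
The accept/reject step described above can be sketched in a few lines. This is a minimal illustration of the standard speculative-sampling rule from the research literature, not a description of FriendliAI's implementation. The function name `verify_draft` and its arguments are hypothetical, and the probability distributions are passed in as plain lists rather than produced by real models.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Apply the standard speculative-sampling accept/reject rule (sketch).

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists over the vocabulary) the draft sampled from.
    target_probs: k + 1 target distributions from one parallel pass
                  (one per draft position, plus one after the last draft token).
    Returns the list of tokens emitted for this step.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        # q[tok] > 0 because the draft model sampled tok from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and
        # discard every remaining draft token.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(rng.choices(range(len(p)), weights=residual)[0])
        return emitted
    # Every draft token was accepted: sample one bonus token from the target.
    last = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted
```

Under this rule a step always emits at least one token, and at most one more than the number of draft tokens, which is why a rejection costs little compared with plain decoding.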

Because the draft model is trained on the target model’s output distribution, it proposes candidates the target is likely to accept, maintaining useful acceptance rates across diverse workloads. FriendliAI trains and automatically pairs draft models with supported target models, so you do not need to train or manage an additional model yourself.

[Figure: Draft-model speculative decoding. A draft model proposes candidate tokens that the target model verifies in a single parallel forward pass for faster output with similar quality.]

This approach works because LLM inference at low batch sizes is typically memory-bandwidth-bound: each target-model forward pass spends much of its time loading weights from GPU memory, with compute units underutilized. Because the draft model is much smaller, generating candidate tokens requires far less memory bandwidth than a target-model pass. The target model then verifies those candidates in a single parallel forward pass at roughly the cost of a standard single-token decoding step, allowing the same memory traffic to produce multiple accepted tokens. This is where the increase in output speed comes from for latency-bound workloads. The size of the gain depends on how predictable the target model’s outputs are: workloads with repetitive or structured token patterns yield high acceptance rates and large speedups, while highly variable, open-ended outputs accept fewer draft tokens and see smaller gains. At higher batch sizes, when the target model becomes compute-saturated, the performance gains naturally narrow.
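
To make the relationship between acceptance rate and speedup concrete, here is a back-of-the-envelope estimator. It is a simplified model, not a benchmark: it assumes every draft token is accepted independently with the same probability `alpha`, that a verification pass costs the same as one ordinary decoding step, and that `draft_cost` (a hypothetical input) is the draft model's per-token cost as a fraction of one target pass.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k draft tokens,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding under the assumptions above.

    One speculative step costs k draft passes plus one target pass,
    measured in units of a single target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)
```

For illustration only: with `alpha = 0.8` and `k = 4`, this model gives about 3.36 tokens per target pass. With `alpha = 0.0` the estimate drops below 1.0 for any positive `draft_cost`, which matches the observation that poorly predicted workloads see little or no gain.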

Draft-model speculative decoding is particularly well suited for:

  • Agentic pipelines where LLM calls chain together and latency compounds across steps
  • Long-form generation — including reasoning chains, structured outputs (JSON, code, markdown), and summarization — where repetitive token patterns increase acceptance rates and per-token savings accumulate over long outputs
  • Code completion and IDE assistants, where single-user sessions, tight latency budgets, and highly repetitive token patterns combine to deliver some of the strongest gains

Currently, the following models support draft-model speculative decoding on Friendli Dedicated Endpoints:

  • Gemma-4-31b-it
  • Kimi-K2.6
  • Qwen3.6-27B
  • DeepSeek-V3.2
  • MiniMax-M2.5
  • GLM-5
  • GLM-5.1

Draft-model vs. N-gram speculative decoding: choosing the right method

While draft-model speculative decoding uses a trained network to predict candidate tokens, N-gram speculative decoding leverages recurring token patterns in the target model’s own output. The system scans the prompt and generated output so far for repeated N-token sequences (where N denotes the prefix match length). When a match is found, the tokens that appeared after that sequence are used as speculative candidates without requiring an additional model. The target model verifies those candidates in a single parallel forward pass, accepting the matching prefix and discarding the remaining tokens. On Friendli Dedicated Endpoints, you select one speculative decoding strategy based on the characteristics of your workload.
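
The lookup and prefix-acceptance steps described above can be sketched as follows. This is a simplified illustration with hypothetical function names, not FriendliAI's implementation: it uses a single fixed match length `n`, searches for the most recent earlier match, and models verification as a greedy comparison against the tokens the target model would have produced.

```python
def propose_ngram_candidates(tokens, n, max_candidates):
    """Find the most recent earlier occurrence of the last n tokens and
    return the tokens that followed it as speculative candidates."""
    if n <= 0 or len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left, skipping the suffix's own position at the end.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_candidates]
    return []


def accept_matching_prefix(candidates, target_tokens):
    """Greedy verification: keep candidates up to the first position where
    they disagree with the target model's own choice."""
    accepted = []
    for cand, tgt in zip(candidates, target_tokens):
        if cand != tgt:
            break
        accepted.append(cand)
    return accepted
```

For example, with the token history `[1, 2, 3, 4, 1, 2]` and `n = 2`, the suffix `[1, 2]` matches the start of the sequence, so the tokens that followed it there become the candidates. No extra model is involved, which is why the proposal step is nearly free.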

N-gram speculative decoding performs best on highly structured and repetitive outputs where repeated token sequences are common and pattern matching introduces negligible overhead. As output diversity increases, N-gram match rates decline and the performance gains gradually diminish. On poorly matched workloads, rejected speculations can even add slight latency relative to standard autoregressive decoding.

Draft-model speculative decoding generalizes beyond literal repetition because the predictor is learned from the target model’s output distribution, so it can propose accurate candidates even when the exact token sequence has not appeared before. N-gram speculative decoding, however, is compatible with any model on Dedicated Endpoints and requires no additional model. You can configure a maximum N-gram size between 1 and 10 during endpoint creation, with 3 recommended as a practical starting point.

| | Draft-model speculative decoding | N-gram speculative decoding |
| Mechanism | A draft model proposes candidate tokens, which the target model verifies in parallel | Previously observed repeating token patterns are reused to predict continuations |
| Optimal use cases | Agentic pipelines, code completion and IDE assistants, long-form generation with reasoning chains or structured outputs and summarization | Structured and repetitive outputs: code generation and templated reports |
| Impact on inference performance | Increases output speed (i.e., reduces TPOT) in memory-bandwidth-bound regimes (low-batch serving) | Delivers output speedups on repetitive workloads with strong pattern recurrence |
| Compatible models | Gemma-4-31b-it, Kimi-K2.6, Qwen3.6-27B, GLM-5, GLM-5.1, DeepSeek-V3.2, MiniMax-M2.5 | 550k+ models available on Dedicated Endpoints |
| Utilizes draft model | Yes, provided and managed by FriendliAI | No |

To learn more about N-gram speculative decoding, review our blog.

Accelerate inference for your deployment with FriendliAI

Autoregressive decoding is the foundation of LLM generation, but its one-token-per-pass execution model creates a latency floor that grows with output length. Speculative decoding lifts that constraint by amortizing each target-model forward pass across multiple accepted tokens, enabling faster token generation while preserving output quality. On Friendli Dedicated Endpoints, you can choose the speculative decoding strategy that best fits your workload: a draft model approach for diverse, open-ended generation, or N-gram speculative decoding for structured and repetitive outputs. Enable either approach with a simple toggle during endpoint creation and start serving faster today.

Ready to try it? Spin up a Dedicated Endpoint with speculative decoding enabled, or read the documentation for configuration details.


Written by

FriendliAI Tech & Research



