- August 8, 2025
- 2 min read
Introducing N-gram Speculative Decoding: Faster Inference for Structured Tasks

We’re excited to introduce N-gram Speculative Decoding, a new feature in Dedicated Endpoints that speeds up LLM responses for structured and predictable tasks — such as code generation, legal drafting, or templated writing.
This is the first in a series of speculative decoding techniques coming to Dedicated Endpoints. N-gram speculative decoding accelerates inference by leveraging common N-gram patterns — with no changes required to your model or pipelines. It’s now available as a free-to-try feature for all plans.
What Is N-gram Speculative Decoding?
N-gram speculative decoding uses known token patterns (or N-grams) to look ahead in the generation process and predict likely next tokens. This approach enables the system to generate multiple tokens in parallel — greatly improving latency for deterministic or structured outputs.
Unlike draft-model speculative decoding, which relies on a separate lightweight model to propose future tokens, N-gram speculative decoding leverages the model’s own output patterns, making it faster to initialize and simpler to deploy.
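The core idea can be illustrated with a toy sketch (this is an illustration of the general technique, not Friendli's actual implementation; the function names are hypothetical): index every (n-1)-token context already seen in the sequence, and whenever the most recent tokens match a known context, propose the token that followed it last time. Repeating the lookup yields a multi-token draft from a single pattern match.

```python
from collections import defaultdict

def build_ngram_index(tokens, n=3):
    """Map each (n-1)-token context to the tokens that followed it."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        index[context].append(tokens[i + n - 1])
    return index

def propose_draft(tokens, index, n=3, max_draft=4):
    """Propose up to max_draft tokens by repeatedly matching the most
    recent (n-1)-token context against patterns seen so far."""
    draft = []
    history = list(tokens)
    for _ in range(max_draft):
        context = tuple(history[-(n - 1):])
        if context not in index:
            break  # no known pattern: nothing to speculate
        draft.append(index[context][-1])  # most recent continuation wins
        history.append(draft[-1])
    return draft
```

On a repetitive sequence such as `["a", "b", "c", "a", "b"]`, `propose_draft` keeps extending the match and returns `["c", "a", "b", "c"]`, which is exactly the kind of structured repetition (boilerplate code, templated text) where this technique shines.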
With N-gram speculative decoding, you get:
- Faster outputs: Reduces Time-per-Output-Token (TPOT)
- No need to train draft models: It works without draft models, simplifying the setup
- Seamless integration: Just toggle it on during endpoint creation
- Optimized performance: Especially powerful when combined with Friendli Inference
This makes N-gram speculative decoding ideal for applications like:
- Code generation
- Formatted emails or reports
- Legal contracts
- Structured JSON generation
- Templated summaries
Getting Started
To enable N-gram speculative decoding:
- Create a new Dedicated Endpoint
- Toggle on “N-gram speculative decoding” under “Endpoint features”
- (Optional) Set minimum and maximum N-gram size for customization
- Deploy — no model or code changes needed
To check whether N-gram speculative decoding is enabled for an endpoint, visit its overview page.

When enabled, N-gram speculative decoding automatically detects frequently occurring token sequences and uses those patterns to pre-generate likely continuations. The model then verifies these predictions in parallel, skipping unnecessary steps and speeding up generation.
If the predicted N-grams are correct, they're committed instantly. If not, the model falls back to standard decoding. This yields faster inference with minimal overhead and no accuracy tradeoff.
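The verify-then-commit loop above can be sketched as follows (again, a simplified toy, not Friendli's production engine; `toy_next_token` stands in for the real target model). Because every committed token is the one the target model itself produces, the output is identical to standard decoding, which is why there is no accuracy tradeoff.

```python
def speculative_step(generated, draft, target_next_token):
    """Verify a drafted N-gram against the target model.
    Matching prefix tokens are committed; at the first mismatch the
    model's own token is committed instead (standard decoding)."""
    accepted = []
    context = list(generated)
    for guess in draft:
        # In a real engine these checks happen in one parallel
        # forward pass, not a sequential Python loop.
        actual = target_next_token(context)
        accepted.append(actual)  # always commit the verified token
        context.append(actual)
        if actual != guess:
            break  # draft diverged: fall back to standard decoding
    return accepted

def toy_next_token(context):
    """Stand-in for the target model: deterministically cycles a->b->c."""
    return {"a": "b", "b": "c", "c": "a"}[context[-1]]
```

When the draft is correct, every drafted token is accepted in a single step; when it diverges, only the verified prefix plus one corrected token is kept, so a bad guess costs almost nothing.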
N-gram speculative decoding further accelerates generation with lookahead techniques, integrating seamlessly with our other advanced technologies powered by Friendli Inference.
To learn more about N-gram speculative decoding, check out our documentation!
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that actually matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you directly to our model deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you need a customized solution for a key issue that is slowing your growth, email contact@friendli.ai or click Talk to an expert; our experts (not a bot) will reply within one business day.