- August 8, 2025
- 2 min read
Introducing N-gram Speculative Decoding: Faster Inference for Structured Tasks

We’re excited to introduce N-gram Speculative Decoding, a new feature in Dedicated Endpoints that speeds up LLM responses for structured and predictable tasks — such as code generation, legal drafting, or templated writing.
This is the first in a series of speculative decoding techniques coming to Dedicated Endpoints. N-gram speculative decoding accelerates inference by leveraging common N-gram patterns — with no changes required to your model or pipelines. It’s now available as a free-to-try feature for all plans.
What Is N-gram Speculative Decoding?
N-gram speculative decoding uses known token patterns (or N-grams) to look ahead in the generation process and predict likely next tokens. This approach enables the system to generate multiple tokens in parallel — greatly improving latency for deterministic or structured outputs.
Unlike draft-model speculative decoding, which relies on a separate lightweight model to propose future tokens, N-gram speculative decoding leverages the model’s own output patterns, making it faster to initialize and simpler to deploy.
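The core idea can be illustrated with a toy sketch (this is an illustration of the general technique, not Friendli's actual implementation; the function names are hypothetical): index every (n-1)-token context already seen in the sequence, and whenever the most recent tokens match a known context, propose the token that followed it last time. Repeating the lookup yields a multi-token draft from a single pattern match.

```python
from collections import defaultdict

def build_ngram_index(tokens, n=3):
    """Map each (n-1)-token context to the tokens that followed it."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        index[context].append(tokens[i + n - 1])
    return index

def propose_draft(tokens, index, n=3, max_draft=4):
    """Propose up to max_draft tokens by repeatedly matching the most
    recent (n-1)-token context against patterns seen so far."""
    draft = []
    history = list(tokens)
    for _ in range(max_draft):
        context = tuple(history[-(n - 1):])
        if context not in index:
            break  # no known pattern: nothing to speculate
        draft.append(index[context][-1])  # most recent continuation wins
        history.append(draft[-1])
    return draft
```

On a repetitive sequence such as `["a", "b", "c", "a", "b"]`, `propose_draft` keeps extending the match and returns `["c", "a", "b", "c"]`, which is exactly the kind of structured repetition (boilerplate code, templated text) where this technique shines.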
With N-gram speculative decoding, you get:
- Faster outputs: Reduces Time-per-Output-Token (TPOT)
- No need to train draft models: It works without draft models, simplifying the setup
- Seamless integration: Just toggle it on during endpoint creation
- Optimized performance: Especially powerful when combined with Friendli Inference
This makes N-gram speculative decoding ideal for applications like:
- Code generation
- Formatted emails or reports
- Legal contracts
- Structured JSON generation
- Templated summaries
Getting Started
To enable N-gram speculative decoding:
- Create a new Dedicated Endpoint
- Toggle on “N-gram speculative decoding” under “Endpoint features”
- (Optional) Set minimum and maximum N-gram size for customization
- Deploy — no model or code changes needed
To check whether N-gram speculative decoding is enabled for an endpoint, visit its overview page.

When enabled, N-gram speculative decoding automatically detects frequently occurring token sequences and uses those patterns to pre-generate likely continuations. The model then verifies these predictions in parallel, skipping unnecessary steps and speeding up generation.
If the predicted N-grams are correct, they're committed instantly. If not, the model falls back to standard decoding. This yields faster inference with minimal overhead and no accuracy tradeoff.
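The verify-then-commit loop above can be sketched as follows (again, a simplified toy, not Friendli's production engine; `toy_next_token` stands in for the real target model). Because every committed token is the one the target model itself produces, the output is identical to standard decoding, which is why there is no accuracy tradeoff.

```python
def speculative_step(generated, draft, target_next_token):
    """Verify a drafted N-gram against the target model.
    Matching prefix tokens are committed; at the first mismatch the
    model's own token is committed instead (standard decoding)."""
    accepted = []
    context = list(generated)
    for guess in draft:
        # In a real engine these checks happen in one parallel
        # forward pass, not a sequential Python loop.
        actual = target_next_token(context)
        accepted.append(actual)  # always commit the verified token
        context.append(actual)
        if actual != guess:
            break  # draft diverged: fall back to standard decoding
    return accepted

def toy_next_token(context):
    """Stand-in for the target model: deterministically cycles a->b->c."""
    return {"a": "b", "b": "c", "c": "a"}[context[-1]]
```

When the draft is correct, every drafted token is accepted in a single step; when it diverges, only the verified prefix plus one corrected token is kept, so a bad guess costs almost nothing.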
N-gram speculative decoding further accelerates generation with lookahead techniques, integrating seamlessly with our other advanced technologies powered by Friendli Inference.
To learn more about N-gram speculative decoding, check out our documentation!
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the metric that actually matters, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you directly to our model deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you need a customized solution for a key issue that is slowing your growth, email contact@friendli.ai or click Talk to an expert; our experts (not a bot) will reply within one business day.