Speculative decoding with a draft model
You can enable speculative decoding by pairing the target model with a pre-trained draft model. A fast draft model proposes several tokens ahead, and the larger target model verifies them in parallel; each forward pass of the target can therefore accept multiple tokens, increasing throughput. This feature is currently limited to a curated list of target models.
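The propose-and-verify loop can be sketched as follows. This is a minimal greedy-verification toy, not the product's actual implementation: `target_step` and `draft_step` are hypothetical stand-ins for the two models' next-token functions, and real systems verify all drafted positions in one batched forward pass rather than a Python loop.

```python
def speculative_decode(target_step, draft_step, prompt, k=4, max_new=8):
    """Greedy speculative decoding sketch.

    Both callables map a token sequence to its next token; draft_step
    stands in for the fast draft model, target_step for the slower
    target. The draft proposes k tokens; the target recomputes them
    and accepts the longest matching prefix, replacing the first
    mismatch with its own token.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_step(seq + draft))
        # 2) Target verifies the k drafted positions.
        for i in range(k):
            t = target_step(seq + draft[:i])
            if t != draft[i]:
                # Accept the matching prefix plus the target's correction.
                seq.extend(draft[:i])
                seq.append(t)
                break
        else:
            # All k drafted tokens matched: k tokens accepted in one pass.
            seq.extend(draft)
    return seq[len(prompt):len(prompt) + max_new]


# Toy "models": the next token is (last + 1) mod 100.
target_step = lambda seq: (seq[-1] + 1) % 100
good_draft = lambda seq: (seq[-1] + 1) % 100   # always agrees with target
bad_draft = lambda seq: 0                      # almost never agrees

print(speculative_decode(target_step, good_draft, [0]))  # [1, 2, ..., 8]
```

Note that the output is identical with either draft model; a poor draft only reduces how many tokens are accepted per pass, never the result, which is why speculative decoding is lossless.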
N-gram speculative decoding
You can toggle the switch to enable N-gram speculative decoding. When enabled, previously generated tokens are used to pre-generate future tokens, which can deliver substantial performance gains on repetitive or otherwise predictable tasks. You can also set the Maximum N-gram Size, which defines how many tokens are predicted in advance. We recommend keeping the default value of 3.
Higher values can further reduce latency when the predictions succeed. However, predicting too many tokens at once lowers the acceptance rate and, in extreme cases, can even increase latency.
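One common way to realize this idea (sometimes called prompt-lookup decoding) is to search the text generated so far for the most recent earlier occurrence of the current suffix and propose the tokens that followed it. The sketch below is illustrative only; `ngram_propose` and its `match_len` parameter are assumptions, not the product's actual implementation, and `max_ngram` plays the role of the Maximum N-gram Size setting:

```python
def ngram_propose(tokens, max_ngram=3, match_len=2):
    """Propose up to max_ngram draft tokens by looking up the last
    match_len tokens in the earlier context and copying what followed
    their most recent earlier occurrence. Returns [] when there is no
    match, in which case the target model decodes normally."""
    if len(tokens) <= match_len:
        return []
    suffix = tokens[-match_len:]
    # Scan backwards; the range excludes the trailing suffix itself.
    for start in range(len(tokens) - match_len - 1, -1, -1):
        if tokens[start:start + match_len] == suffix:
            continuation = tokens[start + match_len:start + match_len + max_ngram]
            if continuation:
                return continuation
    return []


text = "the cat sat on the mat . the cat".split()
print(ngram_propose(text))  # ['sat', 'on', 'the']
```

The drafted tokens are then verified by the target model exactly as with a draft model, so an incorrect lookup costs only the wasted verification, never a wrong output. This also shows why a larger `max_ngram` helps on repetitive text but hurts when the copied continuation diverges early.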