Large Language Models (LLMs) are at the centre of recent rapid progress in artificial intelligence, yet their size often makes them slow during the inference phase. For user-facing products, this slow response can result in an undesirably sluggish experience.
A primary reason for this is that an LLM generates its output one token at a time, where each token is typically a word or part of a word. This means a model must run its entire set of weights for every single decoding step. For the largest models, this can require reading nearly a terabyte of data for every single word produced, making memory bandwidth the main bottleneck for performance.
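The scale of this bottleneck can be seen with a back-of-envelope calculation. The sketch below uses hypothetical numbers (a 70-billion-parameter model in 16-bit precision and roughly 2 TB/s of accelerator memory bandwidth, both illustrative assumptions, not figures from this article) to show why reading every weight per token caps decoding speed:

```python
# Back-of-envelope estimate of the memory-bandwidth ceiling on decoding.
# All numbers are illustrative assumptions, not measured figures.

params = 70e9          # hypothetical dense model: 70 billion parameters
bytes_per_param = 2    # 16-bit (fp16/bf16) weights
bandwidth = 2e12       # assumed ~2 TB/s of accelerator memory bandwidth

# Each decoding step must read every weight once.
bytes_per_token = params * bytes_per_param       # 140 GB read per token

# Even ignoring all computation, bandwidth alone bounds throughput.
max_tokens_per_s = bandwidth / bytes_per_token   # ~14 tokens/s at best

print(f"{bytes_per_token / 1e9:.0f} GB per token, "
      f"at most {max_tokens_per_s:.0f} tokens/s from bandwidth alone")
```

Under these assumptions the model can emit only around fourteen tokens per second however fast its arithmetic units are, which is exactly the gap speculative decoding targets.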
To overcome this, researchers at Google published a technique called speculative decoding, which can significantly reduce inference times without compromising quality. The algorithm is based on the observation that some tokens are much easier to generate than others. For example, generating the next token in a common phrase is simple, while computing a specific answer requires more effort. Speculative decoding leverages this by using a fast approximation function, usually a much smaller model, to guess multiple tokens in advance. The large, slow model then verifies these guesses in a single parallel step.
If, once the large model finishes its computation, it finds that the small model's guesses were correct, the system has generated several tokens for the price of a single serial step. If a guess was incorrect, the speculative work is simply discarded and the system reverts to the standard serial process, so output quality is never compromised. Because modern hardware such as GPUs can perform hundreds of operations for every byte read from memory, there are ample spare computational resources available to run these small models alongside the main task. This technique can result in speed improvements of two to three times for tasks such as translation and summarisation.
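The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not the published algorithm: `draft_next` and `target_next` are made-up deterministic stand-ins for the small and large models, and acceptance is by exact match (greedy decoding) rather than the probabilistic acceptance rule used with sampling. Note that in a real system the target model scores all the draft positions in one parallel forward pass; here that pass is simulated with a loop.

```python
# Toy sketch of speculative decoding with greedy (exact-match) acceptance.
# draft_next and target_next are hypothetical stand-ins for real models.

def draft_next(ctx):
    """Cheap draft model: a toy deterministic next-token rule."""
    return (sum(ctx) * 7 + 3) % 50

def target_next(ctx):
    """Slow target model: agrees with the draft on most tokens."""
    t = (sum(ctx) * 7 + 3) % 50
    return t if t % 5 else t + 1   # disagrees roughly one time in five

def speculative_step(ctx, k=4):
    # 1) The draft model guesses k tokens serially (cheap to run).
    guesses = []
    for _ in range(k):
        guesses.append(draft_next(ctx + guesses))
    # 2) The target scores all k+1 positions; in a real system this is
    #    ONE parallel forward pass, not k+1 separate calls.
    verified = [target_next(ctx + guesses[:i]) for i in range(k + 1)]
    # 3) Accept the longest matching prefix; on a miss, keep the
    #    target's own token and discard the rest of the speculation.
    accepted = []
    for g, v in zip(guesses, verified):
        if g == v:
            accepted.append(g)
        else:
            accepted.append(v)
            break
    else:
        accepted.append(verified[k])   # all matched: bonus token for free
    return accepted

ctx = [1, 2, 3]
out = []
while len(out) < 10:
    out += speculative_step(ctx + out)
print(out[:10])
```

The key property, which this sketch preserves, is that the accepted sequence is exactly what the target model would have produced on its own; the draft model only changes how fast it is produced, never what is produced.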
Speculative decoding has been widely adopted throughout the industry and is now a significant part of optimising large-scale products such as Google Search. Producing results faster with the same hardware also means that fewer machines are required to serve the same amount of traffic, which translates to a direct reduction in energy costs. This paradigm has also proven effective for other optimisation techniques, such as distilling knowledge from target models into draft models. As the usage of LLMs continues to grow, the need for these more efficient inference methods becomes increasingly critical for sustainable deployment.
The team at Academii are always happy to discuss all your training and education needs, help your organisation attract and train new talent, and build a resilient workforce. Please drop us a line here to learn more.













































































