Intel and Weizmann Institute Remove the “Speculative Decoding” Bottleneck

Top image: Nadav Timor (right) and Prof. David Harel. Photo: Weizmann Institute of Science

A joint team from Intel Labs and the Weizmann Institute of Science has presented a groundbreaking method for significantly accelerating AI processing based on large language models (LLMs). The research was showcased this week at ICML 2025 in Vancouver, Canada—one of the world’s top AI and machine learning conferences. The paper was selected for oral presentation, a rare honor granted to only 1% of the approximately 15,000 submissions.

The work was led by Prof. David Harel and PhD student Nadav Timor from the Weizmann Institute, in collaboration with Moshe Wasserblat, Oren Pereg, Daniel Korat, and Moshe Bartchansky from Intel, along with Gaurav Jain from d-Matrix. While LLMs like ChatGPT and Gemini are powerful, they are also slow and resource-hungry. As early as 2022, the industry began exploring ways to speed up inference by splitting tasks between different algorithms. This led to the emergence of Speculative Decoding, in which a smaller, faster model “guesses” the next output tokens, and the larger model only verifies the guess instead of computing it from scratch.

A Fast, Lightweight Helper Model

How does it work? In the standard process, an LLM must run a full, computationally heavy pass for every word it generates. For example, to complete the sentence “The capital of France is…”, a model might generate “Paris”, then read “The capital of France is Paris” and compute again to generate “a”, then once more to generate “city”. In total, it performs three heavy compute steps for three words.

With speculative decoding, a fast auxiliary model first drafts the entire phrase: “Paris”, “a”, “city”. The larger model then checks the full draft in a single verification step, accepting every token that matches what it would have generated itself. If the whole guess is correct, all three words are accepted at the cost of one heavy compute step, drastically reducing processing time.
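The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's algorithm: both "models" are simple lookup tables, generation is greedy, and the single batched verification pass of a real system is simulated here with a loop.

```python
def speculative_decode(target, drafter, prompt, k=3):
    """Greedy speculative-decoding sketch.

    `target` and `drafter` are callables mapping a tuple of tokens to
    the next token. The drafter proposes k tokens; the target verifies
    the whole draft (in a real system, one batched forward pass) and
    accepts the longest prefix it agrees with, substituting its own
    token at the first mismatch.
    """
    tokens = list(prompt)

    # 1) Draft: the small, fast model guesses k tokens autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        guess = drafter(tuple(ctx))
        draft.append(guess)
        ctx.append(guess)

    # 2) Verify: the large model checks every draft position at once.
    for guess in draft:
        expected = target(tuple(tokens))
        if guess == expected:
            tokens.append(guess)      # accepted "for free"
        else:
            tokens.append(expected)   # target's correction; discard the rest
            break
    return tokens


# Toy "models": lookup tables standing in for next-word predictors.
TARGET = {
    ("The", "capital", "of", "France", "is"): "Paris",
    ("The", "capital", "of", "France", "is", "Paris"): "a",
    ("The", "capital", "of", "France", "is", "Paris", "a"): "city",
}

# Here the drafter happens to agree with the target on all three tokens,
# so the whole draft is accepted in one verification step.
out = speculative_decode(TARGET.get, TARGET.get,
                         ("The", "capital", "of", "France", "is"))
```

Running this yields the full sentence with all three drafted words accepted. In practice the drafter sometimes guesses wrong, in which case only the matching prefix is kept, so the speedup depends on how often the small model agrees with the large one.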

From right to left: Moshe Wasserblat, Oren Pereg, Daniel Korat, and Moshe Bartchansky. Photo: Intel

The Bottleneck That Held the Industry Back

Although speculative decoding has been known for over three years, real-world adoption has been difficult. That’s because LLMs don’t truly “understand” words—they operate based on statistical relationships between tokens. Each model develops its own internal “digital language” of token IDs. For example, the word “apple” might be token #123 in one model and #987 in another.
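The mismatch can be shown with two toy vocabularies (the words and IDs below are invented for illustration; real models use subword vocabularies with tens of thousands of entries):

```python
# Two toy vocabularies: the same word maps to a different ID in each model.
vocab_a = {"the": 5, "apple": 123, "pie": 77}
vocab_b = {"the": 41, "apple": 987, "pie": 612}

# A draft produced in model A's "digital language"...
draft_ids = [vocab_a[w] for w in ["the", "apple", "pie"]]  # [5, 123, 77]

# ...is meaningless to model B: decoding [5, 123, 77] with B's vocabulary
# would produce entirely different words, so B cannot verify A's draft
# directly at the token-ID level.
```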

Until now, speculative decoding worked only when both models (large and auxiliary) used the exact same tokenizer and architecture, which in practice usually meant both had to come from the same company. Developers couldn’t simply pair any fast model with any LLM; they were locked into specific ecosystems.

This created a major bottleneck. The Israeli team overcame this with a new class of algorithms that decouple helper models from LLM architectures, making them cross-compatible across platforms, vocabularies, and companies.

A Surprising Solution to the Compatibility Problem

To bridge this gap, the researchers developed two key techniques. First, an algorithm that enables an LLM to translate its “thoughts” into a language understood by other models. Second, an algorithm that ensures both the large and small models rely primarily on token cognates—tokens with equivalent meanings across different token vocabularies.
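One natural way to bridge vocabularies, shown here as a hedged sketch rather than the paper's actual implementation, is to route the draft through the text that both models share: decode the drafter's token IDs back into a string, then re-encode that string with the target model's tokenizer. The word-level tokenizers below are invented for illustration; real subword vocabularies split text differently per model, which is exactly the hard case the paper's algorithms and the reliance on token cognates are designed to handle.

```python
# Toy vocabularies for a drafter and a target model (IDs are invented).
drafter_vocab = {"Paris": 3, "a": 7, "city": 9}
target_vocab = {"Paris": 40, "a": 2, "city": 15}

# Inverse map for decoding the drafter's IDs back to words.
drafter_id_to_word = {i: w for w, i in drafter_vocab.items()}

def translate_draft(draft_ids):
    """Map drafter token IDs -> shared text -> target token IDs."""
    text = " ".join(drafter_id_to_word[i] for i in draft_ids)
    return [target_vocab[w] for w in text.split()]

# The drafter's draft [3, 7, 9] ("Paris a city") becomes [40, 2, 15]
# in the target model's vocabulary, so the target can verify it.
translated = translate_draft([3, 7, 9])
```

With word-level tokenizers this round trip is trivial; with real subword tokenizers the re-encoded sequence may split words differently than either model expects, which is why favoring token cognates, tokens that mean the same thing in both vocabularies, matters for keeping the draft verifiable.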

“At first we feared that too much would get ‘lost in translation’ and the models wouldn’t sync,” said Nadav Timor, a PhD student in Prof. Harel’s lab and lead author of the paper. “But our fears proved unfounded.”

According to Timor, the algorithms achieved speedups of up to 2.8× in LLM inference, resulting in dramatic compute cost savings. “This makes speculative decoding accessible to any developer,” he said. “Until now, only companies with the resources to train custom small models could benefit from these techniques. For a startup, building such a model would have required deep expertise and significant investment.”

Now Available on Hugging Face

The new algorithms have already been integrated into the open-source platform Hugging Face, making them freely available to developers worldwide.

Read the full research paper:
https://arxiv.org/pdf/2502.05202