Google’s Gemma 4 Gets a Cheat Code: 3x Faster by Skipping Tokens

Last updated: May 23, 2026 12:05 am

By AIWadmin

The Slow Grind of Local AI Gets a Turbo Button

Let’s be honest: running a large language model on your own hardware has always felt like a compromise. You gain privacy but pay for it in pokey performance, watching the output dribble out one agonizing token at a time. Google just upended that tradeoff with its new Multi-Token Prediction (MTP) drafters for Gemma 4. This isn’t some hypothetical research paper. It’s a live, downloadable add-on that promises up to triple the inference speed with zero quality loss. If that holds up, it’s the biggest practical leap for edge AI since quantization.

The dirty secret of local inference is that your gaming GPU is mostly bored. Generating each token means streaming the model’s weights out of VRAM, so the compute units spend most of their time waiting on memory instead of doing math. Google exploits that dead air with a tiny 74-million-parameter drafter model that guesses multiple future tokens in a single pass. The big model then verifies the whole batch of guesses in parallel and keeps the ones it agrees with. This is speculative decoding, the AI cousin of speculative execution in CPUs, and it turns idle compute into extra throughput.
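For readers who want to see the draft-then-verify loop in miniature, here is a rough Python sketch. It is not Google’s MTP code. The two "models" (target_next and draft_next) are made-up toy functions, the draft length k is an arbitrary choice, and decoding is greedy for simplicity. In a real system the verification step is one batched forward pass of the big model; here it is an ordinary loop that is merely counted as one pass.

```python
from typing import List, Tuple

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the big model (greedy next token)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the small drafter.

    It agrees with the target except when the context length is a
    multiple of 4, where it deliberately guesses wrong.
    """
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def plain_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one big-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify decoding.

    Returns the generated sequence and the number of verification
    passes the big model needed.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens (cheap).
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2. The big model checks every proposed position.
        #    Counted as a single pass, as it would be when batched.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the big model's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every guess was right, so the same pass yields one extra token.
            accepted.append(target_next(out + draft))
        out.extend(accepted)
    return out[:goal], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast, passes = speculative_decode(prompt, n_new=20)
    slow = plain_decode(prompt, n_new=20)
    print("identical output:", fast == slow)
    print("big-model passes:", passes, "instead of", 20)
```

Every token that ends up in the output is either a guess the big model confirmed or the big model’s own correction, which is why the result matches plain decoding while needing fewer big-model passes. Systems that sample rather than decode greedily use a probabilistic accept/reject rule instead of the exact-match check shown here.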

A Generous Pivot or a Clever Trap?

Google isn’t just open-sourcing the tech. It has shifted the entire Gemma 4 license to Apache 2.0, a major departure from the restrictive custom license that covered earlier Gemma releases. The move looks generous, but cynical observers note it comes right as regulators are circling Big Tech’s walled gardens. By giving away the razor (the models), Google keeps developers coming back for the blades (its frameworks and hardware support). Still, for the hobbyist or privacy-conscious developer, this is a win. You can grab the MTP-enabled models and run them via MLX, Ollama, or vLLM right now.

The real-world benchmarks are impressive but caveated. Pixel phones see the promised 3x boost on the small E4B model. Apple’s M4 Mac gets a 2.5x uplift on the massive 31B dense model. Google claims ‘zero quality degradation’ because the main model still validates the drafter’s guesses. That holds up mathematically, since nothing reaches the output without the main model’s approval, but a 3x faster bad answer is still a bad answer. The speed gain is real, but the underlying model’s alignment and factual accuracy remain their own problems.
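Why would the speedup differ from one setup to the next? One general factor in speculative decoding is how often the big model accepts the drafter’s guesses. The back-of-envelope Python sketch below is not Google’s benchmarking method. It assumes each guess is accepted independently with the same probability (accept_rate, a hypothetical parameter) and it ignores the drafter’s own running cost, so treat the result as an optimistic upper bound on tokens per big-model pass.

```python
def expected_tokens_per_pass(accept_rate: float, k: int) -> float:
    """Expected tokens produced per big-model verification pass.

    Assumes (for illustration only) that each of the k drafted tokens is
    accepted independently with probability accept_rate. Each pass always
    yields at least one token: the big model's correction, or a bonus
    token when every guess was right.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # 1 + a + a^2 + ... + a^k: the i-th guess only counts if all
    # earlier guesses were accepted too.
    return sum(accept_rate ** i for i in range(k + 1))


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(expected_tokens_per_pass(rate, k=4), 2))
```

A drafter that never guesses right yields one token per pass, the same as ordinary decoding, while a perfect drafter yields k + 1. Real-world gains land somewhere in between and shrink further once the drafter’s own compute is counted.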

Source: Ars Technica
