Did Google’s TurboQuant Actually Solve the AI Memory Crunch?

On March 25, 2026, Google Research published a blog post about a compression algorithm called TurboQuant. Within 48 hours, SK Hynix had lost 7.3% of its market value. Micron dropped 3%. Western Digital fell 4.7%. SanDisk gave up 5.7%. Kioxia, the Japanese flash memory company, dropped nearly 6%. The selloff spread across two continents, wiping out tens of billions in market cap.

Cloudflare’s CEO Matthew Prince called it “Google’s DeepSeek moment.” Half the internet compared it to Pied Piper, the fictional startup from HBO’s Silicon Valley. The memes moved faster than the actual research.

So what actually happened? And does this algorithm change anything about the memory situation the AI industry has been panicking about for the past 18 months? Let’s decode.

Why Modern AI Is So Hungry for Memory

When an LLM generates text, it doesn’t recompute everything from the beginning with every new word. Instead, it stores all its prior calculations in a fast-access buffer called the key-value cache, or KV cache. Every token the model has seen in a conversation gets stored there, so when the model processes the next token, it can look back at what came before without redoing all the math.

The problem is that the cache grows continuously. A model working through a 100,000-token document is holding a massive amount of active data in GPU memory just to maintain context. And this got significantly worse when reasoning models became mainstream. Reasoning means long context, long context means a large KV cache, and a large KV cache means you need a lot of memory.
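To make that growth concrete, here is the standard back-of-envelope arithmetic for KV cache size. The function and the model dimensions below are illustrative (roughly a Llama-2-7B-class configuration), not figures from the post:

```python
# Back-of-envelope KV cache sizing. Dimensions are illustrative
# (roughly Llama-2-7B-like): 32 layers, 32 KV heads, 128-dim heads, fp16.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    # Keys AND values: 2 tensors per layer, each n_kv_heads * head_dim wide.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens

per_token = kv_cache_bytes(32, 32, 128, 1)      # 524,288 bytes = 0.5 MiB/token
doc = kv_cache_bytes(32, 32, 128, 100_000)      # ~48.8 GiB for a 100k-token doc
print(f"{per_token / 2**20:.2f} MiB per token, {doc / 2**30:.1f} GiB at 100k tokens")
```

At these assumed dimensions, a single 100,000-token document in full precision occupies more memory than many entire GPUs have, which is why the cache, not the weights, becomes the bottleneck at long context.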
By 2024, anyone paying attention to the trajectory of AI models could see where this was heading, but the market mostly didn’t catch up until prices started reflecting it.

[Figure: How the KV Cache Fills a GPU: Short Conversation vs 100,000 Token Document]

The industry has been fighting this problem for years, with genuine ingenuity, and TurboQuant is the latest step in that arc.

What TurboQuant Is and How It Works

TurboQuant compresses that KV cache down to 3 bits per value, from the standard 16. The claimed reduction is 6x in memory footprint, with an 8x speedup in attention computation on Nvidia H100 GPUs and no measurable accuracy loss in benchmarks.

The math works in two stages. The first stage, PolarQuant, converts data vectors from Cartesian coordinates into polar coordinates. In Cartesian form, a point is described by how far it sits along the X axis and Y axis: a grid of (x, y). In polar form, the same point is described by its distance from the origin (r) and the angle it makes with a reference direction (θ). The conversion is r = √(x² + y²) and θ = arctan(y/x); going back, x = r·cos(θ) and y = r·sin(θ). The same principle extends to higher dimensions.

This matters for compression because in polar space, the angular distribution of AI attention data clusters in predictable, concentrated patterns. Traditional quantization methods have to store extra normalization constants alongside the compressed data so the system can decompress accurately later. Those constants add one or two bits per value right back in, partially undoing the savings. PolarQuant eliminates that overhead because the structure of the data in polar space makes those constants unnecessary.

[Figure: How Cartesian Data Clusters in Polar Space to Enable KV Cache Compression]

The second stage handles the residual error left over from stage one. Each leftover error value gets reduced to a single sign bit, positive or negative. That sign bit acts as a statistical zero-bias corrector, meaning the compressed cache remains statistically equivalent to the full-precision original when the model computes attention scores. The model doesn’t notice the difference.

Google tested TurboQuant on five standard benchmarks for long-context models, including LongBench and Needle in a Haystack, using Gemma, Mistral, and Llama models. At 3 bits, it matched or beat KIVI, the standard baseline for KV cache quantization. On needle-in-a-haystack tasks, where the model has to locate a specific fact buried in a long document, it hit perfect scores at 6x compression.
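As a loose illustration of the two-stage idea (a toy 2-D sketch, not the paper’s actual algorithm): quantize each vector’s angle down to 3 bits, then represent the residual only by its per-component sign plus one shared scale, rather than per-value normalization constants:

```python
import numpy as np

def polar_sign_quantize(v, angle_bits=3):
    """Toy 2-D sketch: coarse angle quantization plus a sign-bit residual."""
    # Stage 1 (PolarQuant-style idea): store magnitude + a coarsely
    # quantized angle instead of raw (x, y) coordinates.
    r = np.hypot(v[0], v[1])
    theta = np.arctan2(v[1], v[0])
    n_bins = 2 ** angle_bits                     # 3 bits -> 8 angular bins
    bin_width = 2 * np.pi / n_bins
    q_theta = np.round(theta / bin_width) * bin_width
    approx = r * np.array([np.cos(q_theta), np.sin(q_theta)])

    # Stage 2: keep only the SIGN of each residual component, corrected by
    # one shared scale (a stand-in for the paper's zero-bias correction).
    residual = v - approx
    scale = np.mean(np.abs(residual))            # shared constant, not per-value
    return approx + scale * np.sign(residual)

# A vector lying exactly on an angular bin centre reconstructs exactly.
print(polar_sign_quantize(np.array([2.0, 0.0])))  # -> [2. 0.]
```

The point of the sketch is the bookkeeping: the only side information is one shared scale, so the per-value storage stays at a few bits, which is the overhead advantage the PolarQuant stage is claimed to deliver.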


The Crunch That Was Years in the Making

The reason a compression paper could move the memory chip market by 6% in two days is that the memory situation going into 2026 was already extreme. To understand it, you need to go back to 2023.

In 2023, memory manufacturers were losing money. DRAM prices had collapsed after the pandemic oversupply, and Samsung, SK Hynix, and Micron all pulled back on capital expenditure. They weren’t building new fabs because there was no margin to justify it. But that pullback coincided precisely with the beginning of the reasoning model era, which was about to create a demand curve no one in this industry had seen before.

Why is AI so hard on memory? A GPU needs data to move at extreme speeds to keep its processors fed. An HBM4 stack, the type of memory used in Nvidia’s latest chips, transfers data at roughly 2.5 terabytes per second. An equivalent amount of standard DDR5, the memory in your laptop, manages somewhere around 64 to 128 gigabytes per second. Consumer memory is built for a completely different job.

[Figure: HBM4 vs DDR5 Memory Bandwidth: Why AI GPUs Need 2.5 TB/s and Laptops Get 128 GB/s]

HBM is built differently: stacked in multiple layers, connected with thousands of micro-connections called through-silicon vias, and extraordinarily expensive to produce. One gigabyte of HBM consumes four times the wafer capacity of standard DRAM. To put that in GPU terms: a single Nvidia H100 currently costs between $25,000 and $30,000 per chip, and memory accounts for roughly 30% of the cost of deploying AI at scale. When Meta built its initial H100 training cluster with 24,000 of those chips, the GPU hardware bill alone crossed $800 million, before a single power cable was run or a server rack assembled. That’s one cluster; hyperscalers are building dozens. Of the $600 billion in combined Big Tech capital spending this year, roughly $180 billion is going to memory alone.

People usually make the “just make more memory” argument.
Global silicon wafer production capacity is growing, but only at around 6 to 7% per year. AI infrastructure spending is growing at rates many times that. The fabs that will eventually close the gap started construction after the demand signal hit, which means meaningful new capacity doesn’t come online until 2027-2028, and the crunch could plausibly last until 2030.

The Compression Arms Race That Was Already Happening

The industry has been chipping away at the KV cache memory problem for years.

GPT-2 XL, the largest 2019 variant, used the simplest possible design: every attention head kept its own independent set of keys and values. Cost: around 300 kilobytes per token. By 2024, Llama 3 8B had introduced grouped-query attention, where multiple heads share the same stored representations instead of maintaining separate copies. Cost dropped to 128 kilobytes per token, less than half, with almost no quality loss on benchmarks. Then DeepSeek V3 went further with multi-head latent attention, compressing the key-value pairs into a lower-dimensional form before storing them and decompressing at inference time. Cost: 68.6 kilobytes per token, on a model with 671 billion total parameters, though only 37 billion are active at any moment.

[Figure: KV Cache Per Token: GPT-2 XL to Llama 3 to DeepSeek V3 and the Shannon Limit TurboQuant Is Approaching]

That progression, 300 to 128 to 68.6 kilobytes per token, is the compression arc that existed before TurboQuant showed up. Each step traded something, usually architectural complexity or slight recall degradation, for meaningful memory savings. Each step also captured the easier gains first. What remained got harder.

So by the time TurboQuant arrived, the low-hanging fruit was gone. TurboQuant matters less because it saves additional memory and more because it marks where KV cache compression is approaching the information-theoretic limit. You’re close to the Shannon ceiling.
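The per-token figures in that progression can be roughly reconstructed from the models’ published architecture configurations. The layer, head, and latent dimensions below are taken from those public configs as I understand them (fp16, 2 bytes per value); treat the reconstruction as approximate:

```python
# Reconstructing the per-token KV cache figures from published model configs
# (dimensions assumed from public architecture descriptions; fp16 = 2 bytes).

def mha_kv_per_token(layers, hidden, bytes_per=2):
    return 2 * layers * hidden * bytes_per        # keys + values, every head independent

def gqa_kv_per_token(layers, kv_heads, head_dim, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per   # shared KV heads

def mla_kv_per_token(layers, latent_dim, bytes_per=2):
    return layers * latent_dim * bytes_per        # one compressed latent per layer

gpt2_xl     = mha_kv_per_token(48, 1600)          # 307,200 B  ~ 300 KB/token
llama3_8b   = gqa_kv_per_token(32, 8, 128)        # 131,072 B  = 128 KB/token
deepseek_v3 = mla_kv_per_token(61, 576)           # 70,272 B   ~ 68.6 KB/token
print(gpt2_xl, llama3_8b, deepseek_v3)
```

The DeepSeek figure assumes a per-layer latent width of 576 (a 512-dim compressed KV latent plus a 64-dim decoupled positional key), which is how the 68.6 KB number falls out of the arithmetic.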
Every additional bit squeezed out from here costs more engineering effort and risks more quality degradation than the last.

There’s also a problem no compression algorithm touches. When the KV cache grows too large for available GPU memory, models often summarize their own context into a shorter form and continue from the summary. That compression is lossy in ways the model can’t detect. A specific budget figure becomes “approximately that amount.” A nuanced instruction becomes “something about guidelines.” The model keeps going, confident in information that no longer fully exists. Compression makes the cache smaller. It doesn’t solve the problem of deciding what’s actually worth keeping.

So Why the Market Reaction Was Wrong

The stocks fell for the same reason markets often overreact to technical announcements: most investors read the headline, not the paper.

TurboQuant only addresses inference memory, specifically the KV cache during inference. Training a model, the months-long, multi-billion-dollar process of teaching the model in the first place, stresses memory in a fundamentally different way, driven by activations, gradients, and optimizer states. TurboQuant has zero effect on any of that. The massive HBM buildout that hyperscalers are funding exists primarily to train and retrain ever-larger models. That demand curve is untouched by a KV cache compression algorithm.

Beyond training, TurboQuant is a research result with no production deployment. The paper was originally published in 2025 and got re-featured on the blog ahead of ICLR. Google itself hasn’t deployed it widely in the year since the math was first documented.

The 6x headline also deserves scrutiny. It’s benchmarked against 16-bit full precision, but commercial inference already runs at 4 or 8 bits as standard practice. Simple bit arithmetic puts the marginal gain over a deployed 8-bit system closer to 2.7x (8/3), and over a 4-bit system at barely 1.3x (4/3), so the real improvement over deployed systems is smaller than the number suggests.

Then there’s Jevons Paradox.
When DeepSeek launched dramatically more efficient inference in early 2025, the same fear spread: HBM demand would collapse. It didn’t. Cheaper inference expanded the set of organizations that could economically deploy AI, which drove more total demand for infrastructure. When inference costs fall, more applications become viable, more models stay active, and memory companies end up the long-run beneficiaries.

[Figure: Jevons Paradox in AI Memory: How DeepSeek and TurboQuant Both Drove Higher HBM Demand Despite Efficiency Gains]

The market has now seen this exact movie twice, and it panicked both times. Weird, right?


So What TurboQuant Actually Changes

The algorithm does have real implications. They’re just different from what the market priced in.

The most immediate is inference economics. TurboQuant compresses the KV cache, which determines how many concurrent users a single GPU can serve and how long a context window is practical at scale. If it gets deployed across production inference stacks, throughput per GPU increases. That matters for AI products running millions of queries per day, where inference cost is the recurring expense that determines profitability. Anything that changes the memory-to-compute ratio per query shifts the cost structure of running AI products.

The longer-term implication is on-device AI. Right now, running a capable language model locally on a phone or laptop requires either compromising on quality or buying expensive hardware. If TurboQuant’s approach gets implemented in local inference runtimes at scale, the hardware floor for running a meaningful AI model drops. Models that currently require cloud infrastructure could run locally. But that plays out over years, not quarters, and it has more to do with software ecosystem adoption than with whether memory chip stocks are correctly priced today.

TurboQuant is real math that compresses one specific type of memory usage during one phase of AI operation. But it doesn’t build fabs and it doesn’t change training economics. Memory gets built in clean rooms in South Korea and Idaho, by people operating tools that cost hundreds of millions of dollars each. That part of the supply chain moves on a completely different clock than an algorithm (or a research paper).

So the crunch only ends when the fabs are done.