Setting the Scene
On Wednesday, something curious happened on Wall Street. The Nasdaq 100 was climbing — but memory semiconductor stocks were swimming against the tide. SanDisk dropped 5.7%, Western Digital fell 4.7%, Seagate slid 4%, and Micron lost 3%. The trigger? A compression algorithm called TurboQuant, published by Google Research. The selloff quickly rippled across the Pacific, dragging down shares of Samsung Electronics and SK hynix — the world's two largest memory chipmakers — along with the rest of Korea's semiconductor sector.
"AI tech that cuts memory usage" — at first glance, that sounds like terrible news for memory companies. But here's the thing: what this technology actually reduces is temporary working memory on the GPU (KV cache¹), not the HBM² modules or DRAM sticks plugged into servers. There's a gap between what the market read and what the technology actually says — and beyond that gap lies a much bigger question spanning the entire AI hardware stack. Today, let's unpack both the gap and the question.
(Honestly, if you don't care about how the algorithm works, feel free to skip straight to "Why Did the Market React?")
What TurboQuant Actually Does
Let's start with the tech. When an AI carries on a conversation, it needs to remember what was said earlier. That memory gets stored in something called KV cache — a temporary buffer in GPU memory. The longer the conversation, the more this buffer balloons, and it's one of the biggest drivers of AI inference costs. TurboQuant is an algorithm that compresses this memory as tightly as possible while keeping the content nearly intact. No retraining or fine-tuning required.
Here's the core idea, explained by analogy. The data AI models store has wildly uneven value distributions — some numbers are big, some are small, and the spread is all over the place. That kind of data is hard to compress efficiently. TurboQuant's first stage, called PolarQuant, applies a random rotation to the data, smoothing out the distribution so all values become roughly uniform. Think of it like taking a pile of oddly shaped luggage, shuffling everything around, and ending up with uniform boxes that pack neatly into a container. In the paper's mathematical terms, after rotation each coordinate follows a beta distribution and becomes nearly independent, enabling optimal scalar quantization³ on each coordinate individually.
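The rotate-then-quantize idea is easy to demo. Below is a minimal NumPy sketch, not the paper's implementation: it uses a dense QR-based rotation (the paper would rely on fast structured rotations) and a plain uniform scalar quantizer, and all function names are my own. A heavy-tailed vector quantized directly wastes most of its levels covering outliers; after a random rotation the coordinates flatten out, so the same 4-bit budget yields a lower reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    """Sample a random orthogonal matrix via QR decomposition (a slow
    stand-in for the structured rotations a real implementation would use)."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def quantize_uniform(x, bits):
    """Uniform scalar quantization of each coordinate to 2**bits levels."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo

d = 1024
x = rng.standard_t(df=2, size=d)  # heavy-tailed vector: hard to quantize raw
Q = random_rotation(d)

# Rotate, quantize in the rotated basis, rotate back.
x_hat = Q.T @ quantize_uniform(Q @ x, bits=4)
err_rotated = np.mean((x - x_hat) ** 2)

# Baseline: quantize the raw coordinates directly at the same bit budget.
err_direct = np.mean((x - quantize_uniform(x, bits=4)) ** 2)
print(f"MSE direct={err_direct:.4f}, rotated={err_rotated:.4f}")
```

The intuition matches the luggage analogy: the rotation preserves the vector's norm exactly (it's orthogonal), but spreads the outliers' energy across all coordinates, shrinking the range the quantizer has to cover.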

But there's another wrinkle. AI doesn't just store information — it constantly compares stored values against each other through dot-product operations. As the paper proves, quantizers optimized for MSE (mean squared error) introduce a systematic bias into these comparisons. That's where the second stage comes in: QJL (Quantized Johnson-Lindenstrauss). After the first round of compression shrinks the data, QJL applies a 1-bit correction to the residual error, completely eliminating that dot-product bias.
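Here is a toy version of that residual correction, a sketch rather than the paper's actual QJL construction: it keeps only the signs of m random projections of the residual plus its norm, then uses the Gaussian identity E[sign(s·r)(s·q)] = sqrt(2/π)·⟨q,r⟩/‖r‖ to form an unbiased estimate of the residual's dot-product contribution. The sketch size, the stage-1 quantizer, and every name here are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 64, 20000, 50

# Random Gaussian sketching matrix shared by all keys.
S = rng.normal(size=(m, d))

def quantize_uniform(x, bits):
    """Stage-1 stand-in: plain uniform scalar quantizer."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo

def qjl_dot(q, r):
    """Estimate <q, r> from a 1-bit sketch of r: only sign(S @ r) and
    ||r|| are stored.  Uses E[sign(s.r)*(s.q)] = sqrt(2/pi)*<q,r>/||r||."""
    signs = np.sign(S @ r)          # the 1-bit code
    r_norm = np.linalg.norm(r)      # one scalar kept alongside it
    return r_norm * np.sqrt(np.pi / 2) * np.mean(signs * (S @ q))

q = rng.normal(size=d)
base_err, corrected_err = [], []
for _ in range(trials):
    k = rng.normal(size=d)
    k_hat = quantize_uniform(k, bits=3)  # stage-1 compressed key
    r = k - k_hat                        # residual the correction targets
    true = q @ k
    base_err.append(abs(q @ k_hat - true))
    corrected_err.append(abs(q @ k_hat + qjl_dot(q, r) - true))

print(f"mean |error|: base={np.mean(base_err):.3f}, "
      f"corrected={np.mean(corrected_err):.3f}")
```

Averaged over many keys, the corrected dot products land much closer to the true values than the stage-1 estimates alone, which is the debiasing effect the paper is after.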
The experimental results are striking. At 3.5 bits, the model maintains virtually identical quality to the uncompressed original. Even at an aggressive 2.5 bits, quality degradation was minimal. On the Llama-3.1-8B-Instruct model, the LongBench benchmark average was essentially unchanged (50.06 vs. 50.06), and the Needle in a Haystack test maintained 100% accuracy at 104,000 tokens. Overall compression: 4.5x or better.
The speed numbers are especially impressive. Traditional product quantization (PQ) takes about 240 seconds to index 1,536-dimensional data. TurboQuant does it in 0.0013 seconds. That's roughly a 180,000x difference — possible because TurboQuant is a data-oblivious method that doesn't need to learn a codebook from the data.
One important caveat, though. The "up to 8x speed improvement" that Google's blog post highlights was measured on a specific stage — the attention logit computation — and benchmarked against a JAX baseline. It's not an 8x improvement in end-to-end inference throughput. And the "6x memory reduction" claim also has a subtle discrepancy between blog and paper — the paper is more conservative, stating "4.5x or more." Numbers getting packaged differently depending on the publication channel is something you should always watch for when reading tech news.
Why Did the Market React?
The market's logic was simple: "If AI only needs one-sixth the memory, doesn't that mean less demand for memory?" And because memory stocks had already run up massively this year — SanDisk was up over 200% since January — traders were looking for an excuse to take profits. But step back for a moment and you'll notice that KV cache and HBM both use the word "memory," yet they operate on completely different layers.
KV cache is temporary data stored in GPU memory so that an LLM doesn't have to recompute attention from scratch every time it generates a token. When context windows stretch to 1 million tokens, KV cache alone eats up roughly 512 GB for a Llama 3 70B model — four times larger than the model weights themselves. That's why KV cache compression is one of the hottest research topics in AI infrastructure right now.
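You can sanity-check that figure with back-of-envelope arithmetic. The sketch below uses the public Llama 3 70B architecture numbers (80 layers, 8 grouped-query KV heads, head dimension 128) and assumes fp16 storage for a single sequence; exact totals depend on precision, batching, and runtime overhead, which is presumably where the ~512 GB figure above lands.

```python
# Back-of-envelope KV cache sizing for a Llama-3-70B-like config.
# Architecture numbers are public Llama 3 figures; the serving
# assumptions (fp16, one sequence, no overhead) are mine.
layers     = 80         # transformer layers
kv_heads   = 8          # key/value heads (grouped-query attention)
head_dim   = 128        # dimension per head
bytes_elem = 2          # fp16
tokens     = 1_000_000  # a 1M-token context

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_elem
total_gb = bytes_per_token * tokens / 1e9

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{total_gb:.0f} GB at {tokens:,} tokens")
```

Even under these conservative assumptions the raw cache lands in the hundreds of gigabytes, several times the capacity of a single GPU's HBM stack, which is exactly why compressing it is such an active research area.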
HBM demand, on the other hand, is driven by bandwidth bottlenecks across the entire training and inference pipeline. TrendForce estimates 2026 HBM demand will grow 70% year-over-year, and Bank of America projects the 2026 HBM market at roughly $54.6 billion (up 58% YoY). SK hynix, Samsung, and Micron have all said their 2026 HBM production is essentially sold out.
Here's a simple analogy. TurboQuant improves how you organize the sticky notes on your desk. HBM demand is about needing more offices in the building. Better note-taking doesn't reduce the need for office space. If anything, more efficient note-taking means each office gets more work done — which might make you want to build even more offices.
This Isn't Just Google — Meet Nvidia's KVTC
Here's a piece of context that's easy to miss. Google isn't the only one working on KV cache compression. At the same ICLR 2026 conference, Nvidia is presenting KVTC (KV Cache Transform Coding) — a technique that applies the transform coding principles from JPEG image compression to KV cache. They claim up to 20x compression, with certain cases hitting 40x. (Of course it's Nvidia. Of course.)
Google's TurboQuant (6x compression, no training needed) and Nvidia's KVTC (20x compression, requires pre-calibration) take different approaches, but they're solving the same problem. And Nvidia plans to integrate KVTC into Dynamo, its inference framework.
Why does this matter? Once KV cache compression ships in production, the same GPU can handle longer contexts and more concurrent requests. For AI operators, that means lower inference costs. But does efficiency kill demand? Not necessarily — think back to the Jevons Paradox⁴ I covered in a previous newsletter. The market, however, is reading this technology from a different angle.
Oswarld's View
Honestly, I find the market's reaction more interesting than TurboQuant itself. Because this isn't just a memory stock story.
Zoom out a bit, and the entire AI hardware stack is facing the same question. Nvidia posted $215.9 billion in revenue for fiscal year 2026 with net margins above 55% — unprecedented numbers — yet the stock is trading about 11% below its October high. Same story with Micron: they reported a record quarter just two days ago ($23.86 billion in revenue, 74.9% gross margin), but the market fixated on whether they can sustain $25+ billion in capital expenditures. GPUs are down. DRAM is down. NAND storage is down.
The real question the market is asking is: "Can this pace of infrastructure investment be sustained?" The combined 2026 capex guidance from the Big Four — Microsoft, Meta, Alphabet, and Amazon — comes to $650–700 billion. That's among the largest concentrations of private capital ever directed at a single purpose in human history. The debate among investors over whether the returns can justify that spend is growing louder. This is the same backdrop that fueled all the "AI bubble" talk.
From a GTM strategy perspective, here's how I frame it. Every technology infrastructure cycle has a "build phase" and an "optimize phase." In the build phase, the strategy is "just deploy it." In the optimize phase, the strategy shifts to "how do we maximize what we've already deployed?" TurboQuant, Nvidia's KVTC, and hyperscalers developing their own chips (Google's TPU, Amazon's Trainium) — all of these are signals of the optimize phase.
Does that make this a bearish signal? I don't think so. The optimize phase isn't the end of growth — it's growth maturing. What changes is how the market prices it. In the build phase, the trade was "buy everything." In the optimize phase, you need to sort out who benefits from the efficiency gains and who bears the costs.
Memory fundamentals are still strong. HBM is sold out through all of 2026, and BofA's market estimate is $54.6 billion (58% YoY growth). But once the narrative that "algorithms are replacing hardware" lodges in investors' minds, the higher the valuations, the more twitchy the reaction. If SanDisk is up 200%+ since January, a single research paper can trigger a 5.7% drop.
The key is distinguishing between time horizons. Software optimizations like TurboQuant could start affecting the rate of growth in hardware demand from 2027 onward. But 2026's memory supply shortage is a physical problem — fab construction and yields — not something an algorithm can fix. When the market confuses these two timelines, that's where both the opportunity and the risk live.
Wrapping Up
One more thing before I close. The original TurboQuant paper (arXiv:2504.19874) was published on April 28, 2025. That's almost a year ago. Someone spent a year refining this work, got it posted on the Google Research blog, prepared it for ICLR 2026, and finally moved the market. It's fascinating to think about what accumulated in that gap between a research lab idea and a Wall Street stock move.
Here's the bottom line. TurboQuant is a meaningful advance in AI inference efficiency. But the reason memory stocks dropped today isn't about this one algorithm — it's because the market is starting to read the "build to optimize" phase transition across the entire AI hardware stack. If you can understand the layers of technology and separate the timelines, you can make better calls in this kind of volatility.
If a year-old paper shook today's market, what are the papers coming out of labs right now going to shake a year from now?
⚠️ This newsletter is not a recommendation to buy or sell any specific stock. Always do your own research and consult a professional before making investment decisions.
References & Further Reading
Key Sources
Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", arXiv:2504.19874, 2025. To be presented at ICLR 2026. — The original TurboQuant paper at the center of today's newsletter. You can find the mathematical proofs and experimental results for the two-stage compression architecture here. Pay special attention to Table 1 (LongBench results) and Figure 4 (Needle in a Haystack results).
Google Research, "TurboQuant: Redefining AI efficiency with extreme compression", 2026. — Google's blog post summarizing the paper for a general audience. Note the slight discrepancies between blog and paper numbers — check the paper for exact figures.
Łańcucki et al., "KV Cache Transform Coding for Compact Storage in LLM Inference", ICLR 2026. — Nvidia's KVTC paper. Comparing their approach with TurboQuant gives you a view of the full KV cache compression landscape.
Investing.com, "MU, WDC, SNDK fall: Why Google's TurboQuant is rattling memory stocks", March 25, 2026. — Coverage of today's memory stock selloff.
Background Reading
SK hynix, "2026 Market Outlook: SK hynix's HBM to Fuel AI Memory Boom", January 2026. — Useful for understanding the structural demand picture in the 2026 memory market, including HBM3E and HBM4 demand forecasts.
Fortune, "Rampant AI demand for memory is fueling a growing chip crisis", February 2026. — A deep dive into how AI is reshaping the entire memory market.
CNBC, "Even a $1 trillion forecast can't break Nvidia out of a 2026 funk", March 2026. — Analysis of why Nvidia's stock is treading water despite record results. Good for understanding the market psychology behind the build-to-optimize phase shift.
VentureBeat, "Nvidia says it can shrink LLM memory 20x without changing model weights", March 2026. — Covers Nvidia's KVTC from a production deployment angle. A useful companion piece for understanding how it differs from Google's TurboQuant.
¹ KV Cache (Key-Value Cache): Temporary memory where an LLM stores previously computed "keys" and "values" during a conversation. Without it, the model would have to recompute everything from scratch for each new token — so the longer the conversation, the more GPU memory this cache consumes.
² HBM (High Bandwidth Memory): Ultra-fast memory stacked vertically right next to the AI chip (GPU). It transfers data several times faster than standard DRAM, making it a critical component for AI training and inference.
³ Quantization: A technique that reduces high-precision numbers (e.g., 32-bit floating point) to fewer bits (e.g., 4-bit or 3-bit). It shrinks memory footprint, but pushing too aggressively can degrade information quality.
⁴ Jevons Paradox: The phenomenon where improving the efficiency of resource use actually increases total consumption. Named after 19th-century British economist William Stanley Jevons, who observed that improvements in coal efficiency didn't lead to less coal being burned — quite the opposite.

