Von Neumann Refuses to Die

Why the architecturally correct answer lost the market, and why the window is reopening.

Current-gen Nvidia GPUs on common LLM inference workloads sit at 20–30% utilization. The rest of the time, they’re waiting.

The bottleneck is HBM. LLM workloads split into two microarchitecturally opposite regimes. Prefill (processing the input prompt) is compute-bound, and GPUs handle it well. Autoregressive decode, generating tokens one at a time, is a different beast entirely. The GPU L2 cache hits on fewer than 2% of accesses, so nearly every weight load misses to HBM and pays a ~10× bandwidth penalty. The chip sits idle 70–80% of the time.

Metric	Prefill / Training	Autoregressive Decode
Arithmetic intensity	hundreds of FLOP/byte	~1 FLOP/byte
vs H100 ridge (~295 FLOP/byte)	above, compute-bound	200–300× below
L1 cache hit rate	high (GEMM tile reuse)	1.5–16%
L2 cache hit rate	68–84%	0.06–1.6%
Cycles stalled on memory	~20–30%	70–80%
MFU	38–55%	<20%, often <10%
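The ridge figure in the table is just peak compute divided by memory bandwidth. A minimal roofline sketch follows. The 989 TFLOP/s peak is an assumption (the published dense BF16 spec for H100 SXM, not a number from this article); the 3.35 TB/s bandwidth is the H100 figure used elsewhere in this piece.

```python
# Roofline sketch. H100_PEAK_FLOPS is an assumed spec-sheet value;
# H100_HBM_BW is the 3.35 TB/s figure quoted in this article.
H100_PEAK_FLOPS = 989e12   # FLOP/s (assumption: dense BF16, H100 SXM)
H100_HBM_BW = 3.35e12      # bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline model: throughput is capped by whichever limit is lower."""
    return min(peak_flops, intensity * bandwidth)


if __name__ == "__main__":
    ridge = ridge_point(H100_PEAK_FLOPS, H100_HBM_BW)
    print(f"ridge: {ridge:.0f} FLOP/byte")
    for ai in (1, 4, 300):
        frac = attainable_flops(ai, H100_PEAK_FLOPS, H100_HBM_BW) / H100_PEAK_FLOPS
        print(f"AI = {ai:>3} FLOP/byte -> {frac:.1%} of peak")
```

Under these assumptions, a kernel at 1 FLOP/byte can reach only a fraction of a percent of peak compute, no matter how well it is tuned.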

This is not a tuning problem. One token generated = entire weight set streamed from HBM with zero temporal reuse. Batching helps for weight matmuls, because one weight read serves every request in the batch. But each request owns its own KV cache, so attention stays at ~1 FLOP/byte regardless of batch size. Large-batch decode doesn’t become compute-bound; it saturates HBM bandwidth instead. Von Neumann has no answer for this. The architecture is structurally wrong for the workload.
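The batching argument can be checked with a back-of-envelope model. This is a sketch under stated assumptions, not a profiler result: 16-bit weights and KV entries, plain multi-head attention (no grouped-query sharing, which would raise the attention figure), and only weight and KV traffic counted. The function names are made up for illustration.

```python
def weight_matmul_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """FLOP/byte for the weight matmuls in one decode step.

    Each weight is read once and used for one multiply-add (2 FLOP)
    per request in the batch, so intensity grows linearly with batch.
    """
    flops_per_param = 2.0 * batch
    return flops_per_param / bytes_per_param


def attention_intensity(batch: int, context: int, d_model: int,
                        bytes_per_elem: float = 2.0) -> float:
    """FLOP/byte for attention in one layer during one decode step.

    Every request reads its own K and V, so both FLOPs and bytes
    scale with batch and the ratio does not move.
    """
    flops = batch * 4.0 * context * d_model                       # QK^T plus AV
    kv_bytes = batch * 2.0 * context * d_model * bytes_per_elem   # K and V
    return flops / kv_bytes


if __name__ == "__main__":
    for b in (1, 16, 256):
        w = weight_matmul_intensity(b)
        a = attention_intensity(b, context=4096, d_model=8192)
        print(f"batch={b:>3}  weights: {w:6.1f} FLOP/byte  attention: {a:.1f} FLOP/byte")
```

In this model the weight matmuls climb toward the ridge as the batch grows, while attention stays pinned at 1 FLOP/byte, which is why large batches end up bandwidth-saturated rather than compute-bound.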

Everyone could see the problem. And wafer scale looked like the solution.

Instead of the Von Neumann pattern (random weight fetches from slow HBM into compute cores), weights stream coherently through the chip’s internal SRAM fabric. No HBM, no cache misses, no stalls. Cerebras’ WSE-3 keeps activations on-chip and streams weights through 21 petabytes/second of internal bandwidth. Groq’s LPU goes further: all weights fixed in on-chip SRAM, zero HBM, deterministic throughput.

Von Neumann (GPU cluster):
TENSOR CORES
    ↕ 33 TB/s
L1 SRAM (256KB/SM)
    ↕ 12 TB/s
L2 CACHE (~50MB)
    ↕ 3.35 TB/s (H100) / 8 TB/s (B200)
HBM (weights live here)
    ↕ ~900 GB/s NVLink
OTHER GPU HBM (model parallel)

Wafer Scale (Cerebras WSE-3):
900K CORES, each with 48KB SRAM, 21 PB/s aggregate on-wafer
    ↕ weight streaming only (large models)
MemoryX (weights live here for large models)

Bottleneck	GPU	Wafer Scale
Weight access	Stream from HBM: 3.35–8 TB/s	Stream through on-wafer mesh: 21 PB/s (~2,600× B200’s 8 TB/s, ~6,300× H100’s 3.35 TB/s)
Activation storage	Competes with weights in HBM	Dedicated on-chip SRAM, never evicted
KV cache	HBM-paged, scattered, cache-hostile	On-chip SRAM, no access penalty
Cache miss penalty	~10× bandwidth drop L2→HBM	No off-chip penalty during compute phase

The 0.06–1.6% L2 hit rate problem does not exist on WSE-3 during the compute phase, because there is no L2-to-HBM boundary to cross.
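To see what those bandwidth figures mean per token, here is a rough floor on decode latency: every weight has to be read once, so time per token cannot be lower than weight bytes divided by bandwidth. The model size (8B parameters in FP16, about 16 GB) is a hypothetical chosen so that it fits both in HBM and in WSE-3’s SRAM. The bound ignores compute, KV cache traffic, and whether the aggregate on-wafer figure is reachable in practice.

```python
# Bandwidth figures are the ones quoted in this article.
BANDWIDTH_BYTES_PER_S = {
    "H100 HBM": 3.35e12,
    "B200 HBM": 8e12,
    "WSE-3 on-wafer SRAM (aggregate)": 21e15,
}


def min_seconds_per_token(params: float, bytes_per_param: float,
                          bandwidth: float) -> float:
    """Lower bound on batch-1 decode latency: stream all weights once."""
    return params * bytes_per_param / bandwidth


if __name__ == "__main__":
    params, bytes_per_param = 8e9, 2.0   # hypothetical 8B model in FP16
    for name, bw in BANDWIDTH_BYTES_PER_S.items():
        t = min_seconds_per_token(params, bytes_per_param, bw)
        print(f"{name:<34} {t * 1e3:10.4f} ms/token  ({1 / t:,.0f} tok/s ceiling)")
```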

Everyone was hyped. Tesla, Groq, Cerebras, Etched, SambaNova: billions in funding, multiple IPO attempts, hyperscaler interest. Wafer scale was supposed to dethrone Von Neumann.

Then most of them died.

Tesla Dojo: axed August 2025. Musk: “Once it became clear that all paths converged to AI6, I had to shut down Dojo and make some tough personnel choices, as Dojo 2 was now an evolutionary dead end.” Tesla signed a $16.5B deal with Samsung for conventional AI chips instead.

Groq: acquired by Nvidia for $20B (December 2025). The purest expression of the “weights don’t touch HBM” insight, now an Nvidia product line.

Etched AI: raised $500M at $5B valuation, transformer computation hardwired into silicon, claimed 90%+ FLOPS utilization. Still not publicly shipping, 20+ months after announcement.

Tenstorrent (Jim Keller): not dead, but pivoted hard to IP licensing rather than selling chips. $693M raised, $2.6B valuation, $150M+ in contracts with LG, Hyundai, Samsung.

Nvidia is still the undisputed king. What happened?

Two separate failure modes, usually conflated.

Intrinsic problems with wafer scale. Manufacturing defects are unavoidable at wafer scale: every wafer has dead elements, and the design has to route around them. It took years to engineer a working solution. Cerebras got there (tiny 0.05mm² cores, redundant mesh routing, 70,000 spare cores kept in reserve), but it was genuinely hard, and not everyone did.

The SRAM ceiling is the cruel irony. WSE-3 has 44 GB of on-chip SRAM, not enough for frontier models. When a model exceeds this, Cerebras streams weights from external MemoryX through a 150 GB/s off-chip link. Compare that to Nvidia NVLink’s 1.8 TB/s on Blackwell (900 GB/s on H100). The chip that was supposed to eliminate the memory bottleneck ends up constrained by a slow off-chip link, because models grew faster than on-chip SRAM could keep up.
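A quick capacity check makes the ceiling concrete. This sketch uses the 44 GB and 150 GB/s figures above; the model sizes are hypothetical examples. It is optimistic on two counts: it gives all of SRAM to weights (activations and KV cache need room too), and it is pessimistic on one: a single weight pass can serve a whole batch, which it does not credit.

```python
WSE3_SRAM_BYTES = 44e9              # on-chip SRAM, from the article
MEMORYX_LINK_BYTES_PER_S = 150e9    # off-chip link, from the article


def weight_bytes(params: float, bytes_per_param: float = 2.0) -> float:
    """Total weight footprint in bytes."""
    return params * bytes_per_param


def fits_on_wafer(params: float, bytes_per_param: float = 2.0) -> bool:
    """True if the weights alone fit in on-chip SRAM."""
    return weight_bytes(params, bytes_per_param) <= WSE3_SRAM_BYTES


def stream_seconds_per_pass(params: float, bytes_per_param: float = 2.0,
                            link: float = MEMORYX_LINK_BYTES_PER_S) -> float:
    """Time to push every weight across the off-chip link once."""
    return weight_bytes(params, bytes_per_param) / link


if __name__ == "__main__":
    for params in (8e9, 70e9, 405e9):   # hypothetical model sizes
        if fits_on_wafer(params):
            print(f"{params / 1e9:.0f}B FP16: fits in SRAM")
        else:
            t = stream_seconds_per_pass(params)
            print(f"{params / 1e9:.0f}B FP16: does not fit, {t:.2f} s per weight pass")
```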

Why the competition held up better than expected. Von Neumann turned out to be surprisingly good at fighting back. Overlapped layer streaming (while layer i computes, layer i+1 weights prefetch) reportedly “completely hides load latency.” Triple-level pipelining runs tensor core ops, shared memory transfers, and global memory prefetch simultaneously. FlashAttention cut HBM accesses for attention by ~9×. FP4/INT4 quantization cut bytes moved. These don’t change the fundamental roofline; they make the wrong architecture more competitive, not correct.
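Overlapped layer streaming is easy to reason about with a toy timeline. The sketch below assumes two weight buffers (double buffering) and made-up per-layer times; it models scheduling only, not any particular runtime. Each step costs the longer of "compute layer i" and "load layer i+1".

```python
from typing import Sequence


def serial_time(load: Sequence[float], compute: Sequence[float]) -> float:
    """No overlap: every load and every compute runs back to back."""
    return sum(load) + sum(compute)


def overlapped_time(load: Sequence[float], compute: Sequence[float]) -> float:
    """Double-buffered prefetch: load of layer i+1 overlaps compute of layer i.

    Only the first load is fully exposed; after that each step takes
    max(compute[i], load[i + 1]).
    """
    if len(load) != len(compute):
        raise ValueError("need one load time and one compute time per layer")
    total = load[0]
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0.0
        total += max(compute[i], next_load)
    return total


if __name__ == "__main__":
    # Hypothetical per-layer times in arbitrary units.
    compute_heavy = ([1.0] * 4, [3.0] * 4)   # compute dominates each layer
    load_heavy = ([3.0] * 4, [1.0] * 4)      # weight loading dominates
    for name, (load, compute) in (("compute-heavy", compute_heavy),
                                  ("load-heavy", load_heavy)):
        print(name, serial_time(load, compute), overlapped_time(load, compute))
```

In this toy model the prefetch hides load latency only when compute per layer is at least as long as the next load. When loading dominates, the total collapses to roughly the sum of the loads, which is the bandwidth floor again.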

But the decisive factor was timing. 2020–2024 was dominated by large training runs. Training is compute-bound, high arithmetic intensity, the one regime where Von Neumann tensor cores and CUDA are genuinely competitive. Cerebras had no credible large-model training story until MemoryX. When hyperscalers were evaluating infrastructure, wafer scale had nothing to offer them. Add a 20-year CUDA moat: every framework, every paper, every researcher assumes it, and switching means rewriting the entire stack. Architectural superiority is irrelevant when tooling compatibility costs a year of engineering.

Von Neumann won the era it was actually suited for. Wafer scale arrived late, without a training story, into an ecosystem with no reason to switch.

None of that changes the physics.

The arithmetic intensity mismatch during decode remains. Nvidia’s mitigations paper over it:

Mitigation	Effect	Limit
FP4/INT4 quantization	Cuts bytes moved; raises effective AI	Decode AI still ~4 FLOP/byte max
Larger L2 (50→96MB, H100→B200)	Marginally better hit rate	Dwarfed by model size; decode L2 hit stays near-zero
HBM3e bandwidth (8 TB/s B200)	~2.4× H100 bandwidth	Makes misses cheaper; doesn’t fix the miss rate
FlashAttention	~9× fewer HBM accesses for attention	Prefill/training benefit; decode KV-cache-streaming-bound regardless
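The quantization row is simple arithmetic. At batch 1, each weight read feeds one multiply-add, so intensity is 2 FLOP divided by the bytes per weight. The sketch below uses the article’s ~295 FLOP/byte H100 ridge; the helper names are illustrative only.

```python
H100_RIDGE = 295.0  # FLOP/byte, the article's figure

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def decode_weight_intensity(bytes_per_param: float) -> float:
    """Batch-1 decode: one multiply-add (2 FLOP) per weight read."""
    return 2.0 / bytes_per_param


def gap_to_ridge(intensity: float, ridge: float = H100_RIDGE) -> float:
    """How many times below the compute-bound threshold a kernel sits."""
    return ridge / intensity


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        ai = decode_weight_intensity(nbytes)
        print(f"{name}: {ai:.0f} FLOP/byte, {gap_to_ridge(ai):.0f}x below ridge")
```

Going from FP16 to FP4 moves a quarter of the bytes, yet the kernel is still tens of times below the ridge.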

And the workload is shifting. Agentic AI is decode-heavy by construction: long reasoning chains, tool calls, multi-turn conversations. The use case wafer scale was designed for is becoming the dominant production workload.

There’s also a subtler point about what gets streamed off-chip when models don’t fit on-wafer. Weights are accessed once per forward pass in a predictable, sequential pattern, ideal for prefetching. Activations and KV cache are accessed repeatedly and unpredictably, and they grow with context length, not just parameter count. Streaming weights off-chip while keeping activations on-chip is the correct partition. Von Neumann does the opposite by accident of history.

	GPU (von Neumann)	Cerebras WSE-3
What streams off-chip?	Weights + activations + KV cache; everything competes for HBM	Weights only, layer-by-layer
Activations	HBM, evicted and reloaded	On-chip SRAM, never leave the wafer
KV cache	HBM-paged, scattered, cache-hostile	On-chip SRAM, no access penalty
Access pattern	Random KV access causes cache thrashing	Deterministic layer-by-layer weight streaming

As context windows grow, this gets worse for GPUs, not better.
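The growth is linear in context length and easy to estimate. The sketch below computes KV cache size for a single request. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit entries) is an assumption modeled on a Llama-3-70B-style configuration with grouped-query attention; substitute the numbers for the model in question.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   batch: int = 1) -> int:
    """Bytes of K and V held for `batch` requests at a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * context_tokens * batch


if __name__ == "__main__":
    # Assumed shape, not taken from this article.
    shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
    for ctx in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(ctx, **shape) / 2**30
        print(f"context {ctx:>7,}: {gib:5.1f} GiB of KV cache per request")
```

Under these assumptions one long-context request carries tens of GiB of KV cache, all of which is re-read on every decode step.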

So who’s still standing?

Cerebras is the main survivor. Revenue: $25M (2022) → $290M (2024) → $510M (2025), 76% YoY growth. IPO’d May 2026. Multi-year OpenAI agreement. Reports 2–21× faster inference than B200 on supported models. The caveats are real: still operating at a loss, extreme customer concentration (~86% from G42/UAE-linked entities), and the MemoryX bandwidth ceiling for very large models. But the company is alive, growing, and winning real contracts.

Tenstorrent (Jim Keller) is worth separating from the wafer-scale story; it’s not wafer-scale. Tensix cores are chiplet-based programmable dataflow. The company is surviving through IP licensing: $150M+ in contracts with LG, Hyundai, and Samsung, with Intel and Qualcomm reportedly interested as acquirers. A third path: not winning on chips, but keeping the architecture idea alive through licensing.

SambaNova: Reconfigurable Dataflow Units, still shipping, $5B valuation, claims 5× faster than competing chips for agentic workloads. The quietest survivor.

Etched: the boldest bet. Transformer computation baked permanently into silicon. Either it ships and changes the conversation, or it becomes the most expensive yield problem in recent memory.

Von Neumann refuses to die. But it’s running on CUDA lock-in and training-era inertia, not on being correct for the workload that now dominates. The physics argument for wafer scale doesn’t go away. The SRAM ceiling is a real constraint, but it’s an engineering problem, not an architectural one. And with every passing quarter of agentic workloads, the regime where the GPU is structurally wrong grows larger.

Sources: arXiv:2503.08311 · arXiv:2504.06319 · arXiv:2512.01644 · arXiv:2512.02189 · arXiv:2409.00287 · arXiv:2602.18568 · Cerebras architecture deep-dive · Meta Llama 3 Herd of Models