Sara Hooker’s essay “On the Slow Death of Scaling” (2026) is a thought-provoking piece that deserves a careful read.1 What follows is my interpretation (or perhaps more accurately, a riff) on her arguments, where I’ll both steelman her position and push back where I think the evidence points elsewhere.

Scaling Laws Require Controlled Environments

Scaling laws hold very well in controlled environments, but any individual scaling law depends on the precise setup under which it was fit. More generally, we have multiple axes of improvement available to us: data quality, model architecture, optimization techniques, and other algorithmic advances.2 Each of these can shift or bend a scaling law and significantly improve performance.
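One way to make "shifting and bending" concrete is a Chinchilla-style parametric loss law. The sketch below uses the coefficients published by Hoffmann et al. for illustration; treating better data as a smaller data-term coefficient is a purely illustrative assumption, not a claim from Hooker's essay.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss under a Chinchilla-style power law:
    L(N, D) = E + A / N^alpha + B / D^beta.
    Default coefficients are the published Chinchilla estimates."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling the model 10x on the same data lowers predicted loss,
# but with diminishing returns.
base = chinchilla_loss(8e9, 2e12)
scaled = chinchilla_loss(80e9, 2e12)

# An algorithmic improvement (say, better data quality) can be modeled
# as bending the law itself: here, a smaller B coefficient (illustrative).
better_data = chinchilla_loss(8e9, 2e12, B=300.0)
```

The point of the parametric form is that the exponents and coefficients are not universal constants: each axis of improvement in the paragraph above corresponds to moving one of them.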

These improvements have allowed us to see incredible gains in a short amount of time. Indeed, on many leaderboards, newer small models beat older, much larger ones. Hooker’s examples are striking: Falcon 180B (2023) was outperformed by Llama-3 8B (2024) just one year later.3 The Aya 23 8B and Aya Expanse 8B models easily outperform BLOOM 176B despite having only 4.5% of the parameters.4

But Is Scaling Actually Plateauing?

However, despite the essay’s titular claim, we see no evidence that “scaling” itself is actually plateauing. We do not know the sizes of closed frontier models, but we do know that increasingly large open-source frontier models continue to be published and remain ahead of smaller ones from the same generation.

It would be easy to shout “skill issue” here, but that misses the point. The real issue is the lack of information, particularly for academia. The research community has become fragmented due to the lack of knowledge sharing, a trend Hooker herself documents and laments.5

Scaling vs. Scaling Laws: A Crucial Distinction

At the same time, we must not conflate “scaling” with “scaling laws.” Scaling laws are fundamentally about legibility.6 To train big models, we need their performance to be predictable, and for that, we must work hard to engineer scaling laws that hold.

I say “engineer” deliberately, because, as noted by Hooker, scaling laws require controlled environments: specific weight initialization, optimizer settings, and many other factors. Maintaining this legibility requires both research and engineering effort. When Hooker cautions that scaling laws have “only been shown to hold when predicting a model’s pre-training test loss” and break down for downstream tasks,7 she’s mainly identifying an interesting area of future research: how to establish legibility for downstream tasks. It is also likely a research direction whose fruits closed frontier labs will publish sparingly.
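In practice, "engineering a scaling law" often means fitting a power law to a ladder of small runs and extrapolating before committing a large budget. Here is a minimal sketch of that workflow; the compute values and losses are hypothetical, invented for illustration.

```python
import math

# Hypothetical pre-training losses from four small-scale runs.
compute = [1e18, 1e19, 1e20, 1e21]   # training FLOPs
loss = [3.10, 2.75, 2.45, 2.20]

# Fit L(C) ~ a * C^b by ordinary least squares in log-log space.
xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

# Extrapolate to a 10x larger run before spending the compute.
predicted = a * (1e22) ** b
```

When the fitted points sit cleanly on a line in log-log space, the extrapolation is trustworthy; when the environment is not controlled (different data mix, optimizer, or initialization across runs), the fit degrades, which is exactly Hooker's point.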

The Economics of Predictability

While the essay invokes Warren Buffett’s advice “Don’t ask the barber if you need a haircut” when warning against trusting too much in scaling laws, there’s a compelling counter-argument: scaling laws are “cheap” in human research hours precisely because they are low in surprises. For exactly that reason, scaling compute currently trades off well against other axes of improvement. When you can predict with reasonable confidence what another 10x of compute will buy you, the investment becomes much easier to make.

I’d also note, somewhat cheekily, that one should not ask someone at a likely compute-constrained lab whether compute scaling is hitting a wall: you already know what they’ll say.8

The Tension Between Engineering and Research

More seriously, this points to a real tension between engineering and research. A lot of what is done at the frontier of LLMs is closer to R&D engineering than traditional green-field research, precisely because of the benefits that come from legibility and predictability. This is very different from the free-form research deep learning enjoyed in the 2010s, when a graduate student with a clever idea and a single GPU could sometimes change the field. That is still possible, but the diffusion of ideas takes longer and requires much more effort now that the field has grown considerably larger.

Where I Agree: Small Models Are Remarkably Better

Small models are unbelievably better than they were even a short while ago. The Open LLM Leaderboard data Hooker presents is genuinely striking. The systematic pattern of smaller models outperforming larger ones over time is real and important to be aware of.9

For context: when GPT-4 launched in March 2023, it scored 86.4% on MMLU and was hailed as a breakthrough. Today, Qwen3-4B, a 4 billion parameter model that comfortably runs on a single laptop, scores 83.7% on MMLU-Redux, nearly matching GPT-4’s flagship result with roughly 400× fewer parameters. On harder benchmarks, the gap inverts: Qwen3-4B achieves ~97% on MATH-500 (vs. GPT-4’s ~52%) and 55.9% on GPQA Diamond (vs. GPT-4’s 35.7%).10

If this trajectory continues, we will soon have models small enough for individuals to run locally yet capable enough to generate substantial real-world value.

But in my view, bigger models will still perform better at the frontier and will remain both necessary and useful. The examples of small models beating large ones are all cases of newer small models beating older large ones. At any given moment in time, with the same algorithmic improvements applied, scale still appears to help.11

Consider the evidence: within the Llama 3.1 family, the 70B model consistently outperforms the 8B across all benchmarks, and the 405B outperforms the 70B (though with diminishing returns at the top end).12 Even heavily quantized versions of larger models tend to beat unquantized smaller ones. The pattern is robust.

What about cases where smaller frontier models match or exceed larger ones? Here the mechanism matters. Google’s technical report confirms that Gemini 2.5 Flash (and earlier Flash models) are explicitly distilled from Pro: the smaller model learns from the larger one’s “next token prediction distribution” during training.13 When Claude 3.5 Sonnet exceeded Claude 3 Opus on many benchmarks, Anthropic’s documentation described it as an “evolution” with improvements in “training and elicitation techniques” but did not confirm distillation.14 The diminishing returns are real, but they may not demonstrate that scale has stopped mattering; they may instead demonstrate that smaller models can inherit capabilities from larger ones through techniques such as distillation.
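The "next token prediction distribution" objective mentioned above is, at its core, a KL divergence between teacher and student distributions. A toy sketch, with made-up distributions over a 4-token vocabulary (not any lab's actual training recipe):

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): the quantity a distilled student is trained
    to minimize when matching the teacher's next-token distribution."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

# Toy next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student_before = [0.40, 0.30, 0.20, 0.10]
student_after = [0.65, 0.22, 0.07, 0.06]

# Training drives the student's distribution toward the teacher's,
# so the divergence shrinks.
```

The key property for the argument here: the student gets a dense training signal (a full distribution per token) rather than a single hard label, which is one plausible way a smaller model can absorb capabilities that originally required scale to discover.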

The Path Forward

Yet innovation is clearly needed, both to improve scaling efficiency further and to move beyond scaling entirely. The essay helpfully points to several promising directions: data synthesis and treating the data space as malleable rather than fixed, prompt-only gradient-free approaches that shift compute from training to inference, and more holistic approaches to human-AI interaction.15

And, of course, continual learning, which Hooker identifies as a key limitation of current transformer architectures.16 Richard Sutton, author of the original “Bitter Lesson” that Hooker engages with, made a related point in his September 2025 interview with Dwarkesh Patel: LLMs learn from imitation of human data rather than from experience, and this may be their fundamental scaling limitation. As Sutton put it, “The scalable method is you learn from experience. You try things, you see what works. No one has to tell you.”17 If models could learn continuously from experience without catastrophic forgetting, the economics of scaling would shift dramatically.

Conclusion

Hooker’s essay is valuable precisely because it forces us to articulate what we actually believe about the relationship between compute and capabilities.

My view is that the evidence supports a more nuanced position than “scaling is dying”: scaling continues to work. Gains remain monotonic in compute, even if subadditive, and we keep being surprised by what larger models can do. However, scaling is not the only game in town; it never was. There have been continuous algorithmic innovations throughout the scaling era since the invention of the Transformer, including Mixture-of-Experts, FlashAttention, and many more. No one is betting exclusively on scale while ignoring other axes of improvement; that would be a strategic mistake.

This essay is an expanded and improved version of an earlier Twitter thread. Claude 4.5 Opus was used to expand and polish the text.