Sara Hooker’s essay “On the Slow Death of Scaling” (2026) is a thought-provoking piece that deserves a careful read.1 What follows is my interpretation (or perhaps more accurately, a riff) on her arguments, where I’ll both steelman her position and push back where I think the evidence points elsewhere.

Scaling Laws Require Controlled Environments

Scaling laws hold very well in controlled environments, but any individual scaling law depends on the precise setup under which it was fit. More generally, we have multiple axes of improvement available to us: data quality, model architecture, optimization techniques, and other algorithmic advances.2 Each of these can shift or bend a scaling law and significantly improve performance.
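One way to make "shifting and bending" concrete is a Chinchilla-style parametric loss law. The sketch below uses the coefficients published by Hoffmann et al. for illustration; treating better data as a smaller data-term coefficient is a purely illustrative assumption, not a claim from Hooker's essay.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss under a Chinchilla-style power law:
    L(N, D) = E + A / N^alpha + B / D^beta.
    Default coefficients are the published Chinchilla estimates."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling the model 10x on the same data lowers predicted loss,
# but with diminishing returns.
base = chinchilla_loss(8e9, 2e12)
scaled = chinchilla_loss(80e9, 2e12)

# An algorithmic improvement (say, better data quality) can be modeled
# as bending the law itself: here, a smaller B coefficient (illustrative).
better_data = chinchilla_loss(8e9, 2e12, B=300.0)
```

The point of the parametric form is that the exponents and coefficients are not universal constants: each axis of improvement in the paragraph above corresponds to moving one of them.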

These improvements have allowed us to see incredible gains in a short amount of time. Indeed, on many leaderboards, newer small models beat older, much larger ones. Hooker’s examples are striking: Falcon 180B (2023) was outperformed by Llama-3 8B (2024) just one year later.3 The Aya 23 8B and Aya Expanse 8B models easily outperform BLOOM 176B despite having only 4.5% of the parameters.4

But Is Scaling Actually Plateauing?

However, despite the essay’s titular claim, we see no evidence that “scaling” itself is actually plateauing. We do not know the sizes of closed frontier models, but we do know that increasingly large open-source frontier models continue to be published and remain ahead of smaller ones from the same generation.

It would be easy to shout “skill issue” here, but that misses the point. The real issue is the lack of information, particularly for academia. The research community has become fragmented due to the lack of knowledge sharing, a trend Hooker herself documents and laments.5

Scaling vs. Scaling Laws: A Crucial Distinction

At the same time, we must not conflate “scaling” with “scaling laws.” Scaling laws are fundamentally about legibility.6 To train big models, we need their performance to be predictable, and for that, we must work hard to engineer scaling laws that hold.

I say “engineer” deliberately, because, as noted by Hooker, scaling laws require controlled environments: specific weight initialization, optimizer settings, and many other factors. Maintaining this legibility requires both research and engineering effort. When Hooker cautions that scaling laws have “only been shown to hold when predicting a model’s pre-training test loss” and break down for downstream tasks,7 she’s mainly identifying an interesting area of future research: how to establish legibility for downstream tasks. It is also likely a research direction whose fruits closed frontier labs will publish sparingly.
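In practice, "engineering a scaling law" often means fitting a power law to a ladder of small runs and extrapolating before committing a large budget. Here is a minimal sketch of that workflow; the compute values and losses are hypothetical, invented for illustration.

```python
import math

# Hypothetical pre-training losses from four small-scale runs.
compute = [1e18, 1e19, 1e20, 1e21]   # training FLOPs
loss = [3.10, 2.75, 2.45, 2.20]

# Fit L(C) ~ a * C^b by ordinary least squares in log-log space.
xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

# Extrapolate to a 10x larger run before spending the compute.
predicted = a * (1e22) ** b
```

When the fitted points sit cleanly on a line in log-log space, the extrapolation is trustworthy; when the environment is not controlled (different data mix, optimizer, or initialization across runs), the fit degrades, which is exactly Hooker's point.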

The Economics of Predictability

While the essay invokes Warren Buffett’s advice “Don’t ask the barber if you need a haircut” when warning against trusting too much in scaling laws, there’s a compelling counter-argument: scaling laws are “cheap” in human research hours precisely because they are low in surprises. For exactly that reason, scaling compute currently trades off well against other axes of improvement. When you can predict with reasonable confidence what another 10x of compute will buy you, the investment becomes much easier to make.

I’d also note, somewhat cheekily, that one should not ask someone at a likely compute-constrained lab whether compute scaling is hitting a wall: you already know what they’ll say.8

The Tension Between Engineering and Research

More seriously, this points to a real tension between engineering and research. A lot of what is done at the frontier of LLMs is closer to R&D engineering than traditional green-field research, precisely because of the benefits that come from legibility and predictability. This is very different from the free-form research deep learning enjoyed in the 2010s, when a graduate student with a clever idea and a single GPU could sometimes change the field. That is still possible, but the diffusion of ideas takes longer and requires much more effort now that the field has grown considerably larger.

Where I Agree: Small Models Are Remarkably Better

Small models are unbelievably better than they were even a short while ago. The Open LLM Leaderboard data Hooker presents is genuinely striking. The systematic pattern of smaller models outperforming larger ones over time is real and important to be aware of.9

For context: when GPT-4 launched in March 2023, it scored 86.4% on MMLU and was hailed as a breakthrough. Today, Qwen3-4B, a 4 billion parameter model that comfortably runs on a single laptop, scores 83.7% on MMLU-Redux, nearly matching GPT-4’s flagship result with roughly 400× fewer parameters. On harder benchmarks, the gap inverts: Qwen3-4B achieves ~97% on MATH-500 (vs. GPT-4’s ~52%) and 55.9% on GPQA Diamond (vs. GPT-4’s 35.7%).10

If this trajectory continues, we will soon have models small enough for individuals to run locally yet capable enough to generate substantial real-world value.

But in my view, bigger models will still perform better at the frontier and will remain both necessary and useful. The examples of small models beating large ones are all cases of newer small models beating older large ones. At any given moment in time, with the same algorithmic improvements applied, scale still appears to help.11

Consider the evidence: within the Llama 3.1 family, the 70B model consistently outperforms the 8B across all benchmarks, and the 405B outperforms the 70B (though with diminishing returns at the top end).12 Even heavily quantized versions of larger models tend to beat unquantized smaller ones. The pattern is robust.

What about cases where smaller frontier models match or exceed larger ones? Here the mechanism matters. Google’s technical report confirms that Gemini 2.5 Flash (and earlier Flash models) are explicitly distilled from Pro: the smaller model learns from the larger one’s “next token prediction distribution” during training.13 When Claude 3.5 Sonnet exceeded Claude 3 Opus on many benchmarks, Anthropic’s documentation described it as an “evolution” with improvements in “training and elicitation techniques” but did not confirm distillation.14 The diminishing returns are real, but they may not demonstrate that scale has stopped mattering; they may instead demonstrate that smaller models can inherit capabilities from larger ones through techniques such as distillation.
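The "next token prediction distribution" objective mentioned above is, at its core, a KL divergence between teacher and student distributions. A toy sketch, with made-up distributions over a 4-token vocabulary (not any lab's actual training recipe):

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): the quantity a distilled student is trained
    to minimize when matching the teacher's next-token distribution."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

# Toy next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.05]
student_before = [0.40, 0.30, 0.20, 0.10]
student_after = [0.65, 0.22, 0.07, 0.06]

# Training drives the student's distribution toward the teacher's,
# so the divergence shrinks.
```

The key property for the argument here: the student gets a dense training signal (a full distribution per token) rather than a single hard label, which is one plausible way a smaller model can absorb capabilities that originally required scale to discover.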

The Path Forward

Yet innovation is clearly needed, both to improve scaling efficiency further and to move beyond scaling entirely. The essay helpfully points to several promising directions: data synthesis and treating the data space as malleable rather than fixed, prompt-only gradient-free approaches that shift compute from training to inference, and more holistic approaches to human-AI interaction.15

And, of course, continual learning, which Hooker identifies as a key limitation of current transformer architectures.16 Richard Sutton, author of the original “Bitter Lesson” that Hooker engages with, made a related point in his September 2025 interview with Dwarkesh Patel: LLMs learn from imitation of human data rather than from experience, and this may be their fundamental scaling limitation. As Sutton put it, “The scalable method is you learn from experience. You try things, you see what works. No one has to tell you.”17 If models could learn continuously from experience without catastrophic forgetting, the economics of scaling would shift dramatically.

Conclusion

Hooker’s essay is valuable precisely because it forces us to articulate what we actually believe about the relationship between compute and capabilities.

My view is that the evidence supports a more nuanced position than “scaling is dying”: scaling continues to work. Gains remain monotonic in compute, even if subadditive, and we keep being surprised by what larger models can do. However, scaling is not the only game in town; it never was. There have been continuous algorithmic innovations throughout the scaling era since the invention of the Transformer, including Mixture-of-Experts, FlashAttention, and many more. No one is betting exclusively on scale while ignoring other axes of improvement; that would be a strategic mistake.

This essay is an expanded and improved version of an earlier Twitter thread. Claude 4.5 Opus was used to expand and polish the text.