HOISTED/CROSSPOST: Richard S. Sutton: The Bitter Lesson
The 2019 note on the power of computation: revisiting Sutton’s Bitter Lesson in the age of GPT MAMLMs, which have proven the Bitter Lesson much truer than Sutton would have dared imagine in areas broadly describable as “search” & “learning”, where very big-data, high-dimension, flexible-function, & simulation-exploring methods prove their mettle. But are there limits &, if so, where are they? We need high-quality thought to explore what the Bitter Lesson means now…
Lots of pieces or patterns of information or things about the world that you want to finely classify so that you can then evaluate and take action. It is a big world out there—a very big-data world. And you would want to have the potential to group its features in many different ways—a very high-dimensional analysis. Moreover, some small differences make important differences for evaluation and then action, while other large differences do not—flexible-function analysis. Plus there is the potential for large-scale, large-breadth simulations of systems as a way of exploring problem spaces: what shows up in computer chess- and go-playing programs as improvement via “self-play”, constructing new data about what would have happened in situations that were never seen in real life, but might have been.
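To make that self-play point concrete, here is a minimal sketch, entirely my own illustration rather than anything from Sutton or from a real game engine, of manufacturing value-function training data by having a uniformly random policy play tic-tac-toe against itself. Every position visited gets labeled with the game’s eventual outcome, so the table ends up containing evaluations of positions that no human game ever supplied.

```python
import random
from collections import defaultdict

# Tic-tac-toe board: a sequence of 9 cells, each 'X', 'O', or ' '.
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game():
    """Play one game with a uniformly random policy; return (states, outcome)."""
    board = [' '] * 9
    states = []
    player = 'X'
    while True:
        moves = [i for i, cell in enumerate(board) if cell == ' ']
        if not moves:
            return states, 0.0          # board full, no winner: a draw
        board[random.choice(moves)] = player
        states.append(tuple(board))
        if winner(board) == 'X':
            return states, 1.0          # outcome scored from X's point of view
        if winner(board) == 'O':
            return states, -1.0
        player = 'O' if player == 'X' else 'X'

# Crude value estimate: average final outcome over every visit to each state.
totals, counts = defaultdict(float), defaultdict(int)
for _ in range(20_000):                 # more games = more manufactured data
    states, outcome = self_play_game()
    for s in states:
        totals[s] += outcome
        counts[s] += 1

value = {s: totals[s] / counts[s] for s in totals}
print(f"distinct positions evaluated: {len(value)}")
```

A real system would, of course, swap the random policy for one guided by the current value estimate and replace the lookup table with a learned function approximator; the point here is only the shape of the data-generation loop.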
It is for these kinds of problems that improvements in computation produce powerful and effective improvements in what computers can do. That is the bitter lesson: with enough computational power available, general-purpose methods will find ways of attacking problems that were foreclosed by premature commitments to human thought-simulacrum strategies for how to look for patterns. With enough computational power, that is. But you can bet there will be more in five years. And for situations like, or close enough to, search and learning, for which very big-data, high-dimension, flexible-function, simulation-exploring classification and evaluation analyses are well-suited, Rich Sutton’s Bitter Lesson certainly does hold: “general methods that leverage computation are ultimately the most effective, and by a large margin”. Human-centric methods based on human priors about how to attack problems often quickly plateau, while more general-purpose approaches amenable to easy scaling-up do not.
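To keep the “search” half of that claim as concrete as the self-play sketch above: below is a toy sketch, again my own and not Sutton’s, of depth-limited negamax on the take-one-two-or-three-stones game of Nim. It embodies no knowledge of the game beyond its rules; all it can do is spend computation, and positions it cannot settle at a shallow depth it settles once allowed to look deeper.

```python
def best_move(stones, max_depth):
    """Depth-limited negamax for Nim: players alternately take 1-3 stones,
    and whoever takes the last stone wins.  Returns (value, move) for the
    player to move: +1 = forced win found within the depth limit,
    -1 = forced loss within the limit, 0 = undecided at this depth."""
    if stones == 0:
        return -1, None        # previous player took the last stone: we lost
    if max_depth == 0:
        return 0, None         # out of computation: call it unknown
    best_value, best = -2, None
    for take in (1, 2, 3):
        if take <= stones:
            value, _ = best_move(stones - take, max_depth - 1)
            value = -value     # what is good for the opponent is bad for us
            if value > best_value:
                best_value, best = value, take
    return best_value, best

# More search depth (more computation) turns "undecided" into definite answers.
for depth in (6, 11, 21):
    value, move = best_move(21, depth)
    print(f"depth {depth:2d}: value {value:+d}, best take = {move}")
```

The depth parameter is standing in for “how much computation you can afford”; that is the whole trick.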
But what are the limits of that set of problems analogous enough to search and to learning? And what should we do in order to successfully attack other problems when and where this latest set of advances in GPT MAMLM—General-Purpose Transformer Modern Advanced Machine-Learning Model—simulated cognition methods run into diminishing returns?
For 70 years, many of “AI”’s leaps have come from embracing scale, not baking in human understanding. Sutton’s insight—compute plus general methods—keeps winning. But within what problem-space bounding frontier? Identifying the frontier and designing tools for what lies beyond may produce a new and different Bitter Lesson.
A great deal, indeed, hangs on our figuring out good answers to such questions.
Rich Sutton back in 2019, complete and entire:
Rich Sutton (2019): The Bitter Lesson <http://www.incompleteideas.net/IncIdeas/BitterLesson.html>: ‘The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation.
Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation.
There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent:
[1] In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that “brute force” search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.
[2] A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self-play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion).
Learning by self-play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research.
In computer Go, as in computer chess, researchers’ initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.
[3] In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field.
The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.
[4] In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run.
The bitter lesson is based on the historical observations that:
1) AI researchers have often tried to build knowledge into their agents,
2) this always helps in the short term, and is personally satisfying to the researcher, but
3) in the long run it plateaus and even inhibits further progress, and
4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
One [general] thing that should be learned from the bitter lesson is the great power of general-purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.
Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done…
References:
Manik, Ismail Ali. 2025. “Richard Sutton contends…” Twitter, September 26. <https://x.com/iamaniku/status/1971642450835067381>.
Marcus, Gary. 2025. “Game over for pure LLMs. Even Turing Award Winner Rich Sutton has gotten off the bus”. Marcus on AI, September 26. <https://garymarcus.substack.com/p/game-over-for-pure-llms-even-turing>.
Patel, Dwarkesh. 2025. “Richard Sutton, father of reinforcement learning…” Twitter, September 26. <https://x.com/dwarkesh_sp/status/1971606180553183379>.
Patel, Dwarkesh & Richard Sutton. 2025. “Richard Sutton—Father of RL thinks LLMs are a dead end”. Dwarkesh Podcast, September 26. <https://www.youtube.com/watch?v=21EYKqUsPfg>.
Sutton, Rich. 2019. “The Bitter Lesson”. Incomplete Ideas, March 13. <http://www.incompleteideas.net/IncIdeas/BitterLesson.html>.



It seems to me that the "bitter lesson" is a particular example of a more general principle, which you might call "make it up in volume crossed with less is more". It underlies the industrial revolution, for example: early machine-made artifacts were usually inferior to their hand-made counterparts, but they were so vastly cheaper that one could have many, many more machine-made artifacts than hand-made. The technology has since developed to the point that machines can make artifacts that we could never duplicate by hand.
The particular example that fascinates me, even more than AI, is the alphabet (the subject of the Logarithmic History post for 27 September). Doug Jones:
"To any Egyptian scribes who saw it, this script must have looked like a laughably dumbed-down version of hieroglyphs, stripping signs of their meanings, then painfully spelling out with a whole string of characters what hieroglyphs often managed in a single sign or two."
[...]
"Logographic writing systems like Egyptian hieroglyphs were devised multiple times: in Mesopotamia, China, and Meso-America. The alphabet however, was invented just once."
Just because an idea is simple and destined for outstanding success doesn't mean it is easy or obvious.
Gary Marcus was gleeful that Sutton has disavowed the idea that "infinite scaling" will solve all AI (LLM) issues. That may be true for the LLM approach, but it doesn't mean that scaling won't work for a different AI approach.
Consider a much simpler pair of technologies: expert systems and decision trees. Rule-based systems, common in the symbolic-AI era, started with hand-crafted expert systems. They proved brittle and failed to capture nuances, and so they went into the dustbin of AI techniques, although they are still the core of 1980s books on Prolog. Then the focus shifted to classification: what decision to take (yes/no, and which choice) and how to classify objects (in or out of a class, and which class). The ID3 and later CART algorithms were core to a machine-learning approach that leveraged the computational effort of computers: provide a table of variable states/values together with the outcome/classification, and the computer would very quickly build a pruned decision tree. It wasn't very fast on old Intel 8088 CPUs, but today a decent PC can produce a result in less than a second for tables of hundreds of rows and tens of variables.

But decision trees were still somewhat brittle. Then increased computational power allowed the algorithm to be run over many altered input tables (leave out some variables or some rows, repeat the algorithm many times, and take the majority classification): the Random Forest. This proved far better, was still very fast, and gave results more comparable with ANNs, which remained computationally heavyweight. Random Forests remain my favoured ML approach for a quick check of whether the data has structure. They also remind me of Calvin's "The Cerebral Code" on how the brain (cerebrum) works.
But clearly, Random Forests have limits on the problems they can solve. Multilayer perceptrons (ANNs) are similar: great at classifying, but not at other tasks. Transformer and attention architectures provided the basics of Large Language Models (LLMs), enabling computers to handle language. They have proven remarkably versatile in solving, at least partially, some problems that were unexpected. But now scaling suggests that further improvements sit at the upper bound of the logistic curve of the technology's problem-solving ability. As Marcus opines, LLMs do not understand the world, and never will. This doesn't mean scaling is no longer generally operative, just that scaling will no longer work for this particular technology. LeCun has already indicated that we need a new AI technology to go further; Hassabis, focusing more on scientific problems, wants more neurosymbolic hybrid technology. We don't yet know what the next breakthrough will be, but when we get it, scaling will likely start the next logistic (S-curve) of gains.
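For readers who want the "Transformer and attention" phrase made concrete, here is a minimal numpy sketch of single-head scaled dot-product attention, the core operation those architectures scale up. It is my own toy illustration, not code from any particular LLM; the embeddings and weight matrices are random stand-ins.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each query position mixes the value vectors,
    weighted by how well its query matches every key (softmax of dot products)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (n_q, d_v) weighted mix

# Toy example: 4 token positions, model width 8 (all values hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                     # (4, 8)
```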