Minimizing Loss ≠ Maximizing Intelligence
(Cross-posted from my Substack; written as part of the Halfhaven virtual blogging camp.)
Many speculate about the possibility of an AI bubble by talking about past progress, the economy, OpenAI, Nvidia, and so on. But I don’t see many people looking under the hood to examine whether the actual technology itself looks like it’s going to continue to grow or flatline. Many now realize LLMs may be a dead end, but optimism persists that one clever tweak of the formula might get us to superintelligence. But I’ve been looking into the details of this AI stuff more lately, and it seems to me that there’s a deeper problem: self-supervised learning itself.
Here’s how supervised learning with gradient descent works, by my understanding:
Give the neural network some input, and it returns some output.
We score how “bad” the output is.
We update the model’s weights in directions that would have produced less bad output, making it less bad next time.
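The three steps above can be sketched as a minimal gradient-descent loop. This is a toy of my own construction, not any real training stack: a single scalar weight, with squared error as the “badness” score.

```python
# Toy supervised learning: one weight w, squared-error "badness" score.
def train(inputs, targets, w=0.0, lr=0.01, epochs=100):
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            pred = w * x                # 1. model produces an output
            # 2. loss = (pred - y)**2 scores how bad the output is
            grad = 2 * (pred - y) * x   # 3. direction of "less bad"
            w -= lr * grad              # nudge the weight that way
    return w

# Learning y = 3x from three examples; w converges toward 3.
w = train([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
```

Everything downstream, from AlphaGo Zero to LLM pretraining, is an elaboration of this loop; what varies is how step 2 gets its “badness” signal.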
This works great when you can judge badness reliably. AlphaGo Zero used a cleverly designed oracle to evaluate its outputs, essentially comparing the move the model preferred against the stronger move found by its own self-play search. But modern LLMs work differently. We have them complete a snippet of training data, and compare their output with the real completion. This is called self-supervised learning. By training the model this way, we minimize loss with respect to the training data, thereby creating an AI model that’s really good at predicting the next token of any snippet of training data, and hopefully of other similar data.
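To make “minimize loss with respect to the training data” concrete, here is a toy sketch of my own (a bigram count “model” standing in for a neural network): the self-supervised loss is just the average negative log-probability the model assigns to the token that actually came next.

```python
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat the cat sat".split()

# "Model": bigram counts turned into next-token probabilities.
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def p_next(prev, token):
    total = sum(counts[prev].values())
    return counts[prev][token] / total if total else 0.0

# Self-supervised loss: average negative log-probability of the
# token that actually came next in the training data.
loss = -sum(math.log(p_next(prev, nxt))
            for prev, nxt in zip(text, text[1:])) / (len(text) - 1)
```

Note that nothing in this objective says which bigrams are worth predicting; the only way to drive the loss down is to model all of them.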
By doing this, we create a model which tries to remember all patterns present in the data, however arbitrary. Common patterns get prioritized because they help minimize loss more, but the only way to minimize loss is to learn as many patterns as you can. That will include some patterns humans care about, and many more we do not.
Self-supervised learning is not a blind memorizer. It does abstract and generalize. But it abstracts indiscriminately.
Here’s the problem. Let’s say I want to train an AI model that can beat any human at chess. I train it on the history of all recorded chess games, including amateur games, master games, and grandmaster games. Feed it some number of opening moves and have it predict the next move, and update the model with self-supervised learning, based on how accurately it predicts the recorded move.
Training my AI model this way, it would learn to play well. It would also learn to play poorly. It would learn the playstyle of every player in the data. It would learn to use the King’s Indian Defense if the game was played in the ’60s, but probably not if the game was in the ’90s. It would learn what I wanted, and orders of magnitude more that I didn’t care about.
The history of all recorded chess games is several gigabytes, but Stockfish, including the heuristics it uses to evaluate moves, can fit in 3–4 MB. This is at least a 1000x difference between the information we care about (some winning strategy) and the total information in the training data.
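As a back-of-the-envelope check, taking “several gigabytes” to mean about 5 GB and Stockfish at about 3.5 MB (both figures approximate):

```python
dataset_bytes = 5 * 10**9       # "several gigabytes" of game records (assumed)
stockfish_bytes = 3.5 * 10**6   # engine plus heuristics, ~3-4 MB
ratio = dataset_bytes / stockfish_bytes  # comfortably over 1000x
```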
Keep in mind that when chess officials wrote down the moves for a chess game, they were implicitly throwing away most of the data for us, like whether the pieces were made of wood or plastic, or whether so-and-so happened to cough before making a move. Not all datasets are this refined to exactly what we want the AI to learn. If you were unlucky enough to have to learn chess from videos of chess matches, the ratio of noise to important data would be like 1,000,000x or 1,000,000,000x. Yet even in the case of chess notation data, most of the information is not worth holding on to.
Now expand this from chess to every domain. Most patterns in most data will be worthless. Most patterns in reality itself are worthless. Humans discard almost all the data we perceive. Our intelligence involves discrimination. Models trained by self-supervised learning like LLMs, on the other hand, try to stuff as much of reality into their weights as possible. An LLM might know a lot about chess, since there’s a lot of chess-specific training data, but only a small amount of what it knows will be about winning chess. That’s why it’s sometimes hard to get peak performance out of an LLM. It won’t necessarily give you the best moves it can unless you tell it to pretend it’s Magnus Carlsen. It knows how to play chess kinda well, but also kinda poorly, and it doesn’t know which one you want unless you specify.
A 7-year-old child given an addition problem learns from it, but given a calculus problem, they simply ignore it. They won’t try desperately to memorize shapes of symbols they don’t understand. We remember what matters and discard the rest.
What matters depends on context and values. The wood grain pattern on my hardwood living room floor is irrelevant if I’m having a conversation about politics, but critical if I’m painting a picture of the room. It takes judgement to know what to focus on. The ability to focus is how we make sense of a very complex world. If remembering everything relevant were easy, then evolution would have let us do so. Instead, we’re forced to remember based on what we think is important.
Human intelligence is neither specialized to a single domain, nor fully general, like reality-stuffing LLMs. Human intelligence is something else. Call it specializable intelligence. We’re specialized in our ability to tactically learn new information based on our existing knowledge and values.
Some imagine superintelligence as a magical system that could play chess for the first time at a grandmaster level, having only seen the rules, deducing winning strategies through pure, brilliant logic. This is impossible. Chess is computationally irreducible. Many games must be played, whether in reality or in some mental simulation of games (or sub-game patterns). Existing knowledge of Go or checkers or “general strategy” will not really help. You can’t have an AI model that’s just good at everything. Not without a computer the size of the universe. What you want is an AI that can get good at things as needed. A specializable intelligence.
There is a tradeoff between a fully general intelligence and a specialized intelligence. The “no free lunch” theorem states that, averaged over all possible problems, every learning algorithm performs equally well, so improvements on one class of problems come at the cost of worse performance on other classes. You either stay general, or specialize in some areas at the cost of others.
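The no-free-lunch intuition can be shown in a few lines (a toy construction of my own, not the theorem’s formal statement): average any fixed predictor over every possible labeling of unseen inputs, and all predictors score the same.

```python
from itertools import product

# Two unseen binary inputs; enumerate every possible "true" labeling.
labelings = list(product([0, 1], repeat=2))  # 4 possible worlds

def avg_accuracy(predict):
    # Average accuracy of a fixed predictor over all possible worlds.
    total = 0
    for world in labelings:
        total += sum(predict(i) == world[i] for i in range(2)) / 2
    return total / len(labelings)

always_zero = lambda i: 0
alternating = lambda i: i % 2
# Both predictors, and any other, average exactly 50% across all worlds.
```

A predictor only beats chance by matching structure in some worlds, which necessarily costs it accuracy in the others; that is the tradeoff the paragraph above describes.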
This implies that, for fixed compute, a general intelligence will perform worse at the things we care about than a specialized intelligence could. Much worse, given just how much we don’t care about. Our goal should be specializable intelligence which can learn new things as needed, as well as some fundamentals humans care about often, like language, vision, logic, “common knowledge”, and so on. Creating general superintelligence would require literally astronomical compute, but specializable superintelligence would be far cheaper.[1]
Reality-stuffed general models that don’t discriminate in what they learn will never lead to superintelligent AI. Whatever superintelligence we achieve will not be general with respect to its training data. The chess example before was a contrived one. Keep in mind that we have a lot of good data for chess, and that chess is much less computationally complex than many tasks we care about.[2] An LLM might conceivably play chess well by overfitting to chess, but it won’t have similar performance on novel games similar to chess, and it will be helpless at more complex tasks.
Here are some approaches to AI that I’d guess can’t get us to superintelligent AI:
Just increasing compute. Diminishing returns (in useful capabilities) will set in. Loss may decrease predictably, but scaling laws measure the wrong objective.
Higher quality data. This will help, practically speaking, but most of the information in even really high quality data is going to be worthless/discardable. Imagine you cleaned up a chess dataset. You only included grandmaster games, for example. That’s still way more data than the Stockfish heuristics. Preparing “good” data is equivalent to extracting patterns you care about from that data, which in the limit requires the intelligence you’re trying to create.
Synthetic data. This boils off some noise from the original dataset, essentially creating a higher quality dataset with hopefully less information you don’t care about. Hopefully. But that’s all you’re doing.
Curriculum learning. When you heard about the 7-year-old who learned from the addition problem but ignored the calculus problem, you might have thought the solution to this whole problem was to order the data so that harder information comes after its easier prerequisites. This won’t work, because the model is still being evaluated on completing the training data, so it still has to memorize whatever patterns are in the data, even ones we don’t care about. Maybe it’ll learn more quickly, but it’s what it’s learning that’s the problem. It may also lead to more unified internal world models, which is good, but not great if those world models are of things we don’t even care about.
Using another smaller LLM as an evaluator. Using a small model to judge how good or bad the output of a larger model-in-training is based on some metric humans care about won’t work, because it’s limited by the intelligence of the smaller model.
RLHF (reinforcement learning from human feedback): The model is already stupid by the time you apply RLHF. It’s constrained by the abstractions already learned.
Transformers and “attention”: Paying attention to different parts of a sentence when processing a token, and only paying attention to certain patterns humans care about in the data, both use the word “attention”, but they have nothing to do with each other. The model will still be penalized if it fails to predict the next token in the training data, which is a task that inherently requires memorizing a bunch of information humans don’t care about. Any architecture trained with respect to this goal will fail to scale to superintelligent AI. You might think that LLMs are already kind of specializable, because they can do “in-context learning” without any weight updates. But models think with their weights. The depth of thinking you can do in a domain without any learned patterns in the weights is limited. The whole point of the weights is to store abstractions so you can reason with them later. Depriving the model of the ability to do this makes it much stupider.
Neuro-inspired models with Hebbian learning. (Hebbian = “neurons that fire together wire together”, basically if neuron A firing leads to neuron B firing, the connection between the two is strengthened, as in the human brain). Even with more sophisticated stuff like spike-timing-dependent plasticity, the problem is that Hebbian learning reinforces whichever thought patterns already occur, but doesn’t teach the model to care about certain things.
Growing neural networks, making them larger as they train. If you’re using self-supervised learning, you’re still growing an idiot. I think this will make internal world models more unified as in the case of better training data ordering, but will not make the models care about only the patterns we want them to care about.
Meta-learning. Using an outer loop based on gradient descent or evolution or something, and an inner loop based on gradient descent. I read one paper where the model did expensive evolution in the outer loop to set up the initial conditions for learning. They then had the evolved models learn using gradient descent on some task. The models that learned better were then selected for the next generation of evolution. The hope was that you could evolve a model that’s predisposed to be good at learning arbitrary tasks. But it seems wasteful to me to do expensive evolution to set up the initial state of a network only to bowl over that network with backpropagation. Gradient descent minimizing loss with respect to training data will create a reality-stuffed model, regardless of the initial conditions. So you’re essentially evolving good initial conditions for an idiot.
Predictive coding: I haven’t looked into this much, but it seems like minimizing surprise is pretty similar to minimizing loss with respect to training data. Same problem: learning a bunch of patterns humans don’t care about.
Anything that improves “grokking”. The transition from memorization to understanding the underlying patterns in data is important, but this is true whether you’re trying to learn important things, like “how English works” or “how to win at chess”, or you’re trying to learn unimportant things, like “how terrible chess players tended to make mistakes in the ’70s”. Grokking is a sign that abstraction is happening, but it’s not sufficient for discriminatory intelligence.
Manually encoding human knowledge. E.g. putting human knowledge of words and phonemes into the model. The bitter lesson is still bitter.
Online learning. This is necessary, but not sufficient for superintelligence. A general, reality-stuffing model with online learning will be trying to cram way too much information to be as smart as we want it to be.
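The plain Hebbian rule mentioned in the list above is easy to sketch (my toy version; real spike-timing-dependent plasticity is more elaborate). The problem is visible directly in the code: the update strengthens whatever co-activation happened to occur, with no term saying which correlations matter.

```python
# Hebbian update: if pre- and post-synaptic neurons fire together,
# strengthen the connection ("fire together, wire together").
def hebbian_step(w, pre, post, lr=0.1):
    return [[w[i][j] + lr * pre[i] * post[j] for j in range(len(post))]
            for i in range(len(pre))]

w = [[0.0, 0.0], [0.0, 0.0]]
w = hebbian_step(w, pre=[1, 0], post=[1, 1])
# Only connections from the neuron that fired (pre[0]) are strengthened.
# Nothing in the rule encodes which co-firings were worth learning.
```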
I don’t know what approaches could be more promising. Evolution of neuro-inspired models could work. We have at least one working example: us. Evolution gave humans basic architecture and values that tell us what information we “should” pay attention to and care about. Then, during our lifetimes, Hebbian learning lets us learn specific knowledge in accordance with these values. Unfortunately, evolution is just very expensive. Is there a cheaper way forward? Probably, but I have no idea what it is.
One thing to keep in mind is that any more promising approach will necessarily lose the loss minimization game. Yet currently, “conventional approaches” are a gold standard to which other more experimental approaches are compared. If a new method can’t predict the next token of training data better than the conventional approach, it’s reported as a failure — or perhaps as “only slightly better than” the conventional approach, to satisfy the publication demands of academia.
This heuristic cannot stand. We don’t want general loss minimization with respect to training data. We want capability. Performance on novel games could be a valid benchmark. It could also be used during training. You’d first create specializable intelligence that can learn arbitrary games, then teach it specific games like “speaking English”.
Novel games could also be used to operationalize the claim that useful capabilities will plateau even as loss continues to decrease. Specifically, I’d predict that performance on computationally complex novel games (at least as complex as chess) will barely improve as newer self-supervised models are released and continue to improve at traditional benchmarks. Novel games are a good benchmark because they prevent cheating if the training data happened to contain similar problems. A sufficiently novel game is unlike anything in the training data.
Self-supervised learning can only create general models, which are limited in their capability in any domain by trying to succeed in every possible domain. The trillion dollar bet on self-supervised models will not pay off, because these general models will continue to fail exactly where we need them the most — on novel, difficult problems.
[1] François Chollet also pointed out the weakness of general intelligence, citing the “no free lunch” theorem, but he went too far, missing the specializability of human intelligence. It’s true that humans are specialized for a certain environment. Infants are born with certain reflexes and certain knowledge; for example, the fusiform face area of the brain is specialized for recognizing human faces. But even though we are partly specialized, we are also specializable. Give us any task and enough time, and we’ll outperform a random actor. For example, psychologists created objects called greebles that share a similar set of structural constraints with human faces but look totally alien. They then trained some humans to become experts at recognizing greebles, and found that the experts could reliably tell greebles apart, viewing them holistically rather than part by part. In short, as long as we can extract patterns from data, and use those patterns to further refine our search for more patterns, we can do anything.
[2]
I understand why you doubt that anything is more data-efficient or compute-efficient than the human brain alone. The problem is that the AIs are raised on far more compute and training data. As I commented on Yudkowsky’s attempt to explain the danger of ASI, the problem with ASI is that it has far more training data and compute than a human, so even an algorithm hundreds of times weaker than a human’s would still let the AI learn more about the world than any human knows.
Returning to the proposals which you list as failing to create the ASI:
Just increasing compute. Scaling laws measure the wrong objective because loss is far easier to measure than benchmark performance, which is nonzero only for sufficiently large models.
Higher quality data. Suppose that you would like the model to predict the idea behind the next set of tokens (e.g. via the recent CALM technique, where the LLM generates an entire sequence of tokens). Then one can, for example, ask a not-so-intelligent LLM to check whether the student’s idea and the real idea are the same.
Synthetic data. Synthetic datasets could also allow things like models trying to solve problems, then being trained on successful solutions.
Curriculum learning. Agreed.
Using another smaller LLM as an evaluator. The problem is that it’s far easier to check that a proof is false than to generate a true proof. Consider the P=NP conjecture, which is widely believed to be false: it would mean that for any problem whose solutions can be checked in polynomial time (e.g. verifying that an assignment of variables satisfies every clause of a 3-SAT formula, which is done just by substituting the values), a solution can also be generated in polynomial time.
So a smaller LLM rejecting the bigger LLM’s outputs would likely teach the latter to become more intelligent than the smaller model.
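The verification asymmetry is easy to see in code: checking a candidate 3-SAT assignment is a single linear scan, while finding one is the famously hard direction. A minimal sketch, with my own encoding of clauses as signed 1-based variable indices:

```python
# Checking a candidate solution is cheap: substitute the values and
# evaluate every clause. Generating a solution is the hard direction.
def satisfies(formula, assignment):
    # formula: list of clauses; each clause is a list of signed variable
    # indices (positive = variable, negative = its negation), 1-based.
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in formula)

# (x1 or not x2) and (x2 or x3)
formula = [[1, -2], [2, 3]]
assignment = {1: True, 2: False, 3: True}
```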
RLHF ???
Transformers and “attention”: Mostly correct. Alas, SOTA LLMs are capable of things like solving IMO problems using only their pre-set weights and a chain of thought in which they write out their reasoning. We have yet to rule out possibilities like neuralese recurrence, in which a larger part of the model’s state is carried between steps.
Online learning. Having models learn online means either that they modify only a small part of their weights, or that the fast-access long-term memory required to implement the models reaches tens of gigabytes per user, which is infeasible.
While I failed to understand why the rest of the proposals wouldn’t scale to ASI, this conclusion requires a more delicate analysis, like the ones by Toby Ord and me.
Thanks for your detailed response. I agree that if we have enough data/compute, we could overcome the data/compute inefficiency of AI models. I suspect the AI models are so intensely data/compute inefficient that this will be very difficult though, and that’s what I tried to gesture at in my post. If I could prove it, I’d have written a white paper or something instead of a blog post, but I hoped to at least share some of my thoughts on the subject.
Some specific responses:
Just increasing compute. I agree this is why we measure loss, but that doesn’t imply that measuring loss will get us to superintelligence long-term. Also, for this: “benchmark performance, which is nonzero only for models large enough”, I think you could have benchmarks that scale with the model, like novel games that start simple, and grow more complex as the model gains capability. Either manually, or implicitly as with AlphaGo Zero.
Higher quality data. Thanks for bringing my attention to CALM, I’ll have to look into that. I don’t think using a not-so-intelligent LLM to check whether the student’s idea and the real idea are the same will work in the limit, for the same reason it would be hard to get a kindergartner to grade a high school math test, even if they had access to a correct version written by the teacher. (Assuming the test wasn’t multiple choice, or single numerical answers or something easy to verify.)
Using another smaller LLM as an evaluator. I’m definitely not against all approaches that use a smaller LLM to evaluate a larger LLM, and you’re right to push back here. In fact, I almost suggested one such approach in my “what might work” section. Narrow models like AlphaGo Zero do something like this to great effect. What I’m against specifically is asking smaller models to evaluate the “goodness” of an output, and trusting the smaller LLM to have good judgement about what is good. If it had to judge something specific and objective, that would possibly work. You want to trust the small model only for what it’s good at (parsing sentence structure/basic meaning of outputs, for example) and not what it’s bad at.
RLHF. RLHF works for what it does, but no amount of RLHF can overcome the problems with self-supervised learning I discussed in the post. It’s still a general “reality-stuffing” model. That’s all I meant.
Transformers and “attention”. I do not take benchmarks like solving the IMO seriously. These same AI models fail to solve kindergarten math worksheets, and fail at very basic problems in practice all the time. In particular, it does not seem smart to test how well a model can think by giving it problems that may require a whole lot of thinking, or very little, depending on what similar things happened to be in the training data, which we have no idea about. You mentioned P=NP. Solving problems is much easier if you already know how to solve similar-enough problems. We don’t know which similar problems a given model does or does not know how to solve, which renders the benchmark useless, unless you construct it such that we know there can’t have been anything meaningfully similar in the training data (e.g. novel games). (I am unsure whether to take FrontierMath Tier 4 a bit more seriously, because the problems seem really hard and unlikely to be similar to anything in the training data, but ideally you’d have a benchmark that works even for less difficult problems.)
As for your comment about online learning, I don’t think solving any particular task should require a model to totally reorganize its weights across the entire model; updating only a small part of the weights should be fine. An analogy to humans shows that much. I agree that having to hold onto fine-tuned partial models for users, even briefly, is more expensive than what we’re doing now, but the capability gains may eventually be worth it if non-online-learning models do plateau.