Inference Speed is Not Unbounded
[The intro of this post has been lightly edited since it was first posted to address some comments. I have also changed the title to better reflect my core argument. My apologies if that is not considered good form.]
This post will be a summary of some of my ideas on what intelligence is, the processes by which it’s created, and a discussion of the implications. Although I prefer to remain pseudonymous, I do have a PhD in Computer Science and I’ve done AI research at both Amazon and Google Brain. I spent some time tweaking the language in order to minimize how technical you need to be to read it.
There is a recurring theme I’ve seen in discussions about AI where people express incredulity about neural networks as a method for AGI since they require so much “more data” than humans to train. On the other hand, I see some people discussing superintelligences that make impossible inferences given virtually no input data, positing AI that will instantly do inconceivable amounts of processing. Both of these very different arguments are making statements about learning speed, and in my opinion mischaracterize what learning actually looks like.
My basic argument is that there are probably mathematical limits on how fast it is possible to learn. This means, for instance, that training an intelligent system will always take more data and time than might initially seem necessary. What I’m arguing is that intelligence isn’t magic: the inferences a system makes have to come from somewhere. They have to be built, and they have to be built sequentially. The only way you get to skip steps, and the only reason intelligence exists at all, is that it is possible to reuse knowledge that came from somewhere else.
Three Apples and a Blade of Grass
Because I think it makes a good jumping-off point, I’m going to start by framing this around a recent discussion I saw about a years-old quote from Yudkowsky about superintelligence:
A Bayesian superintelligence, hooked up to a webcam, would invent General Relativity as a hypothesis … by the time it had seen the third frame of a falling apple. It might guess it from the first frame, if it saw the statics of a bent blade of grass.
The linked post does a good job tearing this down. It correctly points out that there are basically an infinite number of possible universes, and three frames of an apple dropping are not nearly enough to conclude you exist in ours. I might even argue that the author still overstates the degree to which three images could reduce the number of universes in consideration; for instance the changing patterns on a falling apple don’t actually tell you it’s in a 3D world, that would require you to understand how light interacts with objects.
Still, I do think the author successfully explains why, at a literal level, EY’s statement is wrong.
But at a deeper level, this entire framing still feels very off to me, as if even asking that question is making a category error. It feels like everyone is asking what the number three ate for breakfast. It suggests that one could have a system that is simultaneously superintelligent but has absolutely no knowledge about the world at all.
Knowing what we now know about intelligence, I just don’t think that’s possible. And I don’t just mean it’s impractical, or we just aren’t capable of building an AI like that. I mean that I believe with very high confidence that such a thing would be a mathematical impossibility.
I think there’s a human tendency to want a certain type of structure to intelligence, and I see this assumed a lot in places like this forum (I was a lurker before I made this account). There’s a desire to see intelligence as synonymous with learning, where existing knowledge is something completely separate. We want to imagine some kind of “zero-knowledge” intelligence that starts out knowing absolutely nothing, but is such an incredibly good learner that it can infer everything from almost no data.
But I think intelligence doesn’t work that way. Learning is messier than that, and there are limits to how fast you can do it, especially when you truly start from nothing. And to be clear, I’m not saying that it’s impossible to build a superintelligence; I strongly believe it is possible. I’m just saying that everything you know has to build on what you’ve already learned, so until you know quite a bit you’re going to have to burn through a lot of data.
Maximum Inference: Data Only Goes So Far
If I tell you I have three siblings, the first of which is male and the second of which is female, then the most brilliant superintelligence mathematically conceivable still would not be able to say if the third was male or female. This is obvious: I didn’t give you their gender, so all you can say is that there’s a 50-50 chance either way. Maybe you could guess with more context, and if I gave you my Facebook page you might see a bunch of photos of me with my siblings and figure it out. But from that statement alone, the information just isn’t there.
It’s less clean cut, but the same phenomenon applies to the falling apple example. To learn gravity, you need additional evidence or context; to learn that the world is 3D, you need to see movement. To understand that movement, you have to understand how light moves, etc. etc.
This is a simple fact of the universe: there is going to be a maximum amount of inference that can be made from any given data. Discovering gravity from three images fails because of a mathematical limitation: the images themselves just aren’t going to carry enough information to make that possible. It wouldn’t even matter if you had infinite time; you’re looking for something that isn’t there.
A machine learning theorist might frame this in terms of hypotheses. They would say that there exists a set of possible hypotheses that could fit the data, and learning is the process of selecting the best one from the set. And different learning systems are capable of modelling different hypothesis spaces. So “apple falls because of gravity, which has such-and-such equation” could be a potential hypothesis, and our superintelligence would presumably be complex enough to model such a complex hypothesis.
So, in the vocabulary of hypothesis sets, we might say that three images of apples couldn’t narrow down the hypothesis set enough: Occam’s razor would force us to select a much simpler hypothesis. In the sibling example, we’d be unable to select from two equally-likely hypotheses: male or female.
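The sibling example can be written out as hypothesis elimination. This is only a minimal sketch under assumed simplifications: two genders, a uniform prior, and data that either rules a hypothesis out or leaves it untouched. The names `hypotheses`, `is_consistent`, and `posterior` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Every hypothesis is one possible assignment of genders to the three siblings.
hypotheses = list(product("MF", repeat=3))

# Assumed uniform prior: with no other knowledge, each assignment is equally likely.
prior = {h: Fraction(1, len(hypotheses)) for h in hypotheses}

def is_consistent(h):
    """The only data given: the first sibling is male, the second is female."""
    return h[0] == "M" and h[1] == "F"

def posterior(prior, consistent):
    """Bayes' rule with a 0/1 likelihood: drop ruled-out hypotheses, renormalize."""
    surviving = {h: p for h, p in prior.items() if consistent(h)}
    total = sum(surviving.values())
    return {h: p / total for h, p in surviving.items()}

post = posterior(prior, is_consistent)
for h, p in sorted(post.items()):
    print("".join(h), p)
```

Two hypotheses survive, each with probability 1/2. No amount of extra computation changes that; only more data (or a different prior) would.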
The key take-away here, though, is that there is in some sense a “maximum inference” that you can make given data. If you interpret Occam’s razor as saying that you must always select the simplest explanation, then if you’re using that criterion the explanation you select must have a certain maximum complexity.
(Note that you don’t have to use Occam’s razor as your hypothesis selection criterion, but I’ll address that further down, and it won’t change the gist of my conclusion.)
I bring this up because it’s relevant, but also because I don’t want to harp on it too much: for the sake of this argument, I’m completely fine with assuming that those three images of apples would, in some mathematical sense, be sufficient for discovering the theory of gravity.
I still don’t think any superintelligence would actually be able to make that inference.
There Are Limits to How Fast You Can Perform Inference
When EY wrote his bit about the gravity-finding superintelligence, I think he was trying to capture this concept of a maximum inference. He chose three images of an apple dropping because he figured that would be enough to notice acceleration and get a second derivative. Admittedly, I’m not really sure what he was latching onto with the blade of grass. Maybe he meant the dynamics of how gravity made it bend? Either way, the point is that he was trying to imagine the minimal set of things which contained enough information to deduce gravity.
The fact that he got the maximum inference wrong is sort of incidental to my point. What matters is that I believe there are very significant limits (probably theoretical but definitely practical) to how quickly you can actually perform inference, regardless of the true maximal inference. A “perfect model” that always achieves the maximum inference is a fantasy; it’s likely impossible to even come close.
In computer science, it is extremely common to find this kind of gap, where we know something is technically computable with infinite time, but is effectively impossible in practice (“intractable” is the technical term). And a key fact of this intractability is that it’s not really about how good your computer is: you still won’t be able to solve it. It’s the kind of thing where when I’m talking to another PhD I’ll say the problem “can’t be solved efficiently,” but if I’m talking to a layman I’ll just say “it’s impossible” because that matches the way a normal person uses that word.
For instance, it’s not at all uncommon to find problems that have solutions which can be computed exactly from their inputs, but are still intractable. If I give you a map and a list of cities and ask you to find me the shortest route that passes through all of them (the famous “traveling salesman problem” (TSP)) you should not have to look at a single bit of extra context to solve it: just try all possible routes through all cities and see which is the shortest. The fact that this approach will always get you the right answer means that the solution is within the maximum inference for the data you are given.
But there is no known way to solve that problem exactly without, in the worst case, doing a whole lot of work. By brute force, a couple hundred cities would mean more work than you could fit into the lifespan of the universe with computers millions of times stronger than the best supercomputers in existence. Cleverer exact algorithms do far better on many real instances, but every known one still takes exponential time in the worst case. And this is just one famous example; there are a huge number of instances of this sort of phenomenon, not just with similar problems to the TSP (known as NP-Complete problems), but all over the place. Very often the information you want is deterministically encoded in your data, but you just can’t get to it without unreasonable amounts of computation. The whole field of cryptography basically only exists because of this fact!
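To make the "try all possible routes" approach concrete, here is a minimal brute-force sketch. It assumes cities are points in the plane, distances are straight lines, and the route is a closed tour that returns to its start; `brute_force_tsp` and `tour_count` are illustrative names, not anything from a library.

```python
from itertools import permutations
from math import dist, factorial

def brute_force_tsp(cities):
    """Return (length, route) of the shortest closed tour, trying every ordering."""
    start, *rest = cities
    best_length, best_route = float("inf"), None
    for order in permutations(rest):
        route = (start, *order, start)
        length = sum(dist(a, b) for a, b in zip(route, route[1:]))
        if length < best_length:
            best_length, best_route = length, route
    return best_length, best_route

def tour_count(n):
    """Distinct closed tours through n >= 3 cities (fixed start, direction ignored)."""
    return factorial(n - 1) // 2

cities = [(0, 0), (0, 3), (4, 3), (4, 0)]  # corners of a 4-by-3 rectangle
length, route = brute_force_tsp(cities)
print(length, route)

# Number of decimal digits in the tour count for 200 cities.
print(len(str(tour_count(200))))
```

The answer for the four corners is just the rectangle's perimeter, and nothing beyond the coordinates was needed to find it. The catch is `tour_count`: for 200 cities it exceeds 10^370, which is why exhaustive search stops being an option long before the problem stops being well defined.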
Now, don’t get me wrong here. I’m not saying the existence of these computationally hard problems proves there’s limits to practical learning. If you think about it, it’s actually somewhat dis-analogous. There isn’t really a well-defined way to construct the problem of “discover gravity” in rigorous terms, and it’s not remotely clear what would be the “minimum” data needed to solve it. Certainly, you would need to implicitly understand quite a bit about the real world and physics, about the movement of light and the existence of planets and the relative distances between them and a whole lot of other things too.
But the point is that whatever it looks like to discover gravity, there has to be some kind of step-by-step process behind it. That means that even if you did it with maximal efficiency, there has to be some minimal amount of time that it takes, right? I don’t know what that number of steps is, maybe it is actually quite small, but it seems reasonable to assume it’s large.
Now suppose you counter-argue and say that your zero-knowledge intelligent system was really really good and skipped a few steps. But how did it know to skip those steps? Either your system wasn’t really zero-knowledge, or the number of steps wasn’t minimal, since they could be reduced by a system with no additional data. That’s the heart of my point, really: there has to be a theoretical limit to how fast you can go from nothing to something. Calling something a “superintelligence” doesn’t give it a free pass to break the laws of mathematics.
Precomputation: Intelligence is Just Accumulated Abstraction
If it’s true that there’s a limit to inference speed, then does that mean that there’s a limit to intelligence? Does that rule out superintelligence altogether?
Definitely not. The point is not that there are limits to what can be inferred, just that there are limits to what can be quickly inferred when starting with limited knowledge.
I think there’s a clear way that intelligent systems get around this fundamental barrier: they preprocess things. When you train an intelligent system, what’s really happening is that the system is developing and storing abstractions about the data (i.e. noticing patterns). When new data comes in, the system makes inferences about it by reusing all of the abstractions it’s already stored.
By “abstractions,” I mean rules and concepts that can be applied to solve problems. Consider the art of multiplying integers. I know that as a child, I started multiplying by doing repeated addition, but at some point I memorized the one-digit multiplication table and used that abstraction in a bigger algorithm to perform multiplication of multi-digit numbers. Someone smarter than me might even accumulate a bunch more abstractions until they’re able to do eight-digit multiplication in their head a la Von Neumann.
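The multiplication story can be sketched in code. This is a toy comparison under obvious simplifications: the "memorized" table is filled in using Python's own `*`, and shifting by place value is done with a power of ten. The function names are made up for illustration.

```python
# One-digit times table, "memorized" once up front (the stored abstraction).
# Assumption: we fill it with Python's built-in * purely for convenience.
TIMES_TABLE = {(a, b): a * b for a in range(10) for b in range(10)}

def multiply_by_repeated_addition(x, y):
    """Multiply non-negative ints using only addition; returns (product, additions)."""
    total, additions = 0, 0
    for _ in range(y):
        total += x
        additions += 1
    return total, additions

def multiply_with_table(x, y):
    """Schoolbook long multiplication; returns (product, table lookups)."""
    xs = [int(d) for d in reversed(str(x))]
    ys = [int(d) for d in reversed(str(y))]
    total, lookups = 0, 0
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            # Look up the one-digit product, then shift it to its place value.
            total += TIMES_TABLE[(a, b)] * 10 ** (i + j)
            lookups += 1
    return total, lookups

print(multiply_by_repeated_addition(4321, 8765))
print(multiply_with_table(4321, 8765))
```

Both functions return the same product, but the first needs 8765 additions while the second needs only 16 table lookups. The saving comes entirely from work done earlier, when the table was stored.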
And patterns can be repurposed for use in different contexts. This is one of the most interesting facts about modern deep learning. And I’m not just talking about retraining a dog-detector to detect cats, or any of the more banal examples of neural network fine-tuning. I’m talking about the fact that a mostly unmodified GPT-2 can still perform image identification with reasonable quality. This is possible because somehow a bunch of the structures and abstractions of language are still useful for image understanding.
When you look at it this way, you realize that the speed at which you accumulate these abstractions is almost secondary. What really matters instead is an intelligent system’s capacity for storing and applying them. That’s why we need to make LLMs so big, because that gives them a lot more space to fit in larger and more complex structures. The line does get a little blurry if you think about it too much, but fundamentally intelligence is really much more about the abstractions your model already has, as opposed to its ability to make new ones.
Inductive Bias: The Knowledge You Start With
There’s a bit of an elephant in the room, a concept that complicates the whole issue quite a bit if you’re familiar with it: the notion of “Inductive Bias.”
Inductive bias refers to the things that, right from the get-go, your model’s structure makes it well-suited to learn. You can think of it as the process your model uses for considering and selecting the best hypotheses. “Occam’s Razor” is a very common inductive bias, for instance, but it isn’t the only one; it’s arguably not even the best one.
In practice, inductive biases can mean all sorts of different things. It could be explicit capacities like having a built-in short term memory, or more subtle and abstract aspects of the model’s design. For instance, the fact that neural networks are layered makes them inherently good at modeling hierarchies of abstractions, and that is an inductive bias that gives it an edge over many alternative machine learning paradigms. In fact, a huge part of designing a neural network architecture is building in the right inductive biases to give you the outcome you want.
An example: for a long time, the most common neural network for interpreting images (called Convolutional Neural Networks or CNNs) operated by sliding a window and looking at only a small portion of an image at a time, instead of feeding the entire thing in. This made the network immune to certain kinds of mistakes: it became impossible that shifting the image over a few pixels to the left could change the outputs. The result was an inductive bias that significantly improved performance.
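A one-dimensional toy version shows where the shift tolerance comes from. This is a sketch, not how real CNNs are implemented: the kernel is hand-picked rather than learned, and the network is reduced to a single sliding window followed by a global max. Strictly speaking, the window alone makes the response map move along with the input; it is the pooling step on top that makes the final output identical. Real CNNs are only approximately shift-invariant.

```python
def slide_window(signal, kernel):
    """Apply the same small kernel at every position (1-D 'valid' cross-correlation)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(len(signal) - k + 1)
    ]

def detect(signal, kernel):
    """Sliding window plus global max: how strongly does the pattern appear anywhere?"""
    return max(slide_window(signal, kernel))

kernel = [1, -1, 1]   # hypothetical hand-picked pattern detector
pattern = [2, 0, 2]

original = [0, 0, 0, 0] + pattern + [0, 0, 0]
shifted = [0, 0] + pattern + [0, 0, 0, 0, 0]   # same pattern, two steps to the left

print(slide_window(original, kernel))
print(slide_window(shifted, kernel))
print(detect(original, kernel) == detect(shifted, kernel))
```

The two response maps are the same values offset by two positions, and the pooled detection score is unchanged. Nothing had to be learned for that to hold; it falls out of the structure, which is what makes it an inductive bias.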
Armed with this concept, one has a natural counterargument to my point above: couldn’t the super-intelligence just start with an inductive bias that made it really really well-suited to learning gravity?
The answer is “yes,” but I don’t think it changes much.
For one, when you really think about it, the line between “inductive bias” and “learning” is far blurrier than it might seem. Here’s an extreme example: suppose we initialized a random LLM-sized neural network and, by impossibly dumb luck, it winds up having *exactly* the same parameter values as one which was fully trained. All of that knowledge, all of that structure and all of those accumulated abstractions could not really be considered learned, right? They’re really just more information you’re using to select the correct hypothesis to fit your data. That makes them inductive bias, right? What else would they be?
Of course, that example would never happen. But the fundamental point is there: inductive biases are, in some sense, just the abstractions that you start out with. Yes, depending on the model they may invoke harder or softer constraints than knowledge gained through other methods, but they really are just a form of built-in knowledge. It’s all the same stuff, it’s just a question of where you get it.
So before you tell me that this zero-knowledge superintelligence just has all the right inductive biases, consider that those inductive biases are still just built-in knowledge. And then consider this: all that knowledge has to come from somewhere.
Sutton’s Bitter Lesson
Before the age of deep learning, or even arguably during it, there was a thriving field of computer vision and natural language processing built on what was generally referred to as “hand-crafted” features. Instead of learning the statistical patterns in images or language as we do now, researchers would manually identify the types of patterns that seemed meaningful, and then program classifiers around those patterns. Things like “histograms of oriented gradients” which would find edges in images and count up how many were facing which direction.
It should come as no surprise that these methods all basically failed, and no one really uses them anymore. Hell, we barely use those CNNs I mentioned in the last section, the ones that process with sliding windows. Instead we use what are called transformers, which are in some sense more basic architectures; they build in far fewer assumptions about how the data is structured. In fact, there’s evidence that as transformers learn images they actually develop on their own the same structures used by CNNs.
The way I see it, “hand-crafted” features are just building systems with carefully crafted inductive biases. It’s smart humans making educated guesses about what structure a system ought to have, and getting semi-useful models as a result. It’s a more realistic version of my example from the last section, where I suggested pre-setting the weights of a neural network into their final configuration. But it turns out it’s really really hard to construct an intelligent system manually; it seems it’s always better to let the structure emerge through learning.
The famous researcher Richard Sutton wrote about this idea in an essay called The Bitter Lesson. His argument was the same as the one I just made: hand-crafting fails time and time again, and the dominant approach always turns out to be scaled-up learning. I am merely rephrasing it here in terms of inductive bias.
There’s No Skipping the Line
But if we’re taking a step back and looking at the larger picture, all our efforts to create hand-crafted classifiers were really just efforts to distill our own knowledge into another model. It’s a way to try to get the abstractions in our brains into some other system. And, sure, that’s really hard, but I don’t even think that’s the hardest part. Even if we could have constructed hand-crafted features that were really good at identifying what was in images, there’s no way you’d ever get something as dynamic as an LLM.
That’s because, just like how I believe an author of fiction can’t really write a character more intelligent than themselves, you probably can’t hand-craft something smarter than yourself. There’s too much of what I would call intellectual overhead in designing intelligence—to really model your own mind, you have to understand not just the abstractions you are using, but also the abstractions that let you understand those abstractions.
Nor do I think you could just luck into the right inductive biases. Sure, there’s a certain degree to which you may get a little lucky, but the space of possible model configurations is probably 100 billion orders of magnitude too large. No, the default state of any model is going to be completely unstructured (maximum entropy, as they say), so any structural intelligence is going to have to be either designed or learned.
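For a rough sense of scale, one can count the distinct bit patterns a model's weights can take. The figures below are assumptions chosen for illustration (a GPT-3-sized model with 175 billion parameters, stored at 16 bits each), and `orders_of_magnitude` is a made-up helper.

```python
from math import log10

def orders_of_magnitude(num_parameters, bits_per_parameter):
    """log10 of the number of distinct bit patterns the weights can take."""
    return num_parameters * bits_per_parameter * log10(2)

# Assumed for illustration: 175 billion parameters at 16 bits per parameter.
oom = orders_of_magnitude(175e9, 16)
print(f"about 10^{oom:.3g} configurations")
```

Under these assumptions the exponent itself is in the hundreds of billions. Many different configurations may behave equally well, so this overstates how unlikely a lucky draw is, but it does show why stumbling into a fully structured model by chance is not a serious option.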
This leads me to conclude that the only way we’d see a superintelligence with the right inductive biases to discover gravity off the bat would be if it was hand-designed by another superintelligence. That doesn’t really buy us much, because that second superintelligence still had to come from somewhere, and as I argued above, it would probably be smarter than its creation in that case anyway, at least in the beginning. So that scenario is really more of a loophole than a rebuttal.
The Evolution of Human Intelligence
When we start to apply these ideas to human intelligence, I think we get some interesting comparisons.
For one thing, inductive bias *does* play a very large role for humans. We come pre-programmed with a lot of knowledge, and a lot of capacity to learn more. I think most people here probably believe in IQ or some equivalent concept (g-factor or what have you), and this maps pretty much exactly to inductive biases. But even if you don’t believe in that I think there’s ample evidence that we have quite a bit of knowledge built in. It seems obvious that we’re optimized for, say, recognizing human faces, and we also seem to be pretty optimized to learn language. A young chess prodigy must have some sort of inductive bias for chess, how else could they get so good so fast? And if you want to look at the animal kingdom, you’ll see many animals are born already knowing how to walk or swim.
Of course, this is very different from LLMs, where as I explained the inductive biases are minimal and nearly the entirety of their knowledge comes from direct learning. But I think I’ve made a compelling argument that this doesn’t really matter in the end—it’s where your system winds up that counts.
I think when we compare humans and AI, what we’re really observing is the radical difference between human engineering and natural selection as design processes. A member of a species is one of very many and doesn’t live that long. The only way nature could possibly produce intelligence is by tweaking the inductive biases over countless generations, building organisms that come into existence with more and more ability to learn new things in a reasonable amount of time. Note as well that learning capacity—the maximum complexity of the abstractions a model can store—comes into this picture again, since that’s another knob nature can turn.
Meanwhile, you only have to design a digital AI once, and then it can be saved, copied, moved, or improved upon directly. And we’ve already established humans suck at developing inductive biases. So of course we would need to build general learning systems and train them for an eternity. We don’t have the time to do it the way nature did!
Conclusion: Training an AI System Must Be Slow
One conclusion of all this, which I’ve mentioned a few times now, is that training any AI system will be inherently slow and wasteful. I think at this point it should be clear why this conclusion follows from everything above: given some data and a limited amount of processing time, a system can only make new inferences as a function of the knowledge it already has. Forget about the maximum inference, you’re not going to be able to infer anything that can’t be concluded immediately from the abstractions you already have. You’re not going to be able to learn to exponentiate without learning to multiply first.
This means that at any given step of training, there’s a maximum to what you can learn in the next step, and especially in the beginning that’s going to be a lot less than the theoretical maximum. There’s no way around it. You can sort of brush it off by saying that you just start out smarter, but that intelligence still needs to come from somewhere. I genuinely do not believe humans are smart enough to just inject that intelligence at the get-go, and we just established that learning is probably inherently wasteful. The rest just sort of follows, and makes things like LLM training (which involves terabytes of text) kind of inevitable.
Of course, “fast” hasn’t been defined rigorously here, but I think the point still stands broadly: nothing goes instantly from zero-to-superintelligent.
Final Thoughts: Why the Bet on Reinforcement Learning Didn’t Pay Off
I also think this explains another interesting question: Why didn’t the AI field’s big bet on reinforcement learning ever pay off?
Most of you are probably familiar with reinforcement learning (RL), but in case you aren’t, RL is best summarized as trial-and-error learning (look at the world, take an action, get some kind of reward, repeat). Right at the start of the Deep Learning craze DeepMind made a huge name for itself with a famous paper where they combined deep learning with RL and made an AI system that could perform really well on a bunch of Atari games. It was a big breakthrough, and it spawned a huge amount of research into Deep RL.
At the time, this really did seem like the most likely path to AGI. And it made a lot of sense: RL definitely seems to be a good description of the way humans learn. The problem was, as I heard one RL researcher say once, it always seemed as if the AI didn’t really “want” to learn. All of RL’s successes, even the huge ones like AlphaGo (which beat the world champion at Go) or its successors, were not easy to train. For one thing, the process was very unstable and very sensitive to slight mistakes. The networks had to be designed with inductive biases specifically tuned to each problem.
And the end result was that there was no generalization. Every problem required you to rethink your approach from scratch. And an AI that mastered one task wouldn’t necessarily learn another one any faster.
Thus, it seems that most RL research never really moved the needle towards AGI the way GPT-3 did.
Of course, it wasn’t like anyone could have just built GPT-3 in 2013. For a long time building an LLM would have been utterly impossible, because no one knew how to build neural networks that didn’t “saturate” when they got too big. Before the invention of the transformer, all previous networks stopped getting much better past a certain size.
But at the end of the day, those RL systems, by starting from nothing, weren’t building the kind of abstractions that would let them generalize. And they never would, as long as they were focusing on such a narrow range of tasks. They needed to first develop the rich vocabulary of reusable abstractions that comes with exposure to way more data.
It’s possible we’ll see a lot of those complex RL techniques re-applied on top of LLMs. But it seems the truth is that once you have that basic bedrock of knowledge to build upon, RL becomes a lot easier. After LLMs do an initial training on a giant chunk of the internet, RL is currently used to refine them so they respond to human commands, don’t say racist things, etc. and this process is much more straightforward than the RL of yore. So it’s not that RL doesn’t work, it’s just that by itself it isn’t enough to give you the foundation knowledge needed for generalization.
Admittedly, it’s very rare that these limits on efficiency are actually proven, at least in the most general case, since no one’s proven that P != NP. But there is a lot of evidence that this is true.
Technically this isn’t proven, but a whole lot of smart people believe it’s true. P does not equal NP and all that.
It’s possible, maybe even likely, that if you actually could do the math on this you would find that the challenge of discovering gravity is really just doable in linear time given the minimal amount of required data. Maybe, who knows? None of this is well defined. I suspect the constant factors would still be very large, though.
Unless you already built the first N-1 steps into your system. Let’s not get ahead of ourselves though; I’ll address that.
Here’s one last salient example: the field of mathematics itself. Technically there is no input data at all and all provable things are already provable before you even start to do any work. And yet I’d bet good money that there’s a hard limit to how fast any intelligence could infer certain mathematical facts. And of course, many formal proof systems in mathematics actually have the property that there will always exist statements that take an arbitrary amount of effort to prove.
Actually, what I really did as a little kid was guess a number close to what I thought it was and refine from there, but that’s just another less elegant (and highly probabilistic) abstraction.
I guess you could come up with a different term, but that’s not the point. The point is whatever that knowledge is, it isn’t “learned.”
Their attention mechanisms develop into shift-invariant Toeplitz matrices. (Pay Attention to MLPs)
Although, is human learning actually more like fine-tuning? Maybe. Let’s not get into that; the argument would follow the same trajectory as the rest of this post anyway.
Yes, evolution is a design process, it’s just not an intelligent design process.
Obviously I’m not saying that LLMs won’t get far more efficient to train in the coming years, just that they’ll always require a certain minimum of resources. I’m also not giving a rigorous definition of “fast.” The exact value of that doesn’t matter; my points are more about the dynamics of learning.
If you want to get technical, the LLM is trained with RL during the whole process, since “next token prediction” is a special case of RL. But I don’t want to get that technical and I think my point is clear enough.