LLMs May Find It Hard to FOOM

Epistemic status: some of the technological progress parts of this I’ve been thinking about for many years, other more LLM-specific parts I have been thinking about for months or days.

TL;DR An extremely high-capacity LLM trained on extremely large amounts of content from humans will simulate human content extremely accurately, but won’t simulate content from superhumans. I discuss how we might try to change this, and show that this is likely to be an inherently slow process with unfavorable scaling power laws. This might make a fast take-off difficult for any AI based on LLMs.

LLMs are trained as simulators for token-generating processes, which (in any training set derived from the Internet) are generally human-like or human-derived agents. The computational capacity of these simulated agents is bounded above by the forward-pass computational capacity of the LLM, but is not bounded below. An extremely large LLM could, and frequently will, produce an exquisitely accurate portrayal of a very average human: a sufficiently powerful LLM may be very superhuman at the task of simulating normal humans with an IQ of around 100, far better at it than any human writer or improv actor — but whatever average human it simulates won’t be superhuman, and that capability is not FOOM-making material.

Suppose we had an LLM whose architecture and size were computationally capable in a single forward pass of doing a decent simulation of a human with an IQ of, let’s say, 1000 (to the extent that an IQ that high is even a meaningful concept: let’s make this number better defined by also assuming that this is 10 times the forward-pass computational capacity needed to do a decent simulation of an IQ 100 human). In its foundation model form, this LLM is never going to actually simulate someone with IQ ~1000, since its pretraining distribution contains absolutely no text generated by humans with an IQ of ~1000 (or even IQ over 200): this behavior is way, way out-of-distribution. Now, the Net (and presumably the training set derived from it for this LLM) does contain plenty of text generated very slowly and carefully with many editing passes and much debate by groups of people in the IQ ~100 to ~145 range, such as Wikipedia and scientific papers, so we would reasonably expect the foundation model of such a very capable LLM to also learn the superhuman ability to generate texts like these in a single pass without any editing. This is useful, valuable, and impressive, and might well help somewhat with the early stages of a FOOM, but it’s still not the same thing as actually simulating agents with IQ 1000, and it’s not going to get you to a technological singularity: at some point, that sort of capability will top out.

But that’s just the foundation model. Next, the model presumably gets tuned and prompted somehow to get it to simulate (hopefully well-aligned) smarter human-like agents, outside the pretraining distribution. A small (Gaussian-tail-decreasing) amount of pretraining text from humans with IQs up to ~160 (around four standard deviations above the mean) is available, and let us assume that very good use is made of it during this extrapolation process. How far would that let the model accurately extrapolate out-of-distribution, past where it has basically any data at all: to IQ 180, probably; 200, maybe; 220, perhaps?

If Hollywood is a good guide, IQ 80-120 humans are abysmal at extrapolating what a character with IQ 160 would do: any time a movie character is introduced as a genius, there is a very predictable set of mistakes that you instantly know they’re going to make during the movie (unless their doing so would actively damage the plot). With the debatable exceptions of Real Genius and I.Q., movie portrayals of geniuses are extremely unrealistic. Yet most people still enjoy watching them. The Big Bang Theory was probably the first piece of mass media to accurately portray rather smart people (even if it still put a lot of emphasis on their foibles), and most non-nerds didn’t react as though this was particularly new, original, or different.

Hollywood/​TV aside, how hard is it to extrapolate the behavior of a higher intelligence? Some things, what one might call zeroth-order effects, are obvious. The smarter system can solve problems the dumber one could usually solve, but faster and more accurately: what one might describe as “more-of-the-same” effects, which are pretty easy to predict. There are also what one might call first-order effects: the smarter system has a larger vocabulary, and has learnt more skills (having made smarter use of available educational facilities and time). It can pretty reliably solve problems that the dumber one can only solve occasionally. These are what one might call “like that but more so” effects, and are still fairly easy to predict. Then there are what one might call second-order effects: certain problems that the dumber system had essentially zero chance of solving, the smarter system can sometimes solve: it has what in the AI business are often called “emergent abilities”. These are frequently hard to predict, and especially so if you’ve never seen any systems that smart before. [There is some good evidence that using metrics that effectively have skill thresholds built into them greatly exaggerates the emergentness of new behaviors, and that on more sensible metrics almost all new behaviors emerge slowly with scale. Nevertheless, there are doubtless things that anyone with IQ below 180 just has 0% probability of achieving, these are inherently hard to predict if you’ve never seen any examples of them, and they may be very impactful even if the smarter system’s success chance for them is still quite small: genius is only 1% inspiration, as Edison pointed out.] Then there are the third-order consequences of its emergent abilities: those emergent abilities combining and interacting with each other and with all of its existing capabilities in non-obvious ways, which are even harder to predict. Then there are fourth-and-higher order effects: to display the true capabilities of the smarter system, we need to simulate not just one that spent its life as a lonely genius surrounded by dumber systems, but instead one that grew up in a society of equally-smart peers, discussing ideas with them and building on the work of equally-smart predecessors, educated by equally-smart teachers using correspondingly sophisticated educational methods and materials.

So I’m not claiming that doing a zeroth-order or even first-order extrapolation up to IQ 1000 is very hard. But I think that adding the second, third, fourth, and fifth-plus-order effects to that extrapolation is increasingly hard, and I think those higher-order effects are large, not small, in importance compared to the zeroth- and first-order terms. Someone who can do what an IQ 100 person can do but at 10x the speed, while using 10x the vocabulary and with a vanishingly small chance of making a mistake, is nothing like as scary as an actual suitably-educated IQ 1000 hypergenius standing on the shoulders of generations of previous hypergeniuses.

Let’s be generous, and say that a doable level of extrapolation from IQ 80-160 gets the model to be able to reasonably accurately simulate human-like agents all the way up to maybe about IQ 240. At this point, I can only see one reasonable approach to get any further: you need to have these IQ 240 agents generate text. Lots and lots of text. As in, at a minimum, of the order of an entire Internet’s worth of text. Probably more, since IQ 240 behavior is more complex and almost certainly needs a bigger data set to pin it down. After that, we need to re-pretrain our LLM on this new training set.

[I have heard it claimed, for example by Sam Altman during a public interview, that a smarter system wouldn’t need anything like as large a dataset to learn from as LLMs currently do. Humans are often given as an existence proof: we observably learn from far fewer text/​speech tokens than a less capable LLM does. Of course, the number of non-text token-equivalents from vision, hearing, touch, smell, body position and all our other senses we learn from is less clear, and could easily be a couple of orders of magnitude larger than our text and speech input. However, humans are not LLMs, and we have a lot more inbuilt intuitive biases from our genome. We have thousands of different types of neurons, let alone combinations of them into layers, compared to a small handful for a transformer. While much of our recently-evolved overinflated neocortex has a certain ‘bitter-lesson-like’ “just scale it!” look to it, the rest of our brain looks very much like a large array of custom-evolved purpose-specific modules all wired together in a complicated arrangement: a most un-bitter-lesson-like design. The bitter lesson isn’t about what gives the most efficient use of processing power, it’s about what allows the fastest rate of technological change: less smart engineering and more raw data and processing power. Humans are also, as Eliezer has pointed out, learning to be one specific agent of our intelligence level, not how to simulate any arbitrary member of an ensemble of them. As for the LLMs, it’s the transformer model that is being pretrained, not the agents it can simulate. LLMs don’t use higher-order logic: they use stochastic gradient descent to learn to simulate systems that can do higher-order logic. Their learning process doesn’t get to apply the resulting higher-order logic to itself, by any means more sophisticated than SGD descending the gradient curve of its outputs toward a closer match to whatever answers are in the pretraining set. So I see no reason to expect that the scaling laws for LLMs are going to suddenly, magically, and dramatically improve to more human-like dataset sizes as our LLMs “get smarter”. Your mileage may of course vary, but I think Sam Altman was either being extremely optimistic, prevaricating, or else expects to stop using LLMs at some point. This does suggest that there could be a capability overhang, telling us that LLMs are not computationally efficient, or at least not data-efficient — they’re just efficient in a bitter-lesson sense, as the quickest technological shortcut to building a brain: a lot faster than whole brain emulation or reverse engineering the human brain, but quite possibly less efficient, or at the very least less data-efficient.]

If that’s the case, then as long as we’re using LLMs, the Chinchilla scaling laws will continue to apply, unless and until they taper off into something different (by Sod’s law, probably worse). An LLM capable of simulating IQ 240 well is clearly going to need at least 2.4 times as many parameters as one for IQ 100 (I suspect it might be more like 2.4² — but I can’t prove it, so let’s be generous and assume I’m wrong here). So by Chinchilla, that means we’re going to need 2.4 times as large a training set generated at IQ ~240 as the IQ ~100 Internet we started off with. So that’s 2.4 times as much content, all generated by things with 2.4 times the inherent computational cost, for a total cost of around 2.4² ≈ 5.8 times that of creating the Internet. [And that’s assuming the simulation doesn’t require any new physical observations of the world, just simulated thinking time, which seems highly implausible to me: some of those IQ 240 simulations will be of scientists writing papers about the experiments they performed, which will actually need to be physically performed for the outputs to be any use.]
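To make that arithmetic concrete, here is a minimal back-of-envelope sketch in Python. It simply encodes the assumptions above (parameters and Chinchilla-optimal data scaling linearly with the IQ ratio, and per-token generation cost scaling with the compute of the simulated author); the function name and the numbers are illustrative only, not established scaling results.

```python
# Back-of-envelope sketch of the cost argument above. Assumptions (the post's, plus mine):
# - parameters scale linearly with the IQ ratio (the post's generous lower bound)
# - Chinchilla: training tokens scale linearly with parameters
# - cost per generated token scales linearly with the compute of the simulated author

def synthetic_internet_cost_multiple(iq_target: float, iq_base: float = 100.0) -> float:
    """Cost of generating (and training on) an IQ `iq_target` Internet,
    relative to the almost-free-to-collect IQ ~100 human Internet."""
    k = iq_target / iq_base              # e.g. 2.4 for IQ 240
    tokens_multiple = k                  # Chinchilla-optimal data grows with parameters, ~k
    cost_per_token_multiple = k          # each token comes from a k-fold costlier simulation
    return tokens_multiple * cost_per_token_multiple   # ~ k**2

print(synthetic_internet_cost_multiple(240))    # ~5.8x the cost of one Internet's worth of text
print(synthetic_internet_cost_multiple(1000))   # ~100x, if attempted in a single jump
```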

On top of the fact that our first Internet was almost free (all we had to do was spider and filter it, rather than simulate the writing of it), that’s a nasty power law. We’re going to need to do this again, and again, at more stages on the way to IQ ~1000, and each time we increase the IQ by a factor of k, the cost of doing this goes up by k² [again, ignoring physical experiment costs].

Now, let’s remove the initial rhetorical assumption that we have an LLM much more powerful than we need, and look at this more realistically as part of an actual process of technological development, one that needs to be repeated every time our LLM-simulated agents get k-fold smarter. The “create, and then re-pretrain from, a bigger Internet” requirement remains, and the computational cost of doing this still goes up by k². That’s not an encouraging formula for FOOM: it looks more like a formula for a subexponential process where each generation takes k times longer than the last (on the assumption that our computational power had gone up k-fold, enough to run our smarter agents at the same speed).
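Written out as a recurrence (my notation, under the assumptions just stated), the subexponential claim looks like this:

```latex
% Recurrence for the time per "generation" of k-fold smarter agents (my notation):
%   C_n = cost of producing generation n's synthetic Internet
%   P_n = compute available while producing it
%   T_n = wall-clock time for generation n
\begin{align*}
C_{n+1} &\approx k^2\, C_n &&\text{($k$-fold more tokens, each $k$-fold costlier to generate)}\\
P_{n+1} &\approx k\, P_n   &&\text{(just enough to run the $k$-fold smarter agents at the old speed)}\\
T_{n+1} &= \frac{C_{n+1}}{P_{n+1}} \approx k\, T_n &&\text{(each generation takes $k$ times longer: subexponential)}
\end{align*}
```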

[Is a k-fold improvement in computational capacity between generations a plausible assumption, if we get to rebuild our processing hardware each time our agents get k-fold smarter? At first, almost certainly not: things like computational capacity normally tend to go up exponentially with technological generations, which is what a k-fold increase in IQ should empower, so something like c^k for some constant c > 1. With that in the divisor of the time between generations, a fiddling little k² in the numerator isn’t going to stop the process being superexponential. However, I suspect, for fairly simple physical reasons, that processing power per atom of computronium at normal temperatures has a practical maximum, and the speed of light traveling between atoms limits how fast you can run something of a given complexity, so the only way to continually geometrically increase your processing power is to geometrically increase the proportion of the planet (or solar system) that you’ve turned into computronium and its power supplies (just like humans have been doing for human brain computronium), which in turn has practical limits: large ones, but ones a geometrical process could hit soon. Sufficiently exotic not-ordinary-matter forms of computronium might modify this, but repeating this trick each technological generation is likely to be hard, and this definitely isn’t a type of FOOM that I’d want to be on the same planet as. Once you start capping out near the theoretical limits of the processing power of ordinary matter for your computronium, and you’ve picked all the low-hanging fruit on algorithmic speedups, progress isn’t going to stay exponential, and I find it really hard to predict what power law it might asymptote towards: you’re left with algorithmic speedups from better organizing your hierarchical speed-of-light-capped data-flows. A case could be made that fundamental limits are limits, and that it asymptotes to O(1), but that feels a bit too pessimistic to me. So a k² in the numerator may matter, or it may still be negligible, and most of my uncertainty here is on the power law of the denominator. For now, I’m going to very arbitrarily assume that it asymptotes to O(k), which is enough for the overall process to be superexponential before accounting for the numerator, but subexponential afterwards. That happens to be the power law that lets us run our k-fold smarter agents at the same speed, despite their increased complexity. Yes, I’m cherry-picking the exponent of a hard-to-predict power law in order to get an interesting result — please note the word ‘May’ in the title of the article.]
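A toy numerical sketch of the two compute-growth regimes discussed in the bracketed paragraph above, with illustrative values I have picked (k = 2 per generation, c = 4 for the exponential-compute case); nothing here is an empirical estimate:

```python
# Toy comparison of the two compute-growth regimes discussed above
# (illustrative numbers only: k = 2 per generation, c = 4 in the exponential case).

k = 2.0           # each generation of agents is k-fold "smarter"
c = 4.0           # base of the exponential compute-growth regime
cost = 1.0        # cost of building this generation's synthetic Internet (arbitrary units)
compute_exp = 1.0 # compute, if it keeps growing exponentially (~ c**k per generation)
compute_lin = 1.0 # compute, if its growth asymptotes to O(k) per generation

for gen in range(1, 6):
    cost *= k ** 2            # k-fold more tokens, each k-fold costlier
    compute_exp *= c ** k     # exponential regime
    compute_lin *= k          # capped, O(k) regime
    print(f"gen {gen}: time (exp compute) = {cost / compute_exp:.3g}, "
          f"time (O(k) compute) = {cost / compute_lin:.3g}")

# Generation times shrink in the first column (superexponential progress)
# and grow k-fold each step in the second (subexponential progress).
```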

The standard argument for the likelihood of FOOM and the possibility of a singularity is the claim that technological progress is inherently superexponential, inherently a J-shaped curve: that progress shortens the doubling time to more progress. If you look at the history of Homo sapiens’ technology from the origin of our species to now, it definitely looks like a superexponential J-shaped curve. What’s less clear is whether it still looks superexponential if you divide the rate of technological change by the human population at the time, or equivalently use cumulative total human lifespans rather than time as the x-axis. My personal impression is that if you do that then it looks like the total number of human lifespans between different major technological advances is fairly constant, and it’s a boring old exponential curve. If I’m correct, then the main reason for Homo sapiens’ superexponential curve is that technological improvements also enlarge the human population-carrying capacity of the Earth, and improve our communication enough to handle this, so they let us do more invention work in parallel. So I’m not entirely convinced that technological change is in fact inherently superexponential, short of dirty (or at least ecologically unsound) tricks like that, which might-or-might-not be practicable for an ASI trying to FOOM to replicate. [Of course, Homo sapiens wasn’t actually getting smarter, only more numerous, better educated and better interconnected — that could well make a difference.]

However, even if I’m wrong and technology inherently is a superexponential process, this sort of power law is a plausible way to convert a superexponential back into an exponential or even a subexponential. Whether this happens depends on just how superexponential your superexponential is: so the expected FOOM may, or may not, instead be just a rising curve with no singularity within any finite time.

Now, one argument here would be that this is telling us that LLMs are inefficient, that our AIs need to switch to building their agent minds directly, and that this is just a capability overhang. But I think the basic argument still applies, even after doing this: something k times smarter needs k times the processing power to run. But to reach its full capability, it also needs suitable cultural knowledge as developed by things as smart as it. That will be bigger, by some power of k, call it k^c, than the cultural knowledge needed by the previous generation. So the total cost of generating that knowledge goes up by a power of k^(1+c). I’m pretty sure c will be around 1 to 2, so 1+c is in the range of around 2 to 3. So changing to a different architecture still doesn’t get rid of the unhelpful power law. Chinchilla is a general phenomenon: to reach their full potential, smarter things need more training data, and the cost of creating that data (and training on it) goes up as the product of the two.
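In symbols (my notation), the architecture-independent version of this argument is:

```latex
% Architecture-independent version of the cost argument (my notation):
%   D_n = cultural knowledge / training data needed by generation n
%   each unit of that data costs ~k times more to produce, since its authors need k times the compute
\begin{align*}
D_{n+1} &\approx k^{c}\, D_n, \qquad c \approx 1\text{ to }2\\
\mathrm{Cost}_{n+1} &\approx k \cdot k^{c}\,\mathrm{Cost}_n \;=\; k^{1+c}\,\mathrm{Cost}_n, \qquad 1+c \approx 2\text{ to }3
\end{align*}
```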

So, I’m actually somewhat dubious that FOOM or a singularity is possible at all, for any cognitive architecture, given finite resource limits, once your computronium efficiency starts to max out. But it definitely looks harder with LLM scaling laws.

So, supposing that all this hypothesizing were correct, then what would this mean for ASI timelines? It doesn’t change timelines until a little after transformative AGI is achieved. But it would mean that the common concern that AGI might be followed only a few years or even months later by a FOOM singularity was mistaken. If so, then we would find ourselves able to cheaply apply many (hopefully well-aligned) agents that were perhaps highly superhuman in certain respects, but that overall were merely noticeably smarter than the smartest human who has ever lived, to any and all problems we wanted solved. The resulting scientific and technological progress and economic growth would clearly be very fast. But the AIs may tell us that, for reasons other than simple processing power, creating an agent much smarter than them is a really hard problem. Or, more specifically, that building it isn’t that hard, but preparing the syllabus for properly training/​educating it is. They’re working on it, but it’s going to take even them a while. Plus, the next generation after that will clearly take longer, and the one after that longer still, unless we want to allow them to convert a growing number of mountain ranges into computronium and deserts into solar arrays. Or possibly the moon.

Am I sure enough of all this to say “Don’t worry, FOOM is impossible, there definitely will not be a singularity”? No. In addition to the uncertainty about how effective processing power per atom asymptotes, Grover’s algorithm running on quantum hardware might change the power laws involved just enough to make things superexponential again, say by shifting a k² to a k. Or we might well follow the “Computronium/​population growth? Sure!” path to a J-shaped curve, at least until we’ve converted the Solar System into a Dyson swarm. However, this argument has somewhat reduced my P(FOOM). Or at least my P(FOOM from LLM-based AI).



PostScript Edit: Given what I’ve been reading due to recent speculations around Q* since I wrote this post, plus some of the comments below, I now want to add a significant proviso to it. There are areas, such as mathematics, physical actions in simulated environments, and perhaps also coding, where it’s possible to get rapid and reliable objective feedback on correctness at not-exorbitant cost. For example, in mathematics, systems such as automated proof checkers can check sufficiently detailed mathematical proofs (written in Lean or some equivalent language), and, as sites like HackerRank demonstrate, automated testing of solutions to small software problems can also be achieved. So in areas like these, where you can arrange to get rapid, accurate feedback, automated generation of high-quality synthetic training data to let you rapidly scale performance up to far superhuman levels may be feasible. The question is, can this be extended to a wide enough variety of different training tasks to cover full AGI, or at least to cover all the STEM skills needed to go FOOM? I suspect the answer is no, but it’s a thought-provoking question. Alternatively, we might have IQ 1000 AI mathematicians a long time before we have the same level of performance in fields like science, where verifying that research is correct and valuable takes a lot longer.
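As a concrete illustration of the kind of rapid, objective feedback described above, here is a deliberately trivial Lean 4 example (the particular theorem is mine, chosen only for brevity): the proof checker either accepts the term or rejects it, with no human judgment in the loop.

```lean
-- A trivial, machine-checkable statement: the Lean kernel either accepts this
-- proof term or rejects it, providing the kind of fast, objective signal that
-- automated generation of synthetic mathematical training data could be built on.
theorem sum_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```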