Some Arguments Against Strong Scaling

There are many people who believe that we will be able to get to AGI by basically just scaling up the techniques used in recent large language models, combined with some relatively minor additions and/or architectural changes. As a result, there are people in the AI safety community who now predict timelines of less than 10 years, and structure their research accordingly. However, there are also people who still believe in long(er) timelines, or at least that substantial new insights or breakthroughs will be needed for AGI (even if those breakthroughs in principle could happen quickly). My impression is that the arguments for the latter position are not all that widely known in the AI safety community. In this post, I will summarise as many of these arguments as I can.

I will almost certainly miss some arguments; if so, I would be grateful if they could be added to the comments. My goal with this post is not to present a balanced view of the issue, nor is it to present my own view. Rather, my goal is just to summarise as many arguments as possible for being skeptical of short timelines and the “scaling is all you need” position.

This post is structured into four sections. In the first section, I give a rough overview of the “scaling is all you need” hypothesis, together with a basic argument for that hypothesis. In the second section, I give a few general arguments in favour of significant model uncertainty when it comes to arguments about AI timelines. In the third section, I give some arguments against the standard argument for the “scaling is all you need” hypothesis, and in the fourth section, I give a few direct arguments against the hypothesis itself. I then end the post with a few closing words.


LLM—Large Language Model
SIAYN—Scaling Is All You Need

The View I’m Arguing Against

In this section, I will give a brief summary of the view that these arguments oppose, as well as provide a standard justification for this view. In short, the view is that we can reach AGI by more or less simply scaling up existing methods (in terms of the size of the models, the amount of training data they are given, and/or the number of gradient steps they take, etc). One version says that we can do this by literally just scaling up transformers, but the arguments will apply even if we relax this to allow scaling of large deep learning-based next-token predictors, even if they would need to be given a somewhat different architecture, and even if some extra thing would be needed, etc.

Why believe this? One argument goes like this:

(1) Next-word prediction is AI complete. This would mean that if we can solve next-word prediction, then we would also be able to solve any other AI problem. Why think next-word prediction is AI complete? One reason is that human-level question answering is believed to be AI-complete, and this can be reduced to next-word prediction.

(2) The performance of LLMs at next-word prediction improves smoothly as a function of the parameter count, training time, and amount of training data. Moreover, the asymptote of this performance trend is on at least human performance.

(*) Hence, if we keep scaling up LLMs we will eventually reach human-level performance at next-word prediction, and therefore also reach AGI.

An issue with this argument, as stated, is that GPT-3 is already better than humans at next-word prediction. In fact, so are both GPT-2 and GPT-1 (see this link). The problem, then, is that human-level performance on next-word prediction (in terms of accuracy) is evidently insufficient to attain human-level performance in question answering.

There are at least two ways to amend the argument:

(3) In reaching the limit of performance for next-word prediction, an LLM would invariably develop internal circuits for all (or most) of the tasks of intelligent behaviour, or

(4) the asymptote of LLM performance scaling is high enough to reach AI-complete performance.

Either of these would do. To make the distinction between (3) and (4) more explicit, (4) says that a “saturated” LLM would be so good at next-word prediction that it would be able to do (eg) human-level question answering if that task is reduced to next-word prediction, whereas (3) says that a saturated LLM would contain all the bits and pieces needed to create a strong agentic intelligence. With (3), one would need to extract parts from the final model, whereas with (4), prompting would in theory be enough by itself.

I will now provide some arguments against this view.

General Caution

In this section, I will give a few fairly general arguments for why we should be skeptical of our impressions and our inside views when it comes to AI timelines, especially in the context of LLMs. These arguments are not specifically against the SIAYN hypothesis, but are rather arguments for why we should not be too confident in any hypothesis in the reference class of the SIAYN hypothesis.

1. Pessimistic Meta-Induction

Historically, people have been very bad at predicting AI progress. This goes both for AI researchers guided by inside-view intuitions, and for outsiders relying on outside-view methods. This gives a very general reason for always increasing our model uncertainty quite substantially when it comes to AI timelines.

Moreover, people have historically been bad at predicting AI progress in two different ways; first, people have been bad at estimating the relative difficulty of different problems, and second, people have been bad at estimating the dependency graph for different cognitive capacities. These mistakes are similar, but distinct in some important regards.

The first problem is fairly easy to understand; people often assume that some problem X is easier than some problem Y, when in fact it is the other way around (and sometimes by a very large amount). For example, in the early days of AI, people thought that issues like machine vision and robot motion would be fairly easy to solve, compared to “high-level” problems such as planning and reasoning. As it turns out, it is the other way around. This problem keeps cropping up. For example, a few years ago, I imagine that most people would have guessed that self-driving cars would be much easier to make than a system which can write creative fiction or create imaginative artwork, or that adversarial examples would turn out to be a fairly minor issue, etc. This issue is essentially the same as Moravec’s paradox.

The second problem is that people often assume that in order to do X, an AI system would also have to be able to do Y, when in fact this is not true. For example, many people used to think that if an AI system can play better chess than any human, then it must also be able to form plans in terms of high-level, abstract concepts, such as “controlling the centre”. As it turns out, tree search is enough for super-human chess (a good documentary on the history of computer chess can be found here). This problem also keeps cropping up. For example, GPT-3 has many very impressive abilities, such as the ability to play decent chess, but there are other, simpler-seeming abilities that it does not have, such as the ability to solve a (verbally described) maze, or reverse long words, etc.

There could be many reasons for why we are so bad at predicting AI, some of which are discussed eg here. Whatever the reason, it is empirically very robustly true that we are very bad at predicting AI progress, both in terms of how long it will take for things to happen, and in terms of in what order they will happen. This gives a general reason for more skepticism and more model uncertainty when it comes to AI timelines.

2. Language Invites Mind Projection

Historically, people seem to have been particularly prone to overestimate the intelligence of language-based AI systems. Even ELIZA, one of the first chat bots ever made, can easily give off the impression of being quite smart (especially to someone who does not know anything about how it works), even though it is in reality extremely simple. This also goes for the many, many chat bots that have been made over the years, which are able to get very good scores on the Turing test (see eg this example). They can often convince a lay audience that they have human-level intelligence, even though most of these bots don’t advance the state of the art in AI.

It is fairly unsurprising that we (as humans) behave in this way. After all, in our natural environment, only intelligent things produce language. It is therefore not too surprising that we would be psychologically inclined to attribute more intelligence than what is actually warranted to any system that can produce coherent language. This again gives us a fairly general reason to question our initial impression of the intelligence of a system, when that system is one that we interact with through language.

It is worth looking at some of Gary Marcus’ examples of GPT-3 failing to do some surprisingly simple things.

3. The Fallacy of the Successful First Step

It is a very important fact about AI that a technique or family of techniques can be able to solve some version of a task, or reach some degree of performance on that task, without it being possible to extend that solution to solve the full version of the task. For example, using decision trees, you can get 45 % accuracy on CIFAR-10. However, there is no way to use decision trees to get 99 % accuracy. To give another example, you can use alpha-beta pruning combined with clever heuristics to beat any human at chess. However, there is no way to use alpha-beta pruning combined with clever heuristics to beat any human at go. To give a third example, you can get logical reasoning about narrow domains of knowledge using description logic. However, there is no way to use description logic to get logical reasoning about the world in general. To give a fourth example, you can use CNNs to get excellent performance on the task of recognising objects in images. However, there is (seemingly) no way to use CNNs to recognise events in videos. And so on, and so forth. There is some nice discussion on this issue in the context of computer vision in this interview.

The lesson here is that just because some technique has solved a specific version of a problem, it is not guaranteed to (and, in fact, probably will not) solve the general version of that problem. Indeed, the solution to the more general version of the problem may not even look at all similar to a solution to the smaller version. It seems to me like each level of performance often cuts off a large majority of all approaches that can reach all lower levels of performance (not just the solutions, but the approaches). This gives us yet another reason to be skeptical that any given method will continue to work, even if it has been successful in the past.

Arguments Against the Argument

In this section, I will point out some flaws with the standard argument for the SIAYN hypothesis that I outlined earlier, but without arguing against the SIAYN hypothesis itself.

4. Scaling Is Not All You Need

The argument I gave in the first section is insufficient to conclude that LLM scaling can lead to AGI in any practical sense, at least if we use premise (4) instead of the much murkier premise (3). To see this, note that the argument also applies to a few extremely simple methods that definitely could not be used to build AGI in the real world. For example, suppose we have a machine learning method that works by saving all of its training data to a lookup table, and at test time gives a uniform prediction for any input that is not in the lookup table, and otherwise outputs the entry in the table. If some piece of training data can be associated with multiple labels, as is the case with next-word prediction, then we could say that the system outputs the most common label in the training data, or samples from the empirical distribution. If this system is used for next-word prediction, then it will satisfy all the premises of the argument in the first section. Given a fixed distribution over the space of all text, if this system is given ENOUGH training data and ENOUGH parameters, then it will EVENTUALLY reach any degree of performance that you could specify, all the way down to the inherent entropy of the problem. It therefore satisfies premise (2), so if (1) and (4) hold too, then this system will give you AGI, if you just pay enough. However, it is clear that this could not give us AGI in the real world.

This somewhat silly example points very clearly at the issue with the argument in the first section: the point cannot be that LLMs “eventually” reach a sufficiently high level of performance, because so would the lookup table (and decision trees, and Gaussian processes, and so on). To have this work in practice, we additionally need the premise that LLMs will reach this level of performance after a practical amount of training data and a practical amount of compute. Do LLMs meet this stricter condition? That is unclear. We are not far from using literally all text data in existence to train them, and the training costs are getting quite hefty too.
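The lookup-table learner described in this section can be sketched in a few lines. This is only an illustrative toy: the class name, the fixed context window, and the method names are assumptions made for the sketch, not part of any existing system.

```python
from collections import Counter, defaultdict


class LookupTablePredictor:
    """Toy next-word predictor that only memorises its training data."""

    def __init__(self, context_size=2):
        # context_size is an assumption: how many previous tokens form the key.
        self.context_size = context_size
        self.table = defaultdict(Counter)  # context -> counts of next tokens
        self.vocab = set()

    def train(self, tokens):
        """Store every (context, next token) pair seen in the training text."""
        self.vocab.update(tokens)
        n = self.context_size
        for i in range(n, len(tokens)):
            context = tuple(tokens[i - n:i])
            self.table[context][tokens[i]] += 1

    def distribution(self, context):
        """Empirical distribution for seen contexts, uniform otherwise."""
        key = tuple(context[-self.context_size:])
        if key in self.table:
            counts = self.table[key]
            total = sum(counts.values())
            return {token: c / total for token, c in counts.items()}
        if not self.vocab:
            return {}
        return {token: 1 / len(self.vocab) for token in self.vocab}

    def predict(self, context):
        """Most likely next token (ties are broken arbitrarily)."""
        dist = self.distribution(context)
        if not dist:
            return None
        return max(dist, key=dist.get)
```

With more training text, the table covers more contexts and the predictions approach the empirical distribution, which is exactly why "it eventually gets there" says nothing about whether the amount of data required is practical.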

5. Things Scale Until They Don’t

Suppose that we wish to go to the moon, but we do not have the technology to do so. The task of getting to the moon is of course a matter of getting sufficiently high up from the ground. Now suppose that a scientist makes the following argument; ladders get you up from the ground. Moreover, they have the highly desirable property that the distance that you get from the ground scales linearly in the amount of material that you use to construct the ladder. Getting to the moon will therefore just be a matter of a sufficiently large project investing enough resources into a sufficiently large ladder.

Suppose that we wish to build AI, but we do not have the technology to do so. The task of building AI is of course a matter of creating a system that knows a sufficiently large number of things, in terms of facts about the world, ways to learn more things, and ways to attain outcomes. Suppose someone points out that all of these things can be encoded in logical statements, and that the more logical statements you encode, the closer you get to the goal. Getting to AI will therefore just be a matter of a sufficiently large project investing enough resources into encoding a sufficiently large number of facts in the form of logical statements.

And so on.

6. Word Prediction is not Intelligence

Here, I will give a few arguments against premise/assumption (3): that in reaching the limit of performance for next-word prediction, an LLM would invariably develop internal circuits for all (or most) of the tasks of intelligent behaviour. The kinds of AI systems that we are worried about are the kinds of systems that can do original scientific research and autonomously form plans for taking over the world. LLMs are trained to write text that would be maximally unsurprising if found on the internet. These two things are fundamentally not the same thing. Why, exactly, would we expect that a system that is good at the latter necessarily would be able to do the former? Could you get a system that can bring about atomically precise manufacturing, Dyson spheres, and computronium, from a system that has been trained to predict the content (in terms of the exact words used) of research papers found on the internet? Could such a system design new computer viruses, run companies, plan military operations, or manipulate people? These tasks are fundamentally very different. If we make a connection between the two, then there could be a risk that we are falling victim to one of the issues discussed in point 1. Remember: historically, people have often assumed that an AI system that can do X must be able to do Y, but then turned out to be wrong. What gives us a good reason to believe that this is not one of those cases?

Direct Counterarguments

Here, I give some direct arguments against the SIAYN hypothesis, ignoring the arguments in favour of the SIAYN hypothesis.

7. The Language of Thought

This is an argument first made by the philosopher, linguist, and cognitive scientist Jerry Fodor, and was originally applied to the human brain. However, the argument can be applied to AI systems as well.

An intelligent system which can plan and reason must have a data structure for representing facts about and/or states of the world. What can we say about the nature of this data structure? First, this data structure must be able to represent a lot of things, including things that have never been encountered before (both evolutionarily, and in terms of personal experience). For example, you can represent the proposition that there are no elephants on Jupiter, and the proposition that Alexander the Great never visited a McDonald’s restaurant, even though you have probably never encountered either of these propositions before. This means that the data structure must be very productive (which is a technical term in this context). Second, there are certain rules which say that if you can represent one proposition, then you can also represent some other proposition. For example, if you can represent a blue block on top of a red block, then you can also represent a red block on top of a blue block. This means that the data structure also must be systematic (which is also a technical term).

What kinds of data structures have these properties? The answer, according to Fodor, is that it is data structures with a combinatorial syntax and compositional semantics. In other words, it is data structures where two or more representations can be combined in a syntactic structure to form a larger representation, and where the semantic content of the complex representation can be inferred from the semantic content of its parts. This explains both productivity and systematicity. The human brain (and any AI system with the intelligence of a human) must therefore be endowed with such a data structure for representing and reasoning about the world. This is called the “language of thought” (LoT) hypothesis, because languages (including logical languages and programming languages) have this structure. (But, importantly, the LoT hypothesis does not say that people literally think in a language such as English, it just says that mental representations have a “language like” structure.)
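To make "combinatorial syntax and compositional semantics" a bit more concrete, here is a minimal sketch of such a data structure, using the block example from above. The type names `Block` and `On` and the helper functions are hypothetical, chosen only for illustration; this is not a claim about how the brain or any AI system actually represents things.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Block:
    """An atomic representation: a block of some colour."""
    colour: str


@dataclass(frozen=True)
class On:
    """A complex representation built from two smaller ones."""
    top: "Term"
    bottom: "Term"


Term = Union[Block, On]


def describe(term: Term) -> str:
    """Compositional semantics: the meaning of the whole is computed
    from the meanings of its parts and the way they are combined."""
    if isinstance(term, Block):
        return f"a {term.colour} block"
    return f"{describe(term.top)} on top of {describe(term.bottom)}"


def swap(term: On) -> On:
    """Systematicity: anything that can represent On(a, b) can,
    by construction, also represent On(b, a)."""
    return On(term.bottom, term.top)
```

Productivity falls out of the same design: `On` can be nested to any depth, so `On(Block("green"), On(Block("blue"), Block("red")))` is a well-formed representation even if nothing like it has been built before.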

This, in turn, suggests a data structure that is discrete and combinatorial, with syntax trees, etc, and neural networks (according to the argument) do not use such representations. We should therefore expect neural networks to at some point hit a wall or limit to what they are able to do.

I am personally fairly confused about what to think of this argument. I find it fairly persuasive, and often find myself thinking back to it. However, the conclusion of the argument also seems very strong, in a suspicious way. I would love to see more discussion and examination of this.

8. Programs vs Circuits

This point will be similar to point 7, but stated somewhat differently. In short, neural network models are like circuits, but an intelligent system would need to use hypotheses that are more like programs. We know, from computer science, that it is very powerful to be able to reason in terms of variables and operations on variables. It seems hard to see how you could have human-level intelligence without this ability. However, neural networks typically do not have this ability, with most neural networks (including fully connected networks, CNNs, RNNs, LSTMs, etc) instead being more analogous to Boolean circuits.

This being said, some people have said that transformers and attention models are getting around this limitation, and are starting to reason more in terms of variables. I would love to see more analysis of this as well.

As a digression, it is worth noting that symbolic program induction style machine learning systems, such as those based on inductive logic programming, typically have much, much stronger generalisation than deep learning, from a very small number of data points. For example, you might be able to learn a program for transforming strings from ~5 training examples. It is worth playing around a bit with one of these systems, to see this for yourself. An example of a user friendly version is available here. Another example is the auto-complete feature in Microsoft Excel.
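To give a flavour of what learning a string-transforming program from a handful of examples can look like, here is a deliberately tiny sketch. It is a brute-force enumeration over a made-up set of primitives, not inductive logic programming and not the method used by any of the tools mentioned above; the primitive set and function names are assumptions for illustration.

```python
from itertools import product

# A made-up domain-specific language of string operations.
PRIMITIVES = {
    "lower": str.lower,
    "upper": str.upper,
    "title": str.title,
    "reverse": lambda s: s[::-1],
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "last_word": lambda s: s.split()[-1] if s.split() else "",
    "initial": lambda s: s[:1],
}


def run(program, text):
    """Apply a sequence of primitive names to the text, left to right."""
    for name in program:
        text = PRIMITIVES[name](text)
    return text


def synthesise(examples, max_length=3):
    """Return the first shortest program consistent with every example."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None
```

Given three examples such as `("ada lovelace", "LOVELACE")`, `("alan turing", "TURING")` and `("grace hopper", "HOPPER")`, `synthesise` returns a two-step program, and because that program is stated in terms of operations rather than particular strings, it carries over to names it has never seen.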

9. Generalisation vs Memorisation

This point has also already been alluded to, in points 4, 7, and 8, but I will here state it in a different way. There is, intuitively, a difference between memorisation and understanding, and this difference is important. By “memorisation”, I don’t mean using a literal lookup table, but rather something that is somewhat more permissive. I will for now not give a formal definition of this difference, but instead give a few examples that gesture at the right concept.

For my first example, consider how a child might learn to get a decent score on an arithmetic test by memorising a lot of rules that work in certain special cases, but without learning the rules that would let it solve any problem of arithmetic. For example, it might memorise that multiplication by 0 always gives 0, that multiplication by 1 always gives the other number, that multiplication of a single-digit integer by 11 always gives the integer repeated twice, and so on. There is, intuitively, an important sense in which such a child does not yet understand arithmetic, even though they may be able to solve many problems.

For my second example, I would like to point out that a fully connected neural network cannot learn a simple identity function in a reasonable way. For example, suppose we represent the input as a bitstring. If you try to learn this function by training on only odd numbers then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only numbers in a certain range then the network will not robustly generalise outside this range. This is because a pattern such as “the n’th input neuron is equal to the n’th output neuron” lacks a simple representation in a neural network. This means that the behaviour of a fully connected network, in my opinion, is better characterised as memorisation than understanding when it comes to learning an identity function. The same goes for the function that recognises palindromes, and so on. This shows that knowing whether or not a network is able to express and learn a given function is insufficient to conclude that it would be able to understand it. This issue is also discussed in eg this paper.
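The odd/even case can be reproduced with a very small experiment. The sketch below makes several simplifying assumptions that are not in the discussion above: it uses a single fully connected layer with sigmoid outputs (no hidden layer), 6-bit inputs, zero initialisation, and plain per-example gradient updates. It is only meant to show the mechanism for the last bit, not to stand in for a realistic network.

```python
import math

N_BITS = 6  # assumed width of the bitstring encoding


def to_bits(n):
    """Encode n as a list of bits, most significant bit first."""
    return [(n >> i) & 1 for i in reversed(range(N_BITS))]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(weights, biases, x, j):
    """Activation of output unit j for input bits x."""
    return sigmoid(biases[j] + sum(w * xi for w, xi in zip(weights[j], x)))


def train(inputs, epochs=50, lr=0.5):
    """Train every output unit to copy its corresponding input bit."""
    # weights[j][i] connects input bit i to output bit j (fully connected).
    weights = [[0.0] * N_BITS for _ in range(N_BITS)]
    biases = [0.0] * N_BITS
    for _ in range(epochs):
        for x in inputs:
            for j in range(N_BITS):
                error = x[j] - output(weights, biases, x, j)  # target = input
                biases[j] += lr * error
                for i in range(N_BITS):
                    weights[j][i] += lr * error * x[i]
    return weights, biases


def predict(model, x):
    """Threshold each output unit at 0.5."""
    weights, biases = model
    return [int(output(weights, biases, x, j) > 0.5) for j in range(N_BITS)]
```

Training with `train([to_bits(n) for n in range(1, 64, 2)])` only ever shows the model inputs whose last bit is 1, so the last output unit is only ever pushed towards 1. On every even number, `predict` then still reports a last bit of 1: nothing in the training signal distinguishes "copy the last input bit" from "always output 1".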

For my third example, I would like to bring up that GPT-3 can play chess, but not solve a small, verbally described maze. You can easily verify this yourself. This indicates that GPT-3 can play chess just because it has memorised a lot of cases, rather than learnt how to do heuristic search in an abstract state space.

For my fourth example, the psychologist Jean Piaget observed that children that are sufficiently young consistently do not understand conservation of mass. If you try to teach such a child that mass is conserved, then they will under-generalise, and only learn that it holds for the particular substance and the particular containers that you used to demonstrate the principle. Then, at some point, the child will suddenly gain the ability to generalise to all instances. This was historically used as evidence against Skinnerian psychology (aka the hypothesis that humans are tabula rasa reinforcement learning agents).

These examples all point to a distinction between two modes of learning. It is clear that this distinction is important. However, the abstractions and concepts that we currently use in machine learning make it surprisingly hard to point to this distinction in a clear way. My best attempt at formalising this distinction in more mathematical terms (off the top of my head) is that a system that understands a problem is able to give (approximately) the right output (or, perhaps, a “reasonable” output) for any input, whereas a system that has memorised the problem only gives the right output for inputs that are in the training distribution. (But there are also other ways to formalise this.)
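In symbols, this attempt could be sketched as follows. The notation is introduced here purely for illustration: $f$ is the learned function, $f^*$ the target function, $X$ the full input space, $D$ the training distribution, $d$ some distance on outputs, and $\varepsilon, \delta$ small tolerances.

```latex
\text{in-distribution accuracy:}\quad
\Pr_{x \sim D}\big[\, d\big(f(x), f^*(x)\big) \le \varepsilon \,\big] \ge 1 - \delta
\qquad\qquad
\text{understanding:}\quad
\forall x \in X:\ d\big(f(x), f^*(x)\big) \le \varepsilon
```

On this reading, a system that has merely memorised the problem satisfies the first condition but not the second, whereas a system that understands the problem satisfies both.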

The question, then, is whether LLMs do mostly memorisation, or mostly understanding. To me, it seems as though this is still undecided. I should first note that a system which has been given such an obscenely large amount of training data as GPT-3 will be able to exhibit very impressive performance even if much of what it does is more like memorisation than understanding. There is evidence in both directions. For example, the fact that it is possible to edit an LLM to make it consistently believe that the Eiffel Tower is in Rome is evidence that it understands certain facts about the world. However, the fact that GPT-3 can eg play chess, but not solve a verbally described maze, is evidence that it relies on memorisation as well. I would love to see a more thorough analysis of this.

As a slight digression, I currently suspect that this distinction might be very important, but that current machine learning theory essentially misses it completely. My characterisation of “understanding” as being about off-distribution performance already suggests that the supervised learning formalism in some ways is inadequate for capturing this concept. The example with the fully connected network and the identity function also shows the important point that a system may be able to express a function, but not “understand” that function.

10. Catastrophic Forgetting

Here, I just want to add the rather simple point that we currently cannot actually handle memory and dynamism in a way that seems to be required for intelligence. LLMs are trained once, on a static set of data, and after their training phase, they cannot commit new knowledge to their long-term memory. If we instead try to train them continuously, then we run into the problem of catastrophic forgetting, which we currently do not know how to solve. This seems like a rather important obstacle to general intelligence.
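The basic mechanism behind catastrophic forgetting can be seen even in a two-parameter linear model. The sketch below is a toy under stated assumptions (a model `w1 * x1 + w2 * x2`, two hand-picked one-example "tasks", and plain gradient descent); it says nothing about the scale of the problem in LLMs, only about why sequential training can overwrite earlier knowledge even when a solution to both tasks exists.

```python
def train(weights, examples, steps=200, lr=0.1):
    """Gradient descent on squared error for the model w1*x1 + w2*x2."""
    w1, w2 = weights
    for _ in range(steps):
        for (x1, x2), y in examples:
            error = (w1 * x1 + w2 * x2) - y
            w1 -= lr * error * x1
            w2 -= lr * error * x2
    return (w1, w2)


def loss(weights, examples):
    """Mean squared error of the model on a set of examples."""
    w1, w2 = weights
    return sum(((w1 * x1 + w2 * x2) - y) ** 2
               for (x1, x2), y in examples) / len(examples)


# Two hypothetical tasks that share the weight w1.
# Both are solved at once by w1 = 0, w2 = 2.
TASK_A = [((1.0, 1.0), 2.0)]
TASK_B = [((1.0, 0.0), 0.0)]

after_a = train((0.0, 0.0), TASK_A)              # learns task A
after_b = train(after_a, TASK_B)                 # then only sees task B
joint = train((0.0, 0.0), TASK_A + TASK_B, steps=1000)  # sees both together
```

After training on task A alone, the loss on A is close to zero. Continuing on task B alone drives `w1` towards zero and leaves `w2` untouched, so the loss on A goes back up to roughly 1, even though training on both tasks together finds weights that solve both.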

Closing Words

In summary, there are several good arguments against the SIAYN hypothesis. First, there are several reasons to have high model uncertainty about AI timelines, even in the presence of strong inside-view models. In particular, people have historically been bad at predicting AI development, have historically had a tendency to overestimate language-based systems, and have failed to account for the fallacy of the successful first step. Second, the argument that is most commonly used in favour of the SIAYN hypothesis fails, at least in the form that it is most often stated. In particular, the simple version of the scaling argument leaves out the scaling rate (which is crucial), and there are reasons to be skeptical that scaling will continue indefinitely, and that next-token prediction would give rise to all important cognitive capacities. Third, there are also some direct reasons to be skeptical of the SIAYN hypothesis (as opposed to the argument in favour of the SIAYN hypothesis). Many of these arguments amount to arguments against deep learning in general.

In addition to all of these points, I would also like to call attention to some of the many “simple” things that GPT-3 cannot do. Some good examples are available here, and other good examples can be found in many places on the internet (eg here). You can try these out for yourself, and see how they push your intuitions.

I should stress that I don’t consider any of these arguments to strongly refute either the SIAYN hypothesis, or short timelines. I personally default to a very high-uncertainty model of AI timelines, with a decent amount of probability mass on both the short timeline and the long timeline scenario. Rather, my reason for writing this post is just to make some of these arguments better known and easier to find for people in the AI safety community, so that they can be used to inform intuitions and timeline models.

I would love to see some more discussion of these points, so if you have any objections, questions, or additional points, then please let me know in the comments! I am especially keen to hear additional arguments for long timelines.