Why Copilot Accelerates Timelines

“Say we have intelligences that are narrowly human /​ superhuman on every task you can think of (which, for what it’s worth, I think will happen within 5-10 years). How long before we have self-replicating factories? Until foom? Until things are dangerously out of our control? Until GDP doubles within one year? In what order do these things happen?” (source)

When discussing Takeoff Speeds, I feel the debate often gets stuck in some kind of false dichotomy between Fast and Slow, where the crux seems to be about whether some self-improving AI would be able to foom without human assistance.

Instead, we could get a Moderate Takeoff (think months or years), where AI does not self-improve (by itself). Instead, there would be a reinforcing feedback loop where progress in AI leads to AI becoming increasingly useful to make progress in AI, with humans in the loop at all times.

On top of that, things might just happen privately at some AI lab for a few months until the AI is able to foom by itself, which will look like foom for everyone outside that lab.

AI Helping Humans with AI

In Superintelligence, takeoff is defined as the period between AGI and superintelligence. In this post, I will use as the takeoff’s starting point the first “AI Helping Humans with AI” (in a meaningful way), or AIHHAI for short, since it will arise before fully general intelligence and will accelerate AI progress. Here are some examples of what I have in mind for “helping humans in a meaningful way”:

  • GPT-N that you can prompt with “I am stuck with this transformer architecture trying to solve problem X”. GPT-N would be AIHHAI if it answers along the lines of “In this arXiv article, they used trick Z to solve problems similar to X. Have you considered implementing it?”, and using an implementation of Z would solve X >50% of the time.

  • Another example would be if some code generation tool like Copilot makes ML engineers substantially more productive in writing ML code. Making predictions about productivity is tricky and hard to measure, but it would involve something like engineers accepting code suggestions far more often, say 100x more often than engineers using Copilot currently do.

(Sidenote: My general impression from people using Copilot is that they believe it is becoming increasingly useful and use it on a daily basis, though it rarely auto-completes the right line of code right away. Given that we got Codex/Copilot last year, and that Sam Altman hinted at new Codex capabilities in his ACX Q&A[2], I think we will get some impressive release of Copilot/Codex sometime this year that most engineers will want to use. (Similar to how many developers rely on Visual Studio’s suite, especially IntelliSense.) The model I have in mind for “AI helping humans with AI” could be that one, though it will probably require one or two more iterations.)

Moderate Takeoff

A Moderate Takeoff is defined[1] as “one that occurs over some intermediary temporal interval, such as months or years”. For AIHHAI, we can distinguish two cases:

  1. AIHHAI is developed by some AI lab working on it privately. That lab has a lead compared to the other labs, since they are working more productively using AIHHAI. Thus, they might reach superintelligence first, without allowing enough time for the rest of the world to compete.

  2. AIHHAI is made public, or quickly (think months) reproduced by others publicly or privately. In any case, some AIHHAI model is eventually made public, and there is not only one group using AIHHAI—other companies are adopting the same strategy (multipolar scenario).

For the first case, you can think of OpenAI using a new version of Copilot internally that enables their team to quickly build another, even better version of Copilot without releasing the intermediate stage publicly. They would already have a couple of months of lead time in terms of engineering, and after using their latest version, that lead (in terms of how long it would take a competitor to catch up using publicly available tools) would keep increasing due to the compounding advantage.

For the second case, you could consider a similar scenario where they do a public release, or other labs like Google Research just build something equivalent a couple of months later. Even if it is not public, the thing might be so impressive that employees talk about it to their friends or anonymously, and the news eventually gets leaked. In that regime, you get many companies expanding capabilities in Code Generation, possibly by scaling models and datasets aggressively. The AI race becomes an engineering race, though we might need more scientific breakthroughs for scaling laws to continue (for more on this, see the section “Why Code Generation might Plateau Instead” at the end).

It is unclear to me which case is more likely. On the one hand, the usefulness of AIHHAI would cause some rapid self-improvement of the system {Humans developing the AI + AI}, and the pace could be so quick that the model would not have time to leak. On the other hand, the more exciting the results, the more likely the news is to leak, and the more likely other players are to start (or are already) investing heavily in similar models.

Self-improving (Humans + AI)

One thing I have not seen discussed a lot is how the system “humans + AI” could have different takeoff speeds, where, for this hybrid system, takeoff would basically mean “going from {human + AIHHAI} to {human + superintelligent AI}”.

Note that there are many such systems we could study, such as:

  • System 1. {All humans + all used resources, including ML models and science}

  • System 2. {All humans working on code generation + all resources they use}

  • System 3. {Employees working on Copilot-(K+1)/GPT-(N+1) + Copilot-K/GPT-N}

The importance of closed systems

Thinking about how “closed” or “small” a system is helps us to understand its kinetics, and also has some implications regarding AI races.

Indeed, a small system using only its own output as input could independently foom, without encountering major bottlenecks. Conversely, if your system requires a lot of insights from mathematics or engineering to overcome bottlenecks, foom becomes less likely. However, a smaller system with fewer humans also has fewer resources, labor-wise.

With regard to races, if your system does not require the output of other disciplines to make progress, you could keep it private for longer. (If the system required a lot of insights, publishing preliminary results about the system could prove necessary to get the outside world to publish research relevant to your system.) In practice:

  • System 1 is a closed system. Thinking about how fast it would improve basically brings us back to “when will GDP double in a year” territory. The abstraction is not precise enough to give insights about kinetics without basically studying macro-economics.

  • System 2 is not closed, since it actually uses insights from other disciplines in CS, math, and elsewhere as inputs. That said, the research in code generation directly helps the humans doing work relevant to code generation (assuming they use it).

  • System 3 would also definitely need to take research and tools from somewhere else as inputs, though you could assume that, as N gets bigger, most of the insights on how to debug deep learning models would actually be fed into Copilot-N’s training data via telemetry (or would be accessible via GPT-N’s Q&A interface).

Among the systems presented above, System 3 could experience exponential self-improvement in complete stealth mode and is therefore worth studying in more detail.

Self-improving Code Generation

I am especially interested in System 3 (i.e. “Employees working on Copilot-(K+1)/GPT-(N+1) + Copilot-K/GPT-N”), because progress in AIHHAI straightforwardly leads to productivity increases in developing AIHHAI.

Let’s imagine that Copilot-N successfully auto-completes 1% of code lines, and for the sake of argument people immediately press “Tab” to move to the next line in those cases. Setting aside the fact that the auto-completed lines would tend to be the easiest parts of the developer’s pre-existing workload, this would make developers roughly 1% more productive.

You would get a 1.01x productivity multiplier, which would make development 1.01x faster, in particular the development of Copilot-(N+1), which would in turn auto-complete, say, 2% of lines instead of the 1% we started with, and so on.
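
To make the compounding concrete, here is a minimal toy model of that loop. All the numbers (the starting 1% auto-complete rate, the 12-month development cycle, the assumption that each generation doubles the auto-complete rate) are illustrative assumptions, not forecasts.

```python
# Toy model of the {humans + Copilot} feedback loop sketched above.
# All constants are illustrative assumptions, not empirical estimates.

def takeoff_toy_model(generations=8,
                      base_autocomplete=0.01,  # Copilot-N auto-completes 1% of lines
                      base_dev_time=12.0):     # months to build the next version, unaided
    """Simulate compounding productivity across Copilot generations."""
    autocomplete = base_autocomplete
    elapsed = 0.0
    for n in range(generations):
        productivity = 1.0 / (1.0 - autocomplete)  # 1% auto-complete ~ 1.01x speed-up
        dev_time = base_dev_time / productivity    # faster development of Copilot-(N+1)
        elapsed += dev_time
        print(f"Copilot generation {n}: autocomplete={autocomplete:.0%}, "
              f"speed-up={productivity:.2f}x, next version in {dev_time:.1f} months")
        # Strong assumption: each generation doubles the auto-complete rate,
        # mirroring the 1% -> 2% step in the text (capped below 100%).
        autocomplete = min(2 * autocomplete, 0.99)
    print(f"Total elapsed: {elapsed:.1f} months")

takeoff_toy_model()
```

Under these (very strong) assumptions, the speed-up stays marginal for the first few generations and only then compounds sharply, which is the qualitative shape of the moderate-takeoff story: months or years of visible but unimpressive gains before anything dramatic.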

Obviously, the Copilot we have right now is still pretty rudimentary. It is mostly useful for beginners to an API or language, not for doing cutting edge PyTorch development. And you could say that a lot of ML work is done outside of coding, like reading papers and building infrastructure. (More on this in my Codex Skeptic FAQ).

I agree that improvements in productivity from AI are currently marginal, though one should consider what those improvements might be for future versions, including things like a question-answering GPT-N helping to debug high-level problems. It is also important to keep in mind that engineers from many different fields are currently using Copilot regularly, and some could benefit more from code generation than ML engineers do (think web programmers). Those engineers would in turn accelerate GDP growth, which would increase the total amount of investment in AI.

How that would actually lead to Foom

When we get from marginal improvements in Code Generation to some Q&A language model that helps you reconsider your transformer architecture, the productivity gains will start to be more substantial.

Assuming we are in the scenario where one company (think OpenAI) has access to increasingly better code generation tools (that no one else has access to), and possibly also some lead in terms of useful language models to debug their tools, they might get a bigger and bigger lead in how useful their AIHHAI is.

At some point, you would be able to ask it more open-ended questions, solving harder and harder tasks, for complex things like making money in financial markets or setting strategy for the entire company. In a matter of months, the company would achieve extraordinary economic output, re-investing everything into AIHHAI.

Eventually, the AIHHAI would be optimizing developer productivity over some time horizon, not just completing the next line of code. When something like planning is implemented (e.g. expected reward maximization), the AIHHAI might just Foom by modifying its own code to generate better code.
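
To illustrate the shift this paragraph is pointing at (purely a conceptual sketch, not a description of how Copilot or any real system works), compare picking the most likely next line with picking the suggestion that maximizes an estimated longer-horizon reward. The estimate_long_horizon_reward function below is a hypothetical stand-in for whatever productivity signal such a planner would optimize.

```python
# Conceptual contrast between next-line completion and horizon-based planning.
# `estimate_long_horizon_reward` is a hypothetical stand-in, not a real API.

def pick_by_likelihood(candidates):
    """Today's regime: return the single most likely next line."""
    return max(candidates, key=lambda c: c["log_likelihood"])

def estimate_long_horizon_reward(candidate):
    """Hypothetical expected productivity gain over some time horizon."""
    # In a planning regime this would come from a learned value model;
    # here it is just a placeholder field on the candidate.
    return candidate["expected_productivity_gain"]

def pick_by_planning(candidates):
    """Speculative future regime: maximize expected reward over a horizon."""
    return max(candidates, key=estimate_long_horizon_reward)

suggestions = [
    {"code": "x = cached_result",        "log_likelihood": -0.5, "expected_productivity_gain": 0.1},
    {"code": "refactor_training_loop()", "log_likelihood": -3.0, "expected_productivity_gain": 2.0},
]
print(pick_by_likelihood(suggestions)["code"])  # the locally obvious completion
print(pick_by_planning(suggestions)["code"])    # what a horizon-optimizer might prefer
```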

Why Code Generation might Plateau instead

As Kaplan mentions in his recent talk about the implications of Scaling Laws for Code Generation, current progress is bottlenecked by:

  1. Data available. If you remove duplicates, you have about 50B tokens of Python code on GitHub. In comparison, GPT-3 was trained on about 300B tokens. You could possibly do data augmentation or transfer learning to bypass this problem. Kaplan also guesses that the AlphaCode researchers were bottlenecked by dataset size when scaling things up. On top of that, the Chinchilla paper shows that scaling data about as much as model size is also necessary for compute-optimal training.

  2. Writing longer programs. Assuming you have a constant error rate when writing your program token by token, you get an exponential decay in how likely your program is to solve the problem (a back-of-the-envelope illustration follows below). (They tested this by asking a model to write longer programs doing essentially the same thing, and they got an exponentially worse “pass rate”.) Therefore, asking Codex to write very long programs might plateau even when scaling models, at least with our current methods. (Kaplan mentions that a workable method would probably mimic what humans do, i.e. writing bad code and iterating until it works, instead of asking the model to write one long piece of code in a single shot.)
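
To see the second bottleneck numerically, here is a minimal sketch assuming a constant, independent per-token error rate (the error rates below are made up, chosen only to show the shape of the decay): if each token is correct with probability 1 − ε, a program of L tokens is fully correct with probability (1 − ε)^L, which shrinks exponentially with L.

```python
# Pass rate under a constant per-token error rate: (1 - eps) ** n_tokens.
# The error rates below are made up; only the exponential shape matters.

def pass_rate(per_token_error, n_tokens):
    """Probability that an n_tokens-long program is correct end to end."""
    return (1.0 - per_token_error) ** n_tokens

for eps in (0.001, 0.01):
    for n in (10, 100, 1000):
        print(f"per-token error={eps:.1%}  tokens={n:4d}  pass rate={pass_rate(eps, n):.3f}")
```

Even a 0.1% per-token error rate leaves only about a third of 1000-token programs fully correct under this assumption, which is why an iterate-and-fix workflow looks more promising than single-shot generation of long programs.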

Conclusion

  1. Moderate Takeoffs (think months) are a useful abstraction to think about scenarios between Foom and Slow Takeoffs (years, decades).

  2. When discussing Takeoff speed, it is worth noting that progress can be heterogeneous between what happens privately and publicly, especially as we get closer to superintelligence. This is especially true considering that humans will be using the AIs they develop to build AI even faster.

  3. More generally, discussion on Takeoff Speed has historically focused on whether an AI would be able to Foom, when in practice there will be an intermediate regime where the system {the humans building the AI + the AI} will self-improve, not the AI by itself.

  4. Even if this intermediate regime might, through compounding progress, lead to Foom, our current understanding of scaling laws predicts that we will soon be bottlenecked by dataset size and by the length of programs that models can reliably generate.

(Acknowledgements: thanks to the edits suggested by Justis, facilitated by Ruby.)

  1. ^

    (Superintelligence, Bostrom) Chapter 4.

  2. ^

    Sam did a Q&A for an Astral Codex Ten meetup in September 2021. I will not be linking to the post doing a recap of what he said since it was taken down from LW.

    Sam’s take on Codex was summarized in the post as: a) current codex is bad compared to what they will have next b) they are making fast progress c) Codex is <1y away from having a huge impact on developers.