I agree that it seems best for people to define the concept whenever they use it.
Instead of tabooing it, we could declare a canonical definition. I think the best candidate is something like: there is a compute overhang to the extent that the largest training runs could quickly be scaled up.
This proposal has little to do with hard vs. soft takeoff, which (IIRC) was the context in which Bostrom used “hardware overhang” in Superintelligence.
One thing that made the discussion confusing is that Bostrom originally discussed hard vs. soft takeoff as having relevance only after we build AGI, whereas Paul Christiano’s view on soft takeoff introduced the idea that “takeoff” already starts before AGI.
This made me think that it could be useful to distinguish between “post-AGI” and “pre-AGI” compute overhangs. It could go as follows:
Pre-AGI compute overhang:
There’s a pre-AGI compute overhang to the degree that the following could happen: we invent an algorithm that will get us to AGI before we scale up training runs to the biggest attainable sizes (on some short timescale).
So, on this definition, there are two ways in which we might already be in a pre-AGI compute overhang:
(1) Timelines are very short and we could get AGI with “current algorithms” (not necessarily GPT_n with zero tweaks, but obvious things to try that require no special insight) with less scaling effort than a Manhattan project.
(2) We couldn’t get AGI with current algorithms via any less-than-maximal scaling effort (and maybe not even with a maximal one – that part isn’t relevant for the claim), but there are highly significant algorithmic insights waiting for us (that we have a realistic chance of discovering). Once we incorporate these insights, we’ll be in the same situation as described in (1).
Post-AGI compute overhang:
Once we build AGI with realistic means, using far-from-optimal algorithms, how much room is there for it to improve its algorithms during “takeoff”/intelligence explosion? “Post-AGI compute overhang” here describes the gap between compute used to build AGI in the first place vs. more efficient designs that AI-aided progress could quickly discover.
[Edit: Correction: “Post-AGI compute overhang” here describes the gap in “intelligence” of the first AGI vs. the “intelligence” of a more efficient design (using the same amount of training compute as that first AGI) that AI-aided progress could quickly discover.]
On that definition, it’s actually quite straightforward that shorter timelines imply a smaller compute overhang (so maybe that’s what Sam Altman meant here).
Yeah, these seem like useful concepts in some contexts too.
I don’t understand this sentence:
“Post-AGI compute overhang” here describes the gap between compute used to build AGI in the first place vs. more efficient designs that AI-aided progress could quickly discover.
It’s the gap between the training compute of ‘the first AGI’ and what?
Oh, yeah, I butchered that entire description.
What I had in mind was something like the gap between how much “intelligence” humans get from the compute they first build AGI with vs. how much “intelligence” AGI will get out of that same compute, once it optimizes software progress for a few iterations.
So, the “gap” is a gap of intelligence rather than compute, but it’s “intelligence per specified quantity of compute.” (And that specified quantity is how much compute we used to build AGI in the first place.)
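To make the “intelligence per specified quantity of compute” framing concrete, here is a toy numerical sketch. Every number in it is made up for illustration, and the log-scaling capability model is an assumption introduced here, not a claim from the thread:

```python
# Toy sketch (all numbers hypothetical): "post-AGI compute overhang" as the
# capability gap at a FIXED training-compute budget, comparing algorithms
# humans found vs. algorithms after AI-aided software progress.
import math

def capability(compute_flop: float, algorithmic_efficiency: float) -> float:
    """Assumed toy model: capability scales with the log of effective compute."""
    return math.log10(compute_flop * algorithmic_efficiency)

AGI_TRAINING_COMPUTE = 1e26  # FLOP used to build the first AGI (made-up number)

# Capability with the human-discovered algorithms that produced the first AGI:
human_algos = capability(AGI_TRAINING_COMPUTE, algorithmic_efficiency=1.0)

# Suppose a few iterations of AI-aided algorithmic progress yield a 100x
# efficiency gain (pure assumption for illustration):
ai_improved = capability(AGI_TRAINING_COMPUTE, algorithmic_efficiency=100.0)

# The overhang is a gap in "intelligence" at the same compute, not a gap in compute:
overhang = ai_improved - human_algos
print(overhang)  # ~2 orders of magnitude of effective compute
```

The point of holding `AGI_TRAINING_COMPUTE` fixed in both calls is exactly the definition above: the gap is measured in capability per that specified quantity of compute.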
whereas Paul Christiano’s view on soft takeoff introduced the idea that “takeoff” already starts before AGI.
I came up with a term I hope to see used for this: “Singularity Criticality.” In my mind I see plutonium starting to glow as it edges over the line to critical mass.
What causes this is that AGI is not really a singleton; it is an integrated set of separate components that individually handle different elements of the AGI’s cognition. Note that even “AGI from scaled-up LLMs” will still have multiple components: multiple buffers, specialized vision and motion-planning modules, long-term memory storage, tool modules, and so on.
As a result, long before we know how to build the integrated system, we will have separate “AGI grade” components, and this is the present reality. We have many RL agents that are superhuman in ability and thus AGI grade.
Using those components we can automate/accelerate some of the tasks needed to reach AGI, so progress accelerates even without AGI existing. The existence of pre-AGI proof-of-concept modules also drives up human effort, financial investment, and production of compute hardware.
Anyway, Singularity Criticality is an empirical reality; it’s observable.