takes on takeoff (or: Why Aren’t The Models Mesaoptimizer-y Yet)
here are some reasons we might care about discontinuities:
alignment techniques that apply before the discontinuity may stop applying after / become much less effective
makes it harder to do alignment research before the discontinuity that transfers to after it (because something is qualitatively different after the jump)
second order effect: may result in false sense of security
there may be less/negative time between a warning shot and the End
harder to coordinate and slow down
harder to know when the End Times are coming
alignment techniques that rely on systems supervising slightly smarter systems (e.g. RRM) depend on there not being a big jump in capabilities
I think these capture 90% of what I care about when talking about fast/slow takeoff, with the first point taking up a majority
(it comes up a lot in discussions that I can’t quite pin down exactly what my interlocutor’s beliefs about fastness/slowness imply. if we can fully list out all the things we care about, we can screen off any disagreement about definitions of the word “discontinuity”)
some things that seem probably true to me and which are probably not really cruxes:
there will probably be a pretty big amount of AI-caused economic value and even more investment into AI, and AGI in particular (not really a bold prediction, given the already pretty big amount of these things! but a decade ago it may have been plausible nobody would care about AGI until the End Times, and this appears not to be the case)
continuous changes of inputs like compute or investment or loss (not technically an input, but whatever) can result in discontinuous jumps in some downstream metric (accuracy on some task, number of worlds paperclipped); see the toy sketch after this list
almost every idea is in some sense built on some previous idea, but this is not very useful because there exist many ideas [citation needed] and it’s hard to tell which ones will be built on to create the idea that actually works (something something hindsight bias). this means you can’t reason ahead of time about how those ideas will change alignment properties, or use them as a warning shot
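a minimal toy sketch of the “continuous input, discontinuous downstream metric” point (the logistic curve, the 50-step chained task, and all the numbers are illustrative assumptions, not claims about any real system): if a task only counts as solved when many steps are all right, and per-step accuracy improves smoothly with compute, task-level accuracy can still jump from ~0 to ~1 over a narrow range

```python
import math

def per_step_accuracy(log_compute: float) -> float:
    # assumed: per-step accuracy improves smoothly with log-compute (toy logistic curve)
    return 1.0 / (1.0 + math.exp(-(log_compute - 5.0)))

def task_accuracy(log_compute: float, k: int = 50) -> float:
    # the task is only solved if all k steps are right, so per-step accuracy compounds
    return per_step_accuracy(log_compute) ** k

# the per-step metric changes smoothly, but the 50-step task goes from ~0 to ~1
# over a narrow band of log-compute: continuous input, discontinuous-looking output
for c in [4.0, 6.0, 8.0, 10.0, 12.0]:
    print(f"log_compute={c:5.1f}  per-step={per_step_accuracy(c):.3f}  task={task_accuracy(c):.3f}")
```

the same shape falls out of any metric that thresholds or composes a smoothly improving underlying quantity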
possible sources of discontinuity:
breakthroughs: at some point, some group discovers a brand new technique that nobody had ever thought of before / nobody had made work before because they were doing it wrong in some way / “3 hackers in a basement invent AGI”
depends on how efficient you think the research market is. I feel very uncertain about this
importantly, I think cruxes here may cash out into other predictions about how efficient the world is generally, in ways unrelated to AI, which can be tested before the End Times
seems like a subcrux of this is whether the new technique immediately works very well, or whether it takes a nontrivial amount of time to scale it up to working at SOTA scale
overdetermined “breakthroughs”: some technique that didn’t work (and couldn’t have been made to work) at smaller scales starts working at larger scales. lots of people independently would have tried the thing
importantly, under this scenario it’s possible for something to simultaneously (a) be very overdetermined and (b) have very different alignment properties
very hard to know which of the many ideas that don’t work might be the one that suddenly starts working with a few more OOMs of compute
at some scale, there is just some kind of grokking without any change in techniques, and the internal structure and generalization properties of the networks change a lot. trends break because of some deep change in the structure of the network
mostly isomorphic to the previous scenario actually
for example, in worlds where deceptive alignment happens because at x params suddenly it groks to mesaoptimizer-y structure and the generalization properties completely change
at some scale, there is “enough” to hit some criticality threshold of some kind of thing the model already has. the downstream behavior changes a lot but the internal structure doesn’t change much beyond the threshold. importantly while obviously some alignment strategies would break, there are potentially invariants that we can hold onto
for example, in worlds where deceptive alignment happens because of ontology mismatch and ontologies get slowly more mismatched with scale, and then past some threshold it snaps over to the deceptive generalization (toy sketch below)
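a minimal toy sketch of that threshold picture (the two scoring functions and the crossover point are made-up illustrative assumptions, not a model of actual deceptive alignment): if behavior is an argmax over strategies whose scores shift continuously with scale, the internals move smoothly but the selected behavior snaps over at the crossover

```python
def aligned_score(scale: float) -> float:
    # assumed: the "do the task as intended" strategy improves slowly with scale
    return 1.0 + 0.1 * scale

def deceptive_score(scale: float) -> float:
    # assumed: the mismatched-ontology / play-the-training-game strategy improves faster
    return 0.2 * scale

def behavior(scale: float) -> str:
    # behavior is an argmax over strategies: both scores move continuously with scale,
    # but the selected behavior flips discontinuously at the crossover (here, scale > 10)
    return "aligned" if aligned_score(scale) >= deceptive_score(scale) else "deceptive"

for s in range(0, 21, 2):
    print(f"scale={s:2d}  aligned={aligned_score(s):.1f}  deceptive={deceptive_score(s):.1f}  -> {behavior(s)}")
```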
I think these can be boiled down to 3 more succinct scenario descriptions:
breakthroughs that totally change the game unexpectedly
mechanistically different cognition suddenly working at scale
more of the same cognition is different