I’m not going to respond to everything you’re saying here right now. It’s pretty likely I won’t end up responding to everything you’re saying at any point, so apologies for that.
Here are some key claims I want to make:
Serial speed is key: Speeding up theory work (e.g. ARC theory) by 5-10x should be quite doable with human-level AIs, because AIs run at much faster serial speeds. This is a key difference between adding AIs and adding humans: theory can be hard to parallelize, which makes adding humans look worse than increasing serial speed. I’m not confident in speeding up theory work by >30x with controlled, roughly human-level AI, but this doesn’t seem impossible.
Access to human-level AIs makes safety work much more straightforward: A key difference between current safety work and future safety work is that in the future we’ll have access to the exact AIs we’re worried about. I expect this opens up a bunch of empirical work which is quite useful and relatively easy to scalably automate with AIs. I think this work could extend considerably beyond “patches”. (The hope here is similar to model organisms, but somewhat more general.)
The research target can be trusted human-level systems instead of superhuman systems. One possible story for victory goes something like “control of untrusted AIs → trustworthy human-level (or slightly superhuman) AIs → [some next target like fully scalable alignment]”. If human researchers are literally fully obsoleted by reasonably trustworthy human-level AIs, and these AIs can collectively speed up any field of alignment by >30x, we should be in a radically better position. These trustworthy AI researchers could work on fully scalable alignment, on control or alignment of the next generation of smarter AIs, or on some combination of these. Researching how to make trusted human-level AIs seems much more tractable than researching how to align wildly superhuman systems (though both are hard to measure).
Depending on control alone results in a very bad absolute level of risk, but it still might be our best option. I estimated 1-5% doom per year above, though my exact guess will vary depending on various factors. So coordination to do better than this would be great.
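(As an aside on what a risk of that size amounts to over time: the sketch below is a minimal illustration, assuming, purely for arithmetic’s sake, that the per-year risk is constant and independent across years, and using the 1% and 5% endpoints of the range above.)

```python
# Minimal sketch: cumulative chance of at least one catastrophe over a
# time horizon, assuming an independent, constant per-year risk p
# (an idealization; 1% and 5% are just the endpoints of the 1-5%/year
# range estimated above).
def cumulative_risk(p: float, years: int) -> float:
    return 1 - (1 - p) ** years

for p in (0.01, 0.05):
    print(f"{p:.0%}/year over 10 years -> {cumulative_risk(p, 10):.0%}")
# 1%/year over 10 years -> 10%
# 5%/year over 10 years -> 40%
```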
Here are some other less important claims which feed into my overall takes:
Current AIs aren’t useful for theory yet partially because they’re too dumb. They suck at math.
I think part of the problem with current AIs is getting them enough context. This seems like a doable technical problem, one which probably needs to be solved for AIs to reach human level anyway, and I expect it to be solved prior to AIs taking over.
My understanding is that ARC theory’s current work reduces to trying to solve a bunch of relatively straightforward math problems, and that if they could solve all of these problems very quickly, this would massively accelerate their work. Based on my understanding of their methodology, I expect this to remain roughly true going forward, but I’m not very confident here.
AIs have other structural advantages beyond serial speed which will make speeding things up with AIs easier than speeding things up with humans.
This is clarifying, thanks.
A few thoughts:
“Serial speed is key”:
This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn’t need to be re-directed frequently [EDIT: the preceding was poorly worded—I meant that if prior to the availability of AI assistants this were true, it’d allow a lot of speedup as the AIs take over this work; otherwise it’s less clearly so helpful].
Perhaps this is true for ARC—that’s encouraging (though it does again make me wonder why they don’t employ more mathematicians—surely not all the problems are serial on a single critical path?).
I’d guess it’s less often true for MIRI and John.
Of course once there’s a large speedup of certain methods, the most efficient methodology would look different. I agree that 5x to 10x doesn’t seem implausible.
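(To make the serial-speed point concrete, here is a toy Amdahl’s-law sketch of why faster workers can beat more workers when work is serially bottlenecked. The 90% serial fraction is purely an assumed number for illustration, not a claim about ARC’s or anyone’s actual workload.)

```python
# Toy Amdahl's-law model: work splits into a serial fraction (helped only
# by faster workers) and a parallel fraction (helped by both speed and
# headcount). Baseline completion time is normalized to 1.
def speedup(serial_frac: float, n_workers: int, speed_mult: float) -> float:
    parallel_frac = 1 - serial_frac
    new_time = serial_frac / speed_mult + parallel_frac / (speed_mult * n_workers)
    return 1 / new_time

# Assume (for illustration) that 90% of theory work is serially bottlenecked:
print(speedup(0.9, n_workers=10, speed_mult=1))   # ~1.1x from 10x the people
print(speedup(0.9, n_workers=1, speed_mult=10))   # 10x from one 10x-faster AI
```

On this toy model, extra headcount barely helps once the serial fraction dominates, while raw serial speed passes straight through, which is consistent with the 5x to 10x estimate discussed above.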
“...in the future we’ll have access to the exact AIs we’re worried about”:
We’ll have access to the ones we’re worried about deploying.
We won’t have access to the ones we’re worried about training until we’re training them.
I do buy that this makes safety work for that level of AI more straightforward—assuming we’re not already dead. I expect most of the value is in what it tells us about a more general solution, if anything—similarly for model organisms. I suppose it does seem plausible that this is the first level at which we see a qualitatively different kind of general reasoning/reflection that leads us in new theoretical directions. (Though I note that this makes [this is useful to study] correlate strongly with [this is dangerous to train].)
“Researching how to make trustworthy human level AIs seems much more tractable than researching how to align wildly superhuman systems”:
This isn’t clear to me. I’d guess that the same fundamental understanding is required for both. “trustworthy” seems superficially easier than “aligned”, but that’s not obvious in a general context.
I’d expect that implementing the trustworthy human-level version would be a lower bar—but that the same understanding would show us what conditions would need to obtain in either case. (certainly I’m all for people looking for an easier path to the human-level version, if this can be done safely—I’d just be somewhat surprised if we find one)
“So coordination to do better than this would be great”.
I’d be curious to know what you’d want to aim for here—both in a mostly ideal world, and what seems most expedient.
As far as the ideal, I happened to write something about this in another comment yesterday. Excerpt:
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long-run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
As far as expedient, something like:
Demand that labs have good RSPs (or something similar), using both inside and outside game; try to get labs to fill in the tricky future details of these RSPs as early as possible, without depending on “magic” (speculative future science which hasn’t yet been verified). Have people motivated by AI takeover risk work on the underlying tech and implementation.
Work on policy and aim for powerful US policy interventions in parallel. Other countries could also be relevant.
Both of these are unlikely to perfectly succeed, but they seem like good directions to push on.
I think pushing for AI lab scaling pauses is probably net negative right now, but I don’t feel very strongly either way (it mostly just feels not that leveraged overall). Slowing down hardware progress seems clearly good if we could do it at low cost, but seems super intractable.
Thanks, this seems very reasonable. I’d missed your other comment.
(Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))
Corresponding comment text:
This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn’t need to be re-directed frequently [EDIT: the preceding was poorly worded—I meant that if prior to the availability of AI assistants this were true, it’d allow a lot of speedup as the AIs take over this work; otherwise it’s less clearly so helpful].
I think I disagree with what you meant, but not that strongly. It’s not that important, so I don’t really want to get into it. Basically, I don’t think that “well-defined” is that important (it’s not obviously required for some ability to judge the finished work), and I don’t think “re-direction frequency” is the right way to think about it.