Something which seems missing from this discussion is the level of confidence we can have for or against CRT. It doesn’t make sense to just decide whether CRT seems more true or false and then go from there. If CRT seems at all possible (i.e., an outside-view probability of at least 1%), doesn’t that carry most of the strategic implications of CRT itself? (Like the ones you list in the relevance-to-xrisk section.) [One could definitely make the case for probabilities lower than 1%, too, but I’m not sure where the cutoff should be, so I said 1%.]
My personal position isn’t CRT (although inner-optimizer considerations have brought me closer to that position), but rather, not-obviously-not-CRT. Strategies which depend on not-CRT should go along with actually-quite-strong arguments against CRT, and/or technology for making CRT not true. It makes sense to pursue those strategies, and I sometimes think about them. But achieving confidence in not-CRT is a big obstacle.
Another obstacle to those strategies is that, even if future AGI isn’t sufficiently strategic/agenty/rational to fall into the “rationality attractor”, it seems like it would be capable enough that someone could use it to create something agenty/rational enough for CRT. So even if CRT-type concerns don’t apply to super-advanced image classifiers or whatever, the overall concern might stand because at some point someone applies the same technology to RL problems, or asks a powerful GAN to imitate agentic behavior, etc.
Of course, it doesn’t make sense to generically argue that we should be concerned about CRT in the absence of a proof of its negation. There has to be some level of background reason for thinking CRT might be a concern. For example, although atomic weapons are concerning in many ways, it would not have made sense to raise CRT concerns about atomic weapons and ask for a proof of not-CRT before testing them. So there has to be something about AI technology which specifically raises CRT as a concern.
One “something” is, simply, that natural instances of intelligence are associated with a relatively high degree of rationality/strategicness/agentiness (relative to non-intelligent things). But I do think there’s more reasoning to be unpacked.
I also agree with other commenters that CRT is not quite the right thing to point at, but this issue of the degree of confidence in doubt-of-CRT was the thing that struck me as most critical. The standard of evidence for raising CRT as a legitimate concern seems like it should be much lower than the standard of evidence for setting that concern aside.
I basically agree with your main point (and I didn’t mean to suggest that it “[makes] sense to just decide whether CRT seems more true or false and then go from there”).
But I think it’s also suggestive of an underlying view that I disagree with, namely: (1) “we should aim for high-confidence solutions to AI-Xrisk”. I think this is a good heuristic, but from a strategic point of view, I think what we should be doing is closer to: (2) “aim to maximize the rate of Xrisk reduction”.
Practically speaking, a big implication of favoring (2) over (1) is giving a relatively higher priority to research on making unsafe-looking approaches (e.g. reward modelling + DRL) safer (in expectation).