Let’s say that two dozen things change between the regimes for observable-failure vs failure-kills-observer.
What are some examples of the sorts of “things that change” that I should be imagining changing here?
“We can catch the AI when it’s alignment faking”?
“The AI can’t develop nanotech”?
“The incentives of the overseeing AI preclude collusion with its charge.”?
Things like those? Or is this missing a bunch?
It’s not obvious to me why we should expect two dozen things to change all at once when the AI enters the regime where, if it tried, it could succeed at killing you.
If capability gains are very fast in calendar time, then sure, I expect a bunch of things to change all at once, at least as far as we can measure. But if, as in this branch of the conversation, we’re assuming gradualism, then I would generally expect factors like the ones above to change one at a time. [1]
One class of things that might change all at once is whether “the expected value of joining an AI coup is better than the alternatives” for each individual AI, which could change in a cascade (or in a simultaneous moment of agents reasoning with Logical Decision Theory). But I don’t get the sense that’s the sort of thing you’re thinking about.
All of that, yes, alongside things like “The AI is smarter than any individual human”, “The AIs are smarter than humanity”, “The frontier models are written by the previous generation of frontier models”, “The AI has access to a bunch of options that weren’t available to it during the previous training regime”, etc., etc.