The whole problem is that alignment, as in “AI doesn’t want to take over in a bad way”, is not assumed to be solved.
That’s a broken way of thinking about it.
Doomers see AI alignment as a binary: either perfect and final, or nonexistent. But no other form of safety works like that. No one talks of “solving” car safety once and for all like a maths problem; instead it’s treated as an engineering problem, a matter of making steady, incremental progress. Good enough alignment is good enough!
So you think your alignment training works for your current, pre-takeover version of ASI, but in fact previous versions have already been scheming for a long time, so running a version capable of takeover suddenly creates a discontinuity for you.
I’ll make the point that safety engineering can have discontinuous failure modes. The Challenger was destroyed because some O-ring seals in a booster joint had gotten too cold before launch and failed to contain the hot combustion gas, which escaped and blew up the rocket. The function of these O-rings is pretty binary: either the gas is kept in and the rocket works, or it’s let out and the whole thing explodes.
AI research might end up with similar problems. It’s probably true that there is such a thing as good enough alignment, but that doesn’t necessarily imply that progress on it can be made incrementally, or that deployment doesn’t have all-or-nothing stakes.
I don’t think anyone is against incremental progress. It’s just that if after incremental progress AI takes over, then it’s not good enough alignment. And what’s the source of confidence that it is enough?
“Final or nonexistent” seems appropriate for scheming detection: if you miss even one way for the AI to hide its intentions, it will take over. So yes, the degree of scheming in a broad sense, and how much of it you can prevent, is a crux that other things depend on. Again, I don’t see how you can be confident that future AI wouldn’t scheme.
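To illustrate the compounding problem with detection coverage, here is a toy calculation; the number of hiding channels and the per-channel detection rate are made up purely for illustration, not taken from the discussion above.

```python
# Toy illustration only: assumed numbers, not estimates from this thread.
# If an AI has n independent channels through which it could hide its
# intentions, and each channel is caught with probability p, the chance
# of catching *every* channel shrinks quickly as n grows.
p = 0.95  # assumed per-channel detection reliability
for n in (5, 10, 20):
    p_catch_all = p ** n
    print(f"n={n:2d}: P(nothing missed) = {p_catch_all:.2f}, "
          f"P(at least one miss) = {1 - p_catch_all:.2f}")
# n= 5: 0.77 vs 0.23;  n=10: 0.60 vs 0.40;  n=20: 0.36 vs 0.64
```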
It’s just that if after incremental progress AI takes over,
Why would that be discontinuous?
if you miss even one way for the AI to hide its intentions, it will take over.
Assuming it has an intention, and a malign one. Deception depends on a chain of assumptions, and each of them has to hold with probability well over 90% to support a conclusion of near-certain doom.
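To make the arithmetic behind the “chain of assumptions” point concrete, here is a minimal sketch; the list of assumptions and the 0.9 figure are illustrative placeholders, not claims from either side.

```python
# Illustrative only: placeholder assumptions and probabilities.
assumptions = {
    "the AI develops stable goals of its own": 0.9,
    "those goals are misaligned with ours":    0.9,
    "it chooses deception over cooperation":   0.9,
    "the deception goes undetected":           0.9,
    "a takeover attempt succeeds":             0.9,
}

# Under independence, the conclusion is only as likely as the product
# of the links in the chain.
joint = 1.0
for claim, prob in assumptions.items():
    joint *= prob
print(f"P(doom) if every link is 90%: {joint:.2f}")              # ~0.59

# For a 99% conclusion across 5 independent links, each link must be:
n = len(assumptions)
print(f"required per-link probability: {0.99 ** (1 / n):.3f}")   # ~0.998
```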
Again, I don’t see how you can be confident that future AI wouldn’t scheme.
I’m not arguing for 0% p(doom); I’m arguing against 99%.
If all AIs are scheming, they can take over together. If you instead assume a world with a powerful AI that is actually on humanity’s side, then at some level of power of the friendly AI you can probably run an unaligned AI and it will not be able to do much harm. But just assuming there are many AIs doesn’t solve scheming by itself: if training actually works as badly as predicted, then none of the many AIs would be aligned enough.
That’s a broken way of thinking about it.
Doomers see AI alignment as a binary: either perfect and final, or nonexistent. But no other form of safety works like that. No one talks of “solving” car safety once and for all like a maths problem; instead it’s treated as an engineering problem, a matter of making steady, incremental progress. Good enough alignment is good enough!
Scheming is an assumption, not a fact.
I’ll make the point that safety engineering can have discontinuous failure modes. The Challenger was destroyed because some O-ring seals in a booster joint had gotten too cold before launch and failed to contain the hot combustion gas, which escaped and blew up the rocket. The function of these O-rings is pretty binary: either the gas is kept in and the rocket works, or it’s let out and the whole thing explodes.
AI research might end up with similar problems. It’s probably true that there is such a thing as good enough alignment, but that doesn’t necessarily imply that progress on it can be made incrementally, or that deployment doesn’t have all-or-nothing stakes.
Might. IABIED requires such a discontinuity to be almost certain.
I don’t think anyone is against incremental progress. It’s just that if after incremental progress AI takes over, then it’s not good enough alignment. And what’s the source of confidence that it is enough?
“Final or nonexistent” seems appropriate for scheming detection: if you miss even one way for the AI to hide its intentions, it will take over. So yes, the degree of scheming in a broad sense, and how much of it you can prevent, is a crux that other things depend on. Again, I don’t see how you can be confident that future AI wouldn’t scheme.
Why would that be discontinuous?
Assuming it has an intention, and a malign one. Deception depends on a chain of assumptions, and each of them has to hold with probability well over 90% to support a conclusion of near-certain doom.
I’m not arguing for 0% p(doom); I’m arguing against 99%.
Because incremental progress missed deception.
I agree such confidence lacks justification.
I’m talking about the how of takeover. Could any AI, even one of many, take over successfully in its first attempt?
If all AIs are scheming, they can take over together. If you instead assume a world with a powerful AI that is actually on humanity’s side, then at some level of power of the friendly AI you can probably run an unaligned AI and it will not be able to do much harm. But just assuming there are many AIs doesn’t solve scheming by itself: if training actually works as badly as predicted, then none of the many AIs would be aligned enough.
All AIs scheming co-operatively is less likely than one AI scheming.
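The disagreement in the last two comments can be put into numbers with a toy model; the probabilities, the count of AIs, and the all-or-nothing “shared flaw” mechanism are assumptions for illustration only.

```python
# Toy model only: assumed numbers and an assumed correlation mechanism.
p_one = 0.5  # assumed chance that any single AI schemes
n = 10       # assumed number of separately built AIs

# If scheming arises independently in each AI, "all of them scheme"
# is far less likely than "one of them schemes".
print(f"P(all {n} scheme | independent): {p_one ** n:.4f}")              # ~0.001

# If scheming is instead a consequence of a flaw shared by how all of
# them are trained, the events are perfectly correlated: whenever one
# schemes, they all do.
p_shared_flaw = 0.5
print(f"P(all {n} scheme | shared training flaw): {p_shared_flaw:.2f}")  # 0.50
```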