I have a lot of implicit disagreements.

Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.
This is because ethics isn’t science: it doesn’t “hit back” when the AI is wrong. So an AI can honestly mix up humans’ systematic flaws with the things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
Defending against this kind of “sycophancy++” failure mode doesn’t look like defending against scheming. It looks like solving outer alignment really well.
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn’t nearly as true.
I can see an argument for “outer alignment is also important, e.g. to avoid failure via sycophancy++”, but this doesn’t seem to disagree with this post? (I understand the post to be about what you should do about scheming, rather than about whether scheming should be the focus.)
> Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn’t nearly as true.
I don’t understand why this is true (I don’t claim the reverse is true either). I don’t expect a great deal of correlation / implication here.
The second thing impacts the first thing :) If a lot of scheming is due to poor reward structure, and we should work on better reward structure, then we should work on scheming prevention.
> This is because ethics isn’t science: it doesn’t “hit back” when the AI is wrong. So an AI can honestly mix up humans’ systematic flaws with the things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
I’d say the main reason for this is that morality is relative and, much more importantly, that morality is much, much more choosable than physics, which means that where it ends up is less determined than it is in the case of physics.
The crux IMO is that this sort of general failure mode is much more amenable to iterative solutions, whereas scheming isn’t, so I expect it to be solved well enough in practice. I therefore don’t think we need to worry about non-scheming failure modes that much (except in the cases where they set us up for even bigger failures of humans controlling AI/the future).
I agree that in some theoretical infinite-retries game (one that doesn’t allow the AI to permanently convince the human of anything), scheming has a much longer half-life than “honest” misalignment. But I’d emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they’re a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, then you don’t get to iterate as much as you’d like.