This is because ethics isn’t science: it doesn’t “hit back” when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
I’d say the main reason for this is that morality is relative and, much more importantly, far more choosable than physics, which means that where it ends up is less determined than in the case of physics.
The crux, IMO, is that this sort of general failure mode is much more amenable to iterative solutions, whereas scheming isn’t. So I expect it to be solved well enough in practice, and I don’t think we need to worry about non-scheming failure modes that much (except where they set us up for even bigger failures of humans controlling AI/the future).
I agree that in some theoretical infinite-retries game (one that doesn’t allow the AI to permanently convince the human of anything), scheming has a much longer half-life than “honest” misalignment. But I’d emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they’re a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, you don’t get to iterate as much as you’d like.