I agree that in some theoretical infinite-retries game (that doesn’t allow the AI to permanently convince the human of anything), scheming has a much longer half-life than “honest” misalignment. But I’d emphasize your parenthetical. If you use a misaligned AI to help write the motivational system for its successor, or if a misaligned AI gets to carry out high-impact plans by merely convincing humans they’re a good idea, or if the world otherwise plays out such that some AI system rapidly accumulates real-world power and that AI is misaligned, or if it turns out you iterate slowly and AI moves faster than you expected, you don’t get to iterate as much as you’d like.