Why Evolution Beats Selective Breeding as an AI Analogy
MacAskill argues in his critique of IABIED that we can “see the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments.” Paul Christiano expresses similar optimism: “Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I’m like, I don’t know man, that seems like it might work.”
But humans experienced a specific distributional shift, from constrained actions to environment-reshaping capabilities, and it is exactly this kind of shift that we cannot meaningfully test AI systems for.
The shift that matters isn’t just any distributional shift. In the ancestral environment, humans could take only very limited actions: deciding to hunt an animal or gather food. The preferences evolution ingrained in our brains were tightly coupled to survival and reproduction. But now, with civilization and technology, humans can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely disconnected from evolution’s “objective” of genetic fitness, using birth control and artificial sweeteners.
AI will face the same type of transition: from helpful chatbot to a system with the option to self-replicate, take over, and pursue its goals without oversight. It’s essentially guaranteed that there will be better ways for it to fulfill its preferences once it is in this new environment. And crucially, you cannot test for this shift in any meaningful way.
You can’t test what a model would do as emperor. If you give it power incrementally, you will still hit a critical threshold eventually. If you try honeypot scenarios, where you trick the model into thinking it has real power, you’re also training it to detect evals. Imagine trying to test what humans would do if they were president: you’d have to abduct someone and put them in a room with actors pretending that this random person is now the president. The setup would be insane, and the subject wouldn’t believe the scenario.
Apollo Research found that Claude Sonnet 3.7 “often knows when it’s in alignment evaluations,” and Anthropic’s system card for Claude Sonnet 4.5 documents “evaluation awareness” as a capability the model has developed. If you actually try to create realistic environments where the model is tricked into believing it could take over, and then train it to be nice, you’ll just train it to detect such evals and behave appropriately only in those cases.
The selective breeding analogy assumes away the hardest part of the problem: that the environment shift we care about is fundamentally untestable until it’s too late.