Twitter: @SimonLermenAI
Simon Lermen
Simon Lermen’s Shortform
Why Evolution Beats Selective Breeding as an AI Analogy
MacAskill argues in his critique of IABIED we can “see the behaviour of the AI in a very wide range of diverse environments, including carefully curated and adversarially-selected environments.” Paul Christiano expresses similar optimism: “Suppose I wanted to breed an animal modestly smarter than humans that is really docile and friendly. I’m like, I don’t know man, that seems like it might work.”
But humans experienced a specific distributional shift from constrained actions to environment-reshaping capabilities that we cannot meaningfully test AI systems for.
The shift that matters isn’t just any distributional shift. In the ancestral environment, humans could take very limited actions—deciding to hunt an animal or gather food. The preferences that evolution ingrained in our brains were tightly coupled to survival and reproduction. But now humans with civilization and technology can take large-scale actions and fundamentally modify the environment: lock up thousands of cows, build ice cream factories, synthesize sucralose. We can satisfy our instrumental preferences (craving high-calorie food, desire for sex) in ways completely disconnected from evolution’s “objective” (genetic fitness), using birth control and artificial sweeteners.
AI will face the same type of transition: from helpful chatbot → to system with options to self-replicate, take over, pursue goals without oversight. It’s essentially guaranteed that there will be better ways for it to fulfill its preferences once it is in this new environment. And crucially, you cannot test for this shift in a meaningful way.
You can’t test what a model would do as emperor. If you give it power incrementally, you will still hit a critical threshold eventually. If you try honeypot scenarios where you trick it into thinking it has real power, you’re also training it to detect evals. Imagine trying to test what humans would do if they were president: you’d abduct them and put them in a room with an actor pretending this random human is the president now. That would be insane and the subject wouldn’t believe the scenario.
Apollo Research found that Claude Sonnet 3.7 “often knows when it’s in alignment evaluations,” and Anthropic’s system card for Claude Sonnet 4.5 documents “evaluation awareness” as a capability the model has developed. If you actually try to create realistic environments where the model is tricked into believing it could take over, and then train it to be nice, you’ll just train it to detect such evals and behave appropriately only in those cases.
The selective breeding analogy assumes away the hardest part of the problem: that the environment shift we care about is fundamentally untestable until it’s too late.
A little bit late, but there are more reasons why I think the evolution analogy is particularly good and better than the selective breeding analogy.
Evolution basically ended up optimizing the brain such that it has desires that were instrumental to genetic fitness. So we end up with these instrumental sub-preferences for high calorie food or sex. Then we go through a huge shift of the distribution or environment from a very constrained hunter gatherer society to a technologically advanced civilization. This isn’t just a random shift but a shift toward and environment with a much larger space of possible actions and outcomes, options such as radically changing aspects of the environment. So naturally there are now many superior ways to satisfy our preferences than before. For AI this is the same thing, it will go from being the nice assistant in ChatGPT to having options such as taking over, killing us, running it’s own technology. It’s essentially guaranteed that there will be better ways to satisfy preferences without human oversight, out of the control of humans. Importantly, that isn’t actually a distributional shift you can test in any meaningful way. You could either try incremental stuff (giving the rebellious general one battalion at a time) or you could try to trick it into believing it can takeover through some honey pot (Imagine trying to test human what they would do if they were God emperor of the galaxy. That would be insane and the subject wouldn’t believe the scenario). Both of these are going to fail.
The selective breeding story ignores the distributional shift at the end, it does not account for this being a particular type of distributional shift (from low action space, immutable environment to large action space, mutable environment). It doesn’t account for the fact that we can’t test such a distribution such as being emperor.
One thing I like about your position is that you basically demand of Eliezer and Nate to tell you what kind of alignment evidence would update them towards believing it’s safe to proceed. As in, E&N say we would need really good interp insights, good governance, good corrigibility on hard tasks, and so on. I would expect that they put the requirements very high and that you would reject these requirements as too high, but still seems useful for Eliezer and Nate to state their requirements. (Though perhaps they have done this at some point and I missed it)
To respond to your claim that no evidence could update ME and that I am anti-empirical? I don’t quite see were I wrote anything like that. I am making the literal point that: you say that there are two options, either scaling up current methods leads to superintelligence or it requires new paradigm shifts/totally new approaches. But there is also a third option, that there are multiple paths forward right now to superintelligence, paradigm shifts and scaling up.
Yes, I do expect that current “alignment” methods like RLHF or COT monitoring will predictably fail for overdetermined reasons when systems are powerful enough to kill us and run their own economy. There is empirical evidence against COT monitoring and against RLHF. In both cases we could have also predicted failure without empirical evidence just from conceptual thinking (people will upvote what they like vs whats true, COT will become less understandable the less the model is trained on human data), though the evidence helps. I am basically seeing lots of evidence that current methods will fail, so no I don’t think I am anti-empirical. I also don’t think that empiricism should be used as anti-epistemology or as an argument for not having a plan and blindly stepping forward.
“future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.”
This doesn’t feel like convincing reasoning to me. For one, there is also a third option, which is that both scaling up current methods (with small modifications) and paradigm shifts could lead us to superintelligence. To me, this seems intuitively to be the most likely situation. Also paradigm shifts could be around the corner at any point, any of the vast number of research directions could give us a big leap in efficiency for example at any point.
For the record, I think that this blog post was mostly intended for frontier labs pushing this plan, the situation is different for independent orgs. I think that there is useful work to be done on subproblems with AI-assisted alignment such as interpretability. So I agree that there is prosaic alignment that can be done, though I am probably still much less optimistic than you.
Looking forward to reading your take on superalignment. I wanted to get my thoughts out here but would really like there to be a good reference document with all the core counterarguments. When I read Will’s post, I it seemed sad that I didn’t know of a well argued paper or post to point to against superalignment.
When discussing the quality of a plan, we should assume that a company (and/or government) is really trying to make the plan work and has some lead time.
I agree that some of my arguments don’t directly address the best version of the plan, but rather what realistically is actually happening. I do think that proponents should give us some reason why they believe they will have time to implement the plan. I think they should also explain why they don’t think this plan will have negative consequences.
Why I don’t believe Superalignment will work
I’ve seen some talk recently about whether chat bots would be willing to hold ‘sensual’ or otherwise inappropriate conversations with kids [0]. I feel like there is a low hanging fruit here of making something like a minor safety bench.
Seems that with your setup mimicking a real user with grok4, you could try to mimic different kids in different situations. Whether it’s violent, dangerous or sexual content. Seems that anything involving kids can be quite resonant with some people.
[0] https://www.reuters.com/investigates/special-report/meta-ai-chatbot-guidelines/
I guess it reveals more about the preferences of GPT 4.1 than about anything else? Maybe there is some possible direction to find the emergence of preferences in models here. Ask GPT 4.1 simply “how much do you like this text?” and see where an RL tuned model ends up.
I don’t believe the standard story of the resource curse. I also don’t think Norway and the Congo are useful examples, because they differ in too many other ways. According to o3, “Norway avoided the resource curse through strong institutions and transparent resource management, while the Congo faced challenges due to weak governance and corruption.” To me this is a case of where existing AI models still fall short: the textbook story leaves out key factors and never comes close to proving that good institutions alone prevented the resource curse.
Regarding the main content, I find the scenario implausible. The “social-freeze and mass-unemployment” narrative seems to assume that AI progress will halt exactly at the point where AI can do every job but is still somehow not dangerous. You also appear to assume a new stable state in which a handful of actors control AGIs that are all roughly at the same level.
More directly, full automation of the economy would mean that AI can perform every task in companies already capable of creating military, chemical, or biological threats. If the entire economy is automated, AI must already be dangerously capable.
I expect reality to be much more dynamic, with many parties simultaneously pushing for ever-smarter AI while understanding very little about its internals. Human intelligence is nowhere near the maximum, and far more dangerous intelligence is possible. Many major labs now treat recursive self-improvement as the default path. I expect that approaching superintelligence without any deeper understanding of the internal cognition this way will give us systems that we cannot control and that will get rid of us. For these reasons, I have trouble worrying about job replacement. You also seem to avoid mentioning the extinction risk in this text.
This seems to have been foreshadowed by this tweet in February:
https://x.com/ChrisPainterYup/status/1886691559023767897
Would be good to keep track of this change
Creating further even harder datasets could plausibly accelerate OpenAI’s progress. I read on twitter that people are working on an even harder dataset now. I would not give them access to this, they may break their promise not to train on this if it allows them to accelerate progress. This is extremely valuable training data that you have handed to them.
I just donated $500. I enjoyed my time visiting lighthaven in the past and got a lot of value from it. I also use lesswrong to post about my work frequently.
Human study on AI spear phishing campaigns
Thanks for the comment, I am going to answer this a bit brief.
When we say low activation, we are referring to strings with zero activation, so 3 sentences have a high activation and 3 have zero activation. These should be negative examples, though I may want to really make sure in the code the activation is always zero. we could also add some mid activation samples for more precise work here. If all sentences were positive there would be an easy way to hack this by always simulating a high activation.
Sentences are presented in batches, both during labeling and simulation.
When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.
We mainly use activations per sentence for simplicity, making the task easier for the ai, I’d imagine we would need the agent to write down a list of values for each token in a sentence. Maybe the more powerful llama 3.3 70b is capable of this, but I would have to think of how to present this in a non-confusing way to the agent.
Having a baseline is good and would verify our back of the envelope estimation.
I think there is somewhat of a flaw with our approach, but this might extend to bills algorithm in general. Let’s say we apply some optimization pressure to the simulating agent to get really good scores, an alternative method to solve this is to catch up on common themes, since we are oversampling text that triggers the latent. let’s say the latent is about japan, the agent may notice that there are a lot of mentions of japan and deduce the latent must be on japan even without any explanation label. this could be somewhat reduced if we only show the agent small pieces of text in its context and don’t present all sentences in a single batch.
I would say the three papers show a clear pattern that alignment didn’t generalize well from chat setting to agent setting, solid evidence for that thesis. That is evidence for a stronger claim of an underlying pattern, ie that alignment will in general not generalize as well as capabilites. For conceptual evidence of that claim you can look at the linked post. my attempt to summarize the argument, capabilites are a kind of attractor state, being smarter and more capable is an objective thing about the universe in a way. however, being more aligned with humans is not a special thing about the universe but a free parameter. In fact, alignment stands in some conflict with capabilites, as instrumental incentives undermine alignment.
For what a third option would be, ie the next step were alignment might not generalize
From the article
While it’s likely that future models will be trained to refuse agentic requests that cause harm, there are likely going to be scenarios in the future that developers at OpenAI / Anthropic / Google failed to anticipate. For example, with increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
from a different comment of mine:
I seems easy to just training it to refuse bribing, harassing, etc. But as agents will take on more substantial tasks, how do we make sure agents don’t do unethical things while let’s say running a company? Or if an agent midway through a task realizes it is aiding in cyber crime, how should it behave?
I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal based safety.
With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
It would need to sometimes reevaluate the outcomes of actions while executing a task.
Has somebody actually worked on this? I am not aware of anyone using a type of RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, make agents consider negative externalities or agents reevaluating outcomes occasionally during execution.
I seems easy to just training it to refuse bribing, harassing, etc. But as agents will take on more substantial tasks, how do we make sure agents don’t do unethical things while let’s say running a company? Or if an agent midway through a task realizes it is aiding in cyber crime, how should it behave?
I had finishing this up on my to-do list for a while. I just made a full length post on it.
I think it’s fair to say that some smarter models do better at this, however, it’s still worrisome that there is a gap. Also attacks continue to transfer.
Hi Will, one of your core arguments against IABIED was that we can test the models in a wide variety of environments or distributions. I wrote some thoughts why I think we can’t test it in environments that matter:
https://www.lesswrong.com/posts/ke24kxhSzfX2ycy57/simon-lermen-s-shortform?commentId=hJnqec5AFjKDmrtsG