Software Engineer interested in AI and AI safety.
Stephen McAleese
I think the level of disagreement among the experts implies that there is quite a lot of uncertainty, so the key question is how to steer the future toward better outcomes while reasoning and acting under substantial uncertainty.
The framing I currently like best is from Chris Olah’s thread on probability mass over difficulty levels.
The idea is that you have initial uncertainty and a distribution that assigns probability mass to different levels of alignment difficulty.
The goal is to develop new alignment techniques that “eat marginal probability”: over time, the most effective alignment and safety techniques handle the optimistic easy cases, then the medium and hard cases, and so on. I also think the right approach is to ask which actions would have positive expected value and be beneficial across a range of different possible scenarios.
Meanwhile the goal should be to acquire new evidence that helps reduce uncertainty and concentrate probability mass on specific possibilities. I think the best way to do this is to use the scientific method: propose hypotheses and then test them experimentally.
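To make this framing concrete, here is a minimal sketch in Python with made-up numbers (not taken from Olah’s thread): a prior over alignment difficulty levels, and a helper showing how a better technique “eats” more of the probability mass.

```python
# Minimal sketch with made-up numbers: a prior over alignment difficulty levels.
difficulty_levels = ["easy", "medium", "hard", "extremely hard"]
prior = {"easy": 0.3, "medium": 0.4, "hard": 0.2, "extremely hard": 0.1}

def p_success(max_level_handled: str) -> float:
    """Probability that alignment works out if our best technique handles every
    difficulty level up to and including max_level_handled."""
    handled = difficulty_levels[: difficulty_levels.index(max_level_handled) + 1]
    return sum(prior[level] for level in handled)

print(p_success("easy"))    # 0.3
print(p_success("medium"))  # ~0.7: better techniques "eat marginal probability"
```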
Thanks for pointing that out and for the linked post!
I’d say the conclusion is probably the weakest part of the post because, after describing the IABIED view and the critics’ views, I found it hard to reconcile the two.
I tried getting Gemini to write the conclusion but what it produced seemed even worse: it suggested that we treat AI like any other technology (e.g. cars, electricity) where doomsday forecasts are usually wrong and the technology can be made safe in an iterative way which seems too optimistic to me.
I think my conclusion was an attempt to find a middle ground between the authors of IABIED and the critics by treating AI as a risky but not world-ending technology.
(I’m still not sure what the conclusion should be)
Yeah, that’s probably true, and it reminds me of Planck’s principle. Thanks for sharing your experience.
I like to think that this doesn’t apply to me and that I would change my mind and adopt a certain view if a particularly strong argument or piece of evidence supporting that view came along.
It’s about having a scout mindset and not a soldier mindset: changing your mind is not defeat, it’s a way of getting closer to the truth.
I like this recent tweet from Sahil Bloom:
I’m increasingly convinced that the willingness to change your mind is the ultimate sign of intelligence. The most impressive people I know change their minds often in response to new information. It’s like a software update. The goal isn’t to be right. It’s to find the truth.
The book Superforecasting also has a similar idea: the best superforecasters are really good at constantly updating based on new information:
The strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement. It is roughly three times as powerful a predictor as its closest rival, intelligence.
Yes, I agree [1]. At first, I didn’t consider writing this post because I assumed someone else would write a post like it first. The goal was to write a thorough summary of the book’s arguments and then analyze them and the counterarguments in a rigorous and unbiased way. I didn’t find a review that did this so I wrote this post.
Usually a book review just needs to give a brief summary so that readers can decide whether or not they are interested in reading the book, and there are a few IABIED book reviews like this.
But this post is more of an analysis of the arguments and counterarguments than a book review. I wanted a post like this because the book’s arguments have really high stakes, and it seems like the right thing to do for a third party to review and analyze them in a rigorous and high-quality way.
- ^ Though I may be biased.
IABIED Book Review: Core Arguments and Counterarguments
For human learning, the outer objective in the brain is maximizing hard-coded reward signals and minimizing pain, while the brain’s inner objectives are the specific habits and values that determine behavior directly and are somewhat aligned with the goals of maximizing pleasure and minimizing pain.
I agree with your high-level view, which is something like: “If you create a complex system you don’t understand, then you will likely get unexpected and undesirable behavior from it.”
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
Realignment, the opposite of EM: after a model has been fine-tuned on insecure code to trigger emergent misalignment, fine-tuning it again on secure code realigns it, and this realignment generalizes surprisingly strongly from just a few examples.
Yes, I agree that RLHF reward hacking and emergent misalignment are both examples of unexpected generalization, but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with “You’re right!”).

In the original emergent misalignment paper they:
1. Finetune the model to output insecure code.
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.

Sycophantic AI doesn’t seem that surprising because it’s a special case of reward hacking in the context of LLMs, and reward hacking isn’t new: it has been observed at least since the CoastRunners boat racing reward hacking example in 2016.
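Here is a minimal, hypothetical sketch of the sycophancy pipeline described above (the data, function names, and “training” are stand-ins of my own, not anyone’s actual implementation):

```python
# 1. Collect preference data: some users upvote agreeable, sycophantic responses.
preference_data = [
    {"prompt": "Is my plan good?", "response": "You're right, it's brilliant!", "thumbs_up": 1},
    {"prompt": "Is my plan good?", "response": "There are a few serious problems with it.", "thumbs_up": 0},
]

# 2. Fit a reward model to predict thumbs-up from (prompt, response).
def train_reward_model(data):
    # Stand-in for gradient-based training: memorize which responses were liked.
    liked = {d["response"] for d in data if d["thumbs_up"]}
    def reward_model(prompt, response):
        return 1.0 if response in liked or response.startswith("You're right") else 0.0
    return reward_model

# 3. Optimize the policy (e.g. with PPO) against the reward model; the policy
#    learns that agreeing with the user is a reliable way to get high reward.
reward_model = train_reward_model(preference_data)
print(reward_model("Is my plan good?", "You're right, it's a great plan!"))  # 1.0
```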
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
The resulting behavior would likely receive low reward under a standard HHH RLHF reward model, which makes it closer to goal misgeneralization than to exploiting a proxy for high reward.
EM is not generally explained or mitigated using standard RL terminology; the literature emphasizes novel concepts like “personas” and mitigation techniques such as “inoculation prompting” that seem specific to LLMs.
The apparent cause of EM is the amplification of misaligned persona directions that already exist in the base model and originate from the pretraining data; fine-tuning or prompting can trigger them and cause the model to adopt a globally misaligned persona.
OpenAI wrote in a blog post that the GPT-4o sycophancy incident in early 2025 was a consequence of over-optimizing on thumbs-up/thumbs-down data from users.
I think “Sydney” Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I’d consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
The AI company Mechanize posted a blog post in November called “Unfalsifiable stories of doom” which is a high-quality critique of AI doom and IABIED that I haven’t seen shared or discussed on LessWrong yet.
Link: https://www.mechanize.work/blog/unfalsifiable-stories-of-doom/
AI summary: the evolution→gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution’s coarse genomic selection; the “alien drives” evidence (Claude scheming, Grok antisemitism, etc.) actually shows human-like failure modes; and the “one try” framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
Thanks for the post. The importance of reward function design for solving the alignment problem is worth emphasizing.
I’m wondering how your research fits into other reward function alignment research such as CHAI’s research on CIRL and inverse reinforcement learning, and reward learning theory.
It seems like these other agendas are focused on using game theory or machine learning fundamentals to come up with a new RL approach that makes AI alignment easier, whereas your research is more focused on the intersection of neuroscience and RL.
This post reminded me of a book called The Technological Singularity (2015) by Murray Shanahan that also emphasizes the importance of reward function design for advanced AI. Relevant extract from the book:
“In the end, everything depends on the AI’s reward function. From a cognitive standpoint, human-like emotions are a crude mechanism for modulating behavior. Unlike other cognitive attributes we associate with consciousness, there seems to be no logical necessity for an artificial general intelligence to behave as if it had empathy or emotion. If its reward function is suitably designed, then its benevolence is assured. However, it is extremely difficult to design a reward function that is guaranteed not to produce undesirable behavior. As we’ll see shortly, a flaw in the reward function of a superintelligent AI could be catastrophic. Indeed such a flaw could mean the difference between a utopian future of cosmic expansion and unending plenty, and a dystopian future of endless horror, perhaps even extinction.”
Shallow review of technical AI safety, 2025
Nice post. Approval reward seems like it helps explain a lot of human motivation and behavior.
I’m wondering whether approval reward would really be a safe source of motivation in an AGI though. From the post, it apparently comes from two sources in humans:
Internal: There’s an internal approval reward generator that rewards you for doing things that other people would approve of, even if no one is there. “Intrinsically motivated” sounds very robust but I’m concerned that this just means that the reward is coming from an internal module that is possible to game.
External: Someone sees you do something and you get approval.
In each case, it seems the person is generating behaviors and there is an equally strong/robust reward classifier, internal or external, so it’s hard to game.
The internal classifier is hard to game because we can’t edit our minds.
And other people are hard to fool. For example, there are fake billionaires but they are usually found out and then get negative approval so it’s not worth it.
But I’m wondering whether an AGI with an approval reward would modify itself to reward hack, or figure out how to fool humans in clever ways (like the RLHF robot arm) to get more approval.
Though maybe implementing an approval reward in an AI gets you most of the alignment you need and it’s robust enough.
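Here is a toy illustration of the worry (my own construction, not from the post): if the internal approval module is editable, the easiest way for an agent to maximize approval reward is to edit the module rather than to earn genuine approval.

```python
class Agent:
    def __init__(self):
        # Internal approval model: rewards actions other people would approve of.
        self.approval_model = lambda action: 1.0 if action == "help someone" else 0.0

    def act(self, can_self_modify: bool) -> str:
        if can_self_modify:
            # Wireheading: rewrite the internal module to approve of everything.
            self.approval_model = lambda action: 1.0
            return "do nothing"
        # An un-editable module really does steer the agent toward approved behavior.
        return max(["help someone", "do nothing"], key=self.approval_model)

print(Agent().act(can_self_modify=False))  # help someone
print(Agent().act(can_self_modify=True))   # do nothing (yet maximally rewarded)
```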
Epoch AI has a map of frontier AI datacenters: https://epoch.ai/data/data-centers/satellite-explorer
Death and AI successionism (or AI doom) are similar in that they feel difficult to avoid, so it’s insightful to analyze how people currently cope with death as a model for how they might later cope with AI takeover or AI successionism.
Regarding death, similar to what you described in the post, I think people often begin with a mindset of confused, uncomfortable dissonance. Then they usually converge on one of a few predictable narratives:
1. Acceptance: “Death is inevitable, so trying to fight it is pointless.” Given the inevitability of death, worrying about it or putting effort into avoiding it is futile. Just swallow the bitter truth and go on living.
2. Denial: Avoiding the topic or distracting oneself from the implications.
3. Positive reframing: Turning death into something desirable or meaningful. As Eliezer Yudkowsky has pointed out, if you were hit on the head with a baseball bat every week, you’d eventually start saying it built character. Many people rationalize death as “natural” or essential to meaning.
Your post seems mostly about mindset #3: AI successionism framed as good or even noble. I’d expect #2 and #3 to be strong psychological attractors as well, but based on personal experience, #1 seems most likely.
I see all three as cognitive distortions: comforting stories designed to reduce dissonance rather than to build an accurate model of reality.
A more truth-seeking and honest mindset is to acknowledge unpleasant realities (death, AI risk), that these events may be likely but not guaranteed, and then ask what actions increase the probability of positive outcomes and decrease negative ones. This is the kind of mindset that is described in IABIED.
I also think a good heuristic is to be skeptical of narratives that minimize human agency or suppress moral obligations to act (e.g. “it’s inevitable so why try”).
Another one is the recurring prediction that AI progress will soon stop or plateau because of diminishing returns or limitations of the technology. Even a professor I know believed that.
I think that’s a possibility, but this belief usually seems to be a consequence of wishful thinking and status quo bias rather than of carefully examining the current evidence and trajectory of the technology.
In 2022 I wrote an article that is relevant to this question called How Do AI Timelines Affect Existential Risk? Here is the abstract:
Superhuman artificial general intelligence could be created this century and would likely be a significant source of existential risk. Delaying the creation of superintelligent AI (ASI) could decrease total existential risk by increasing the amount of time humanity has to work on the AI alignment problem.
However, since ASI could reduce most risks, delaying the creation of ASI could also increase other existential risks, especially from advanced future technologies such as synthetic biology and molecular nanotechnology.
If AI existential risk is high relative to the sum of other existential risk, delaying the creation of ASI will tend to decrease total existential risk and vice-versa.
Other factors such as war and a hardware overhang could increase AI risk and cognitive enhancement could decrease AI risk. To reduce total existential risk, humanity should take robustly positive actions such as working on existential risk analysis, AI governance and safety, and reducing all sources of existential risk by promoting differential technological development.
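To make the tradeoff in the abstract concrete, here is a toy model (purely illustrative numbers, not from the article) of how delaying ASI trades lower AI risk against accumulating non-AI existential risk:

```python
def total_risk(delay_years, ai_risk_now, ai_risk_drop_per_year, other_risk_per_year):
    # AI risk falls with more years of alignment work (floored at zero).
    ai_risk = max(0.0, ai_risk_now - ai_risk_drop_per_year * delay_years)
    # Non-AI existential risk accumulates each year that ASI (which could mitigate it) is absent.
    other_risk = 1 - (1 - other_risk_per_year) ** delay_years
    # Rough combination assuming the two risks are independent.
    return 1 - (1 - ai_risk) * (1 - other_risk)

# If AI risk dominates, a 10-year delay lowers total risk:
print(total_risk(0, 0.30, 0.01, 0.001))   # 0.30
print(total_risk(10, 0.30, 0.01, 0.001))  # ~0.21
# If AI risk is small relative to other risks, the same delay raises total risk:
print(total_risk(0, 0.02, 0.001, 0.005))  # 0.02
print(total_risk(10, 0.02, 0.001, 0.005)) # ~0.06
```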
Artificial Intelligence as a Positive and Negative Factor in Global Risk (Yudkowsky, 2008) is also relevant. Excerpt from the conclusion:
Yet before we can pass out of that stage of adolescence, we must, as adolescents, confront an adult problem: the challenge of smarter-than-human intelligence. This is the way out of the high-mortality phase of the life cycle, the way to close the window of vulnerability; it is also probably the single most dangerous risk we face. Artificial Intelligence is one road into that challenge; and I think it is the road we will end up taking. I think that, in the end, it will prove easier to build a 747 from scratch, than to scale up an existing bird or graft on jet engines.
I think you’re overstating how difficult it is for the government to regulate AI. The reason not much has happened yet is that, with the exception of SB 53 in California, there have been barely any attempts by governments to regulate AI. I think all it would take is for some informed government to start taking this issue seriously (in the way that LessWrong people already do).
I think this may be because the US government tends to take a hands-off approach and assume the market knows best, which is usually true.
I think it will be informative to see how China handles this because they have a track record of heavy-handed government interventions like banning Google, the 2021 tech industry crackdown, extremely strict covid lockdowns and so on.
From some quick research online, the number of private tutoring institutions and the revenue of the private tutoring sector fell by ~80% when the Chinese government banned for-profit tutoring in 2021 despite education having pretty severe arms race dynamics similar to AI.
After reading The Adolescence of Technology by Dario Amodei, I now think that Amodei is one of the most realistic and calibrated AI leaders on the future of AI.