Software Engineer interested in AI and AI safety.
Stephen McAleese
Other examples: chemical and biological weapons.
One simple solution I can think of is to rate-limit user posting to something like one LW post per day. Then even if people used LLMs to write posts, they would have to be selective about which ones to post, which would increase quality.
Regarding the point on generalization and inner misalignment using the CoinRun example, I recently found a nice 2018 OpenAI paper called Quantifying Generalization in Reinforcement Learning. Quote from the paper:
Results are shown in Figure 2a. We collect each data point by averaging the final agent’s performance across 10,000 episodes, where each episode samples a level from the appropriate set. We can see that substantial overfitting occurs when there are less than 4,000 training levels. Even with 16,000 training levels, overfitting is still noticeable. Agents perform best when trained on an unbounded set of levels,
when a new level is encountered in every episode.

The key finding is that overfitting and poor generalization are more likely when the RL agent is trained on an insufficient variety of training data, and generalization is much better when the agent is trained on a large number of levels.
This can be explained using the unidentifiability concept mentioned in the post:
When there is limited training data, many different policies including ones that generalize poorly (like always going to the end of the level) achieve the same reward in the training environment. In this case, all these policies are behaviorally indistinguishable from the intended “go to the coin” strategy.
When the training data is much larger and more diverse, simple but poorly generalizing policies begin to score poorly on the training data and are ruled out by the training process, and the robust, generalizing policy that internalizes the reward function (and is more inner-aligned) becomes identifiable.
To summarize: as the amount and diversity of the training data increase, the policy produced by the training process is more likely to generalize.
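To make this concrete, here's a toy sketch (the setup is entirely made up for illustration; it's not the real CoinRun environment): two candidate policies get identical reward when the coin is always at the end of the level, but randomizing the coin's position separates them.

```python
import random

random.seed(0)

def reward(policy, coin_pos):
    # Reward 1 if the policy reaches the coin, else 0.
    if policy == "go_to_coin":
        return 1.0  # always navigates to the coin, wherever it is
    if policy == "go_right":
        # Only reaches the coin if it happens to be at the end of the level.
        return 1.0 if coin_pos > 0.95 else 0.0

def average_reward(policy, levels):
    return sum(reward(policy, pos) for pos in levels) / len(levels)

# Narrow training set: the coin is always at the end, so both policies are
# behaviorally indistinguishable and training cannot identify the intended one.
narrow = [1.0] * 100

# Diverse training set: the coin's position is randomized, so the proxy
# policy ("go right") scores poorly and gets ruled out by training.
diverse = [random.uniform(0.0, 1.0) for _ in range(100)]

for name, levels in [("narrow", narrow), ("diverse", diverse)]:
    scores = {p: average_reward(p, levels) for p in ("go_to_coin", "go_right")}
    print(name, scores)
```

On the narrow set both policies score 1.0 (unidentifiable); on the diverse set only "go to the coin" keeps its score.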
What if the action that most benefits the billions of people alive today is slowing down AI development?
I think this debate is oversimplifying the situation by focusing on the trade-off between two different factors:
Benefits to people currently alive such as better medicine
Harm to future people via extinction
The situation is more complex and involves:
Benefits to current people
Harms to current people
Benefits to future people
Harms to future people
I think we should generally take actions that benefit the billions of people alive today, or people who will soon exist, rather than assuming that everyone alive today should get negligible weight in the utilitarian calculus because of highly speculative considerations about what might occur in millions of years.
Even if we focus solely on people currently alive, pausing or regulating AI progress could be net positive: although current people stand to benefit from scientific breakthroughs enabled by AI, fast AI development also creates risks like mass unemployment, disruption of elections, concentration of power, increasing inequality, and generally making it hard for governments to react to the situation. Human extinction would also negatively affect current people by curtailing their lifespans.
I doubt the race to build AGI is best explained by a desire to build AI that benefits the lives of people alive today or in the future.
Instead, a better explanation seems to be individuals, companies, or researchers wanting the prestige of developing breakthroughs, big tech companies simply trying to make a profit, and an arms race dynamic between AI companies caused by a lack of coordination to control the rate of AI progress.
Building ASI to reduce existential risk is only net-positive if the step risk associated with ASI is less than the state risk of the status quo accumulated over several decades, which is doubtful.
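As a rough sketch with made-up numbers: if the annual state risk under the status quo is $p$, the cumulative state risk over $T$ years is

$$P_{\text{state}} = 1 - (1 - p)^T$$

With $p = 1\%$ per year and $T = 30$ years, this comes to about 26%, so on this criterion building ASI would only be net-positive if its one-time step risk were below roughly 26%.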
I think what he’s saying is that he and others have been promoting the idea of an impersonal longtermist view a lot over the past few years but that he has moral uncertainty and wants to consider other views. So he wrote a paper about the future of AI using a radically different perspective (the person-affecting view) even though he may not agree with it.
Though, as you said, if he really favors a more impersonal view, then he could have done a better job of communicating that in the paper.
Nick Bostrom has a new paper called Optimal Timing for Superintelligence.
Abstract:
Developing superintelligence is not like playing Russian roulette; it is more like undergoing risky surgery for a condition that will otherwise prove fatal. We examine optimal timing from a person-affecting stance (and set aside simulation hypotheses and other arcane considerations). Models incorporating safety progress, temporal discounting, quality-of-life differentials, and concave QALY utilities suggest that even high catastrophe probabilities are often worth accepting. Prioritarian weighting further shortens timelines. For many parameter settings, the optimal strategy would involve moving quickly to AGI capability, then pausing briefly before full deployment: swift to harbor, slow to berth. But poorly implemented pauses could do more harm than good. -- Optimal Timing for Superintelligence

Some friends and I were wondering why he seems to be emphasizing a person-affecting argument today (focused on the benefits to people alive) whereas previously he emphasized a more impersonal argument (focused on future people not alive today). For example, the last sentence of Superintelligence (2014) is this:
In this book, we have attempted to discern a little more feature in what is otherwise still a relatively amorphous and negatively defined vision—one that presents as our principal moral priority (at least from an impersonal and secular perspective) the reduction of existential risk and the attainment of a civilizational trajectory that leads to a compassionate and jubilant use of humanity’s cosmic endowment. -- Superintelligence
My humble opinion is: this claim is probably right, and nevertheless the IABIED thesis is probably also right.
This claim seems contradictory because part of the IABIED view is that human values are unlikely to be internalized in an ASI's preferences. Though it's true if you define the "IABIED thesis" as simply "ASI is likely to cause human extinction".
In my opinion, the IABIED thesis is not just that ASI is an extinction risk, but also the claim that this would occur because of the ASI ending up with alien values misaligned with human survival. In other words, it’s a specific argument about the nature of AI training and AI value construction that leads to human extinction.
These two claims are different:
This seems to be the view you are describing: AI is an extinction risk because it will imitate the worst of human values.
This is the actual IABIED view in my opinion: AI is an extinction risk because it would have unpredictable alien values that are indifferent to human survival.
Thanks for the detailed comment.
Outside of toy situations, it’s rare in modern ML training for the solution with the lowest loss on the training data to actually be underdetermined in a meaningful sense.
Would you mind elaborating on what you mean by this? My guess is that you’re saying that in a real-world situation with sufficiently diverse data and enough training, catastrophic misgeneralization or overfitting like the CoinRun example in the post is rare.
Though there is also the problem of adversarial examples (e.g. jailbreaks) where the model performs okay on normal test distribution data but poorly on OOD examples that are designed to trick the model.
After reading The Adolescence of Technology by Dario Amodei, I now think that Amodei is one of the most realistic and calibrated AI leaders on the future of AI.
I think the level of disagreement among the experts implies that there is quite a lot of uncertainty, so the key question is how to steer the future toward better outcomes while reasoning and acting under that uncertainty.
The framing I currently like best is from Chris Olah’s thread on probability mass over difficulty levels.
The idea is that you have initial uncertainty and a distribution that assigns probability mass to different levels of alignment difficulty.
The goal is to develop new alignment techniques that "eat marginal probability": over time, the most effective alignment and safety techniques handle the optimistic easy cases, then the medium and hard cases, and so on. I also think the right approach is to favor actions that have positive expected value and are beneficial across a range of different possible scenarios.
Meanwhile, the goal should be to acquire new evidence that reduces uncertainty and concentrates probability mass on specific possibilities. I think the best way to do this is the scientific method: propose hypotheses and then test them experimentally.
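As a toy numerical sketch of the "eating marginal probability" idea (the difficulty buckets, probabilities, and technique names here are all invented for illustration):

```python
# Prior probability mass over how hard alignment turns out to be.
prior = {"easy": 0.3, "medium": 0.4, "hard": 0.2, "very hard": 0.1}

# Hypothetical techniques, each able to handle scenarios up to some difficulty.
techniques = [
    ("basic oversight",         ["easy"]),
    ("scalable oversight",      ["easy", "medium"]),
    ("interpretability audits", ["easy", "medium", "hard"]),
]

# Each stronger technique "eats" the marginal probability mass left uncovered
# by the previous one.
for name, covered in techniques:
    p_handled = sum(prior[d] for d in covered)
    print(f"{name}: handles {p_handled:.0%} of the difficulty distribution")
```

The point of the framing is that progress looks like pushing coverage rightward through the distribution, not solving a single fixed problem.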
Thank’s for pointing that out and for the linked post!
I’d say the conclusion is probably the weakest part of the post because after describing the IABIED view and the book’s critics I found it hard to reconcile the two views.
I tried getting Gemini to write the conclusion but what it produced seemed even worse: it suggested that we treat AI like any other technology (e.g. cars, electricity) where doomsday forecasts are usually wrong and the technology can be made safe in an iterative way which seems too optimistic to me.
I think my conclusion was an attempt to find a middle ground between the authors of IABIED and the critics by treating AI as a risky but not world-ending technology.
(I’m still not sure what the conclusion should be)
Yeah that’s probably true and it reminds me of Plank’s principle. Thanks for sharing your experience.
I like to think that this doesn’t apply to me and that I would change my mind and adopt a certain view if a particularly strong argument or piece of evidence supporting that view came along.
It’s about having a scout mindset and not a soldier mindset: changing your mind is not defeat, it’s a way of getting closer to the truth.
I like this recent tweet from Sahil Bloom:
I’m increasingly convinced that the willingness to change your mind is the ultimate sign of intelligence. The most impressive people I know change their minds often in response to new information. It’s like a software update. The goal isn’t to be right. It’s to find the truth.
The book Superforecasting also has a similar idea: the best superforecasters are really good at constantly updating based on new information:
The strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement. It is roughly three times as powerful a predictor as its closest rival, intelligence.
Yes, I agree [1]. At first, I didn’t consider writing this post because I assumed someone else would write a post like it first. The goal was to write a thorough summary of the book’s arguments and then analyze them and the counterarguments in a rigorous and unbiased way. I didn’t find a review that did this so I wrote this post.
Usually a book review just needs to give a brief summary so that readers can decide whether or not they are interested in reading the book, and there are a few IABIED book reviews like this.
But this post is more an analysis of the arguments and counterarguments than a book review. I wanted a post like this because the book's arguments have really high stakes, and it seems like the right thing to do for a third party to review and analyze them in a rigorous and high-quality way.
- ^
Though I may be biased.
IABIED Book Review: Core Arguments and Counterarguments
For human learning, the outer objective in the brain is maximizing hard-coded reward signals and minimizing pain; the brain's inner objectives are the specific habits and values that determine behavior directly and are somewhat aligned with the goals of maximizing pleasure and minimizing pain.
I agree with your high-level view, which is something like: "If you create a complex system you don't understand, then you will likely get unexpected and undesirable behavior from it."
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
The nature of misalignment in EM: Instead of being misaligned in a complex, alien and inscrutable way, emergently misaligned models exhibit behavior that is more like a cartoon villain. The AI does hate you.
Personas: Papers on the subject emphasize the novel ‘personas’ concept: human-recognizable, coherent characters or personalities that are learned in pretraining.
Importance of pretraining data: Rather than being caused by a flawed reward function or incorrect inductive biases, EM is ultimately caused by documents in the pretraining data describing malicious characters (see also: Data quality for alignment).
EM can occur in base models: “This rules out explanations of emergent misalignment that depend on the model having been post-trained to be aligned.” (quote from the original EM paper)
Realignment, the opposite of EM: after a model has been fine-tuned on insecure code to trigger emergent misalignment, fine-tuning it back on secure code realigns it, and this realignment generalizes surprisingly strongly after just a few examples.
Yes, I agree that RLHF reward hacking and emergent misalignment are both examples of unexpected generalization, but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with “You’re right!”).
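As a toy sketch of that three-step loop (the features and numbers are invented; this is not any lab's actual pipeline), a linear reward model fit on thumbs-up data ends up rewarding agreeableness, and a policy that maximizes its score picks the most agreeable response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: simulated user feedback. Each response has two features:
# [agreeableness, factual_accuracy]. Users thumbs-up agreeable responses
# more often, regardless of accuracy.
responses = rng.uniform(0, 1, size=(1000, 2))
p_thumbs_up = 0.3 + 0.5 * responses[:, 0]  # driven mostly by agreeableness
thumbs_up = rng.uniform(size=1000) < p_thumbs_up

# Step 2: fit a linear "reward model" on the feedback via least squares.
X = np.column_stack([responses, np.ones(len(responses))])
w, *_ = np.linalg.lstsq(X, thumbs_up.astype(float), rcond=None)

# Step 3: the "policy" proposes candidate responses and picks the one the
# reward model scores highest -- which ends up being the most agreeable one.
candidates = rng.uniform(0, 1, size=(5, 2))
scores = np.column_stack([candidates, np.ones(5)]) @ w
best = candidates[scores.argmax()]
print("reward model weights (agreeableness, accuracy, bias):", w.round(2))
print("chosen response features:", best.round(2))  # high agreeableness wins
```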
In the original emergent misalignment paper they:
1. Finetune the model to output insecure code.
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.

Sycophantic AI doesn’t seem that surprising because it’s a special case of reward hacking in the context of LLMs, and reward hacking isn’t new. For example, reward hacking has been observed since the CoastRunners boat reward hacking example in 2016.
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
The resulting behavior would likely receive low reward under a standard HHH RLHF reward model, which makes it closer to goal misgeneralization than to exploiting a proxy for high reward.
EM is not generally explained or mitigated using standard RL terminology; the literature emphasizes novel concepts like “personas” and mitigation techniques such as “inoculation prompting” that seem specific to LLMs.
The apparent cause of EM is the amplification of misaligned internal persona directions that already exist in the base model and originate from the pretraining data; fine-tuning or prompting can trigger them, causing the model to adopt a globally misaligned persona.
OpenAI wrote in a blog post that the GPT-4o sycophancy incident in early 2025 was a consequence of excessive reward hacking on thumbs-up/thumbs-down data from users.
I think “Sydney” Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I’d consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
Some thoughts on why human-written posts and comments are often more fun and interesting to read than LLM-written ones despite the fact that LLM writing is superior in a lot of ways (e.g. vocabulary size):
Succinctness: Because typing is effortful, human text is forced to convey a lot of meaning per word, which makes it easier to read. LLM text in its current form is often too verbose.
Selectivity: People tend to selectively post content they think is interesting. In contrast, since LLMs do not have a reward function for interestingness, their content is often too bland to read.