Software Engineer interested in AI and AI safety.
Stephen McAleese
Some of my thoughts on avoiding the intelligence curse or gradual disempowerment and ensuring that humans stay relevant:
One solution is to ensure that the gap between human and AI intelligence does not grow too large:
I think it’s often easier to verify solutions than to generate them, which allows less intelligent agents to supervise more intelligent agents. For example, writing a complex computer program might take 10 hours, but checking the code generally takes ~1 hour, and running the program to see if it behaves as expected only takes a few minutes. This goal could be achieved by limiting the intelligence of AIs or by enhancing human cognitive ability somehow.
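To make the asymmetry concrete, here is a toy sketch (my own example, not from the original discussion): finding a subset of numbers that sums to a target requires an expensive search, while checking a proposed answer is cheap.

```python
from itertools import chain, combinations

def generate_solution(numbers, target):
    # Brute-force search over all subsets: exponential in len(numbers).
    subsets = chain.from_iterable(combinations(numbers, r) for r in range(len(numbers) + 1))
    return next((s for s in subsets if sum(s) == target), None)

def verify_solution(numbers, target, candidate):
    # Cheap check: linear in the size of the candidate.
    return candidate is not None and sum(candidate) == target

nums, target = [3, 34, 4, 12, 5, 2], 9
solution = generate_solution(nums, target)      # the slow "generation" step
print(verify_solution(nums, target, solution))  # the fast "verification" step -> True
```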
Devise ways of giving humans a privileged status:
AI agents and their outputs will soon vastly outnumber those of humans. Additionally, it’s becoming impossible to distinguish the outputs of AIs from those of humans.
One solution to this problem is to make humans more identifiable, either by watermarking AI outputs (note that watermarks are widely used for paper money) or by developing strong proofs of human identity (e.g. the blue Twitter checkmark, iPhone Face ID, fingerprint login). This approach is similar to authentication, which is a well-known security problem.
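As a rough sketch of the authentication framing (my own illustration, not a real proof-of-personhood scheme), a trusted verifier could issue a secret key only to people it has confirmed are human, and content could then carry a tag that only key-holders can produce:

```python
import hashlib
import hmac

# Hypothetical key issued by a trusted verifier after confirming someone is human.
HUMAN_KEY = b"secret-issued-after-human-verification"

def tag_as_human(content: bytes) -> str:
    # Only holders of the verifier-issued key can produce a valid tag.
    return hmac.new(HUMAN_KEY, content, hashlib.sha256).hexdigest()

def verify_human_tag(content: bytes, tag: str) -> bool:
    expected = hmac.new(HUMAN_KEY, content, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)

post = b"An essay written by a verified human."
tag = tag_as_human(post)
print(verify_human_tag(post, tag))  # True
```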
A short-term solution to differentiating between humans and AIs is to conduct activities in the physical world (although this won’t work once sufficiently advanced humanoid robots are developed). For example, voting, exams, and interviews can be carried out in the real world to ensure that participants are human.
Once you have solved the problem of differentiating between AI and human outputs, you could upweight the value of human outputs (e.g. writing, art).
After spending some time chatting with Gemini, I’ve learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex, stable values:
The “goal-content integrity” argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:
A model of its own values and how they can change.
A meta-preference for keeping its current values stable, even if changing them could lead to more “reward” as defined by its immediate reward signal.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem, and maintaining a connection between effort and reward, which makes the reward button less appealing to us than it would be to a standard model-based RL AGI.
Thanks for the clarifying comment. I agree with block-quote 8 from your post:
Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic…target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
I think what you’re saying is that we want the AI’s reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money which can easily be stolen.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button, but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn’t want to change those values for the sake of goal-content integrity:
We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
Though maybe the AI would just prefer the button when it finds it because it yields higher reward.
For example, if you punish cheating on tests, students might learn the value “cheating is wrong” and never cheat again, or form a habit of not doing it. Or they might only stop temporarily, cheating again when there is an opportunity to do so without negative consequences (e.g. the teacher leaves the classroom).
I also agree that “intrinsic” and “instrumental” motivation are more useful categories than “intrinsic” and “extrinsic” for the reasons you described in your comment.
I’m trying to understand how the RL story from this blog post compares with the one in Reward is not the optimization target.
Thoughts on Reward is not the optimization target
Some quotes from Reward is not the optimization target:
Suppose a human trains an RL agent by pressing the cognition-updater button when the agent puts trash in a trash can. While putting trash away, the AI’s policy network is probably “thinking about”[5] the actual world it’s interacting with, and so the cognition-updater reinforces those heuristics which lead to the trash getting put away (e.g. “if trash-classifier activates near center-of-visual-field, then grab trash using motor-subroutine-#642”).
Then suppose this AI models the true fact that the button-pressing produces the cognition-updater. Suppose this AI, which has historically had its trash-related thoughts reinforced, considers the plan of pressing this button. “If I press the button, that triggers credit assignment, which will reinforce my decision to press the button, such that in the future I will press the button even more.”
Why, exactly, would the AI seize[6] the button? To reinforce itself into a certain corner of its policy space? The AI has not had antecedent-computation-reinforcer-thoughts reinforced in the past, and so its current decision will not be made in order to acquire the cognition-updater!
My understanding of this RL training story is as follows:
A human trains an RL agent by pressing the cognition-updater (reward) button immediately after the agent puts trash in the trash can.
Now the AI’s behavior and thoughts related to putting away trash have been reinforced, so it continues those behaviors in the future, values putting away trash, and isn’t interested in pressing the reward button except by accident:
But what if the AI bops the reward button early in training, while exploring? Then credit assignment would make the AI more likely to hit the button again. 1. Then keep the button away from the AI until it can model the effects of hitting the cognition-updater button. 2. For the reasons given in the “siren” section, a sufficiently reflective AI probably won’t seek the reward button on its own.
The AI has the option of pressing the reward button but by now it only values putting trash away so it avoids pressing the button to avoid having its values changed:
I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
Thoughts on Reward button alignment
The training story in Reward button alignment is different and involves:
Pressing the reward button immediately after showing the AI a video of the button being pressed. The button-pressing situation is reinforced and the AI comes to intrinsically value the button being pressed.
Asking the AI to complete a task (e.g. put away trash) and promising to press the reward button if it completes the task.
The AI completes the task not because it values the task, but because it ultimately values pressing the reward button after completing the task.
Thoughts on the differences
The TurnTrout story sounds more like the AI developing intrinsic motivation: the AI is rewarded immediately after completing the task and values the task intrinsically. The AI puts away trash because it was directly rewarded for that behavior in the past and doesn’t want anything else.
In contrast the reward button alignment story is extrinsic. The AI doesn’t care intrinsically about the task but only does it to receive a reward button press which it does value intrinsically. This is similar to a human employee who completes a boring task to earn money. The task is only a means to an end and they would prefer to just receive the money without completing the task.
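To make the structural difference concrete, here is a toy sketch (purely illustrative, not from either post) of the two reward functions; under the button scheme the reward-maximizing action skips the task entirely:

```python
def direct_task_reward(action: str) -> float:
    # TurnTrout-style story: reward arrives immediately after the task itself.
    return 1.0 if action == "put_trash_away" else 0.0

def reward_button_reward(action: str) -> float:
    # Reward-button story: the reward is attached to the button press, not the task.
    return 1.0 if action == "press_button" else 0.0

actions = ["put_trash_away", "press_button", "do_nothing"]
for reward_fn in (direct_task_reward, reward_button_reward):
    best_action = max(actions, key=reward_fn)
    print(f"{reward_fn.__name__}: reward-maximizing action = {best_action}")
# Under the button scheme, the reward-maximizing action skips the task entirely.
```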
Maybe a useful analogy is humans who are intrinsically or extrinsically motivated. For example, someone might write books to make money (extrinsic motivation) or because they enjoy it for its own sake (intrinsic motivation).
For the intrinsically motivated person, the sequence of rewards is:
Spend some time writing the book.
Immediately receive a reward from the process of writing.
Summary: fun task --> reward
And for the extrinsically motivated person, the sequence of rewards is:
The person enjoys shopping and learns to value money because they find using it to buy things rewarding.
The person is asked to write a book for money. They don’t receive any intrinsic reward (e.g. enjoyment) from writing the book but they do it because they anticipate receiving money (something they do value).
They receive money for the task.
Summary: boring task --> money --> reward
The second sequence is not safe because the person is motivated to skip the task and steal the money. The first sequence (intrinsic motivation) is safer because the task itself is rewarding (though wireheading is a similar risk), so the person isn’t as motivated to manipulate the task.
So my conclusion is that trying to build intrinsically motivated AI agents by directly rewarding them for tasks seems safer and more desirable than building extrinsically motivated agents that receive some kind of payment for doing work.
One reason to be optimistic is that it should be easier to get AIs to value doing useful tasks by rewarding them directly for completing those tasks (though goal misgeneralization is a separate issue). The same is generally not possible with humans: it’s hard to teach someone to be passionate about boring tasks like washing the dishes, so we just have to pay people to do them.
Writing a book is an excellent idea! I found other AI books like Superintelligence much more convenient and thorough than navigating blog posts. I’ve pre-ordered the book and I’m looking forward to reading it when it comes out.
RLHF is another method that improves with more compute. When a policy is optimized against a reward model, the reward continues to increase but the true reward eventually decreases due to reward hacking. This manifests as a divergence between the proxy and true reward.
The Scaling Laws for Reward Model Overoptimization paper shows that when the policy size is held constant and the reward model size is increased, the gold and proxy rewards diverge less as the policy is optimized. This suggests that larger reward models trained on more data are more robust and less prone to reward hacking.
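To make this picture concrete, here is a rough sketch of the proxy/gold divergence (my own illustration with made-up coefficients; the gold-reward form d·(α − β·log d) is my reading of the paper’s RL fit, and the linearly increasing proxy is a simplification, not the paper’s fit):

```python
import numpy as np

def gold_reward(d, alpha=1.0, beta=0.3):
    # Peaks and then declines as optimization pressure grows: reward hacking.
    return d * (alpha - beta * np.log(np.maximum(d, 1e-9)))

def proxy_reward(d, alpha=1.0):
    # Simplification: the reward-model (proxy) score keeps increasing.
    return alpha * d

d = np.linspace(0.0, 50.0, 500)  # d = sqrt(KL divergence from the initial policy)
peak = d[np.argmax(gold_reward(d))]
print(f"gold reward peaks at d = {peak:.1f}, then declines while the proxy keeps rising")
# A smaller beta (roughly, a larger reward model in the paper) pushes this peak further out.
```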
A key point is that some aspects of interpretability would be easier to automate than others. For example, maybe it’s possible to automate the evaluation of faithfulness: whether the explanations provided by the interpretability technique accurately capture the underlying computation in the model.
An important challenge here is that interpretability can be subjective and there’s a need to ensure that improved performance on some metric is correlated with increased comprehensibility to the humans trying to understand the model.
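As a hypothetical sketch of how a faithfulness check could be automated (my own illustration, not an established metric): ablate the features an explanation marks as important and test whether the model’s output actually changes in proportion to the claimed importance.

```python
import numpy as np

def model(x: np.ndarray) -> float:
    # Stand-in model: a fixed linear function of the input features.
    weights = np.array([2.0, 0.0, -1.0, 0.5])
    return float(weights @ x)

def faithfulness(explanation_importance: np.ndarray, x: np.ndarray) -> float:
    # Zero out each feature, record the true change in output, and correlate
    # that with the importance the explanation assigned to the feature.
    true_effects = []
    for i in range(len(x)):
        x_ablated = x.copy()
        x_ablated[i] = 0.0
        true_effects.append(abs(model(x) - model(x_ablated)))
    return float(np.corrcoef(explanation_importance, true_effects)[0, 1])

x = np.array([1.0, 1.0, 1.0, 1.0])
good_explanation = np.array([2.0, 0.0, 1.0, 0.5])  # matches the model's true weights
poor_explanation = np.array([0.1, 2.0, 0.1, 0.1])  # mostly wrong
print(faithfulness(good_explanation, x), faithfulness(poor_explanation, x))
```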
Some quick thoughts on which types of companies are net positive, negative or neutral:
Net positive:
Focuses on interpretability, alignment, evals, security (e.g. Goodfire, Gray Swan, Conjecture, Deep Eval).
Net negative:
Directly intends to build AGI without a significant commitment to invest in safety (e.g. Keene Technologies).
Shortens timelines or worsens race dynamics without any other upside to compensate for that.
Neutral / debatable:
Applies AI capabilities to a specific problem without generally increasing capabilities (e.g. Midjourney, Suno, Replit, Cursor).
Keeps up with the capabilities frontier while also having a strong safety culture and investing substantially in alignment (e.g. Anthropic).
I don’t like the idea. I think large-scale job displacement is potentially net negative. In addition, economic forces promoting automation mean it’s probably inevitable, so intentionally putting more resources into this area seems to have low counterfactual impact.
In contrast, technical alignment work seems net positive and not inevitable, because the externalities that x-risk imposes on society mean AI companies will tend to underinvest in this area.
This is a great essay and I find myself agreeing with a lot of it. I like how it fully accepts the possibility of doom while also being open to and prepared for other possibilities. Handling uncertainty, emotions, and cognitive biases, and building self-awareness while trying to find the truth, are all skills aspiring rationalists should aim to develop, and the essay demonstrates them.
A recent essay called “Keep the Future Human” made a compelling case for avoiding building AGI in the near future and building tool AI instead.
The main point of the essay is that AGI is the intersection of three key capabilities:
High autonomy
High generality
High intelligence
It argues that these three capabilities are dangerous when combined, pose unacceptable risks to humanity’s job market and culture, and would replace rather than augment humans. Instead of building AGI, the essay recommends building powerful but controllable, tool-like AI that has only one or two of the three capabilities. For example:
Driverless cars are autonomous and intelligent but not general.
AlphaFold is intelligent but not autonomous or general.
GPT-4 is intelligent and general but not autonomous.
It also recommends compute limits to limit the overall capability of AIs.
The risks of nuclear weapons, the most dangerous technology of the 20th century, were largely managed by creating a safe equilibrium via mutual assured destruction (MAD), an innovative idea from game theory.
A similar pattern could apply to advanced AI, making it valuable to explore game theory-inspired strategies for managing AI risk.
Thanks for these thoughtful predictions. Do you think there’s anything we can do today to prepare for accelerated or automated AI research?
I agree that the Alignment Forum should be selective, and its members probably represent a small subset of LessWrong readers. That said, useful comments from regular LessWrong users are often promoted to the Alignment Forum.
However, I do think there should be more comments on the Alignment Forum because many posts currently receive none. This can be discouraging for authors, who may feel that their work isn’t being read or appreciated.
Thank you for bringing up this issue.
While we don’t want low-quality comments, comments can provide helpful feedback to the author and clarify the reader’s thinking. Because of these benefits, I believe commenting should be encouraged.
The upvoting and downvoting mechanisms help filter out low-quality comments, so I don’t think there’s a significant risk of them overwhelming the discussion.
“Maybe a slight tweak to the LLM architecture, maybe a completely novel neurosymbolic approach.”
I think you might be underestimating the power of incremental, evolutionary improvements, where near-term problems are constantly solved and capability gradually accumulates. After all, human intelligence is the result of gradual evolutionary change and increasing capabilities over time. It’s hard to point to a specific period in history when humans achieved general intelligence.
Currently, LLMs are undoubtedly capable at many tasks (e.g. coding, general knowledge) and much more capable than their predecessors. But it’s hard to point to any particular algorithmic improvement or model and say that it was key to the success of modern LLMs.
So I think it’s possible that we’ll see more gradual progress and tweaks on LLMs that lead to increasingly capable models and eventually yield AGI. Eventually you might call the result a new architecture even though all the progress was gradual.
I know using LLMs on LessWrong is often frowned upon (probably for good reasons) but given that this post is about using AIs to generate and evaluate AI research I thought it would be appropriate to use an LLM here.
I asked o1 pro to evaluate this post and this was its response:
This post offers an intriguing perspective on delegating the hardest parts of AI safety research to AI itself, rather than expecting human researchers to handle every step all the way up to superintelligence. The central idea—“passing the buck” to AI—strikes me as a pragmatic stance if you anticipate that AI systems will soon outstrip humans in many relevant research and engineering capacities. Notably, the author does not claim that autonomous “buck passing” is trivially safe or guaranteed to work out of the box; instead, they provide a structured framework for why it could improve safety relative to keeping humans in the loop.
A central strength is the post’s recognition that AI developers might reach a point where human oversight becomes practically irrelevant: sheer speed and complexity of AI-led R&D can eclipse what any human team can meaningfully supervise. The proposed solution is a multi-layered approach of alignment, control, and incentive design—each method contributing partial safeguards that, ideally, reinforce each other. The emphasis on continuous testing, local verifiability, iterative evaluations, and “multi-party integrity” structures also underscores a realistic appreciation that no single safety measure is foolproof.
That said, many open challenges remain. The post takes on big assumptions: for example, that we can rely on thorough testing or incremental “bootstrapping” to detect (and correct) misalignment before it becomes catastrophic. While such approaches are viable in principle, adversarial dynamics at very high capability levels can be harder to test than the post acknowledges. Another concern is that the solution is still highly conceptual; significant engineering details would be required to make “buck passing” arguments robust against advanced adversarial behavior. Overall, though, the post is refreshingly direct about its core premise—using AI to solve the alignment problem—and outlines multiple thoughtful ways to try to make that premise safe.
I thought its response was pretty helpful and I would rate it as 7/10.
Thanks for the post. I thought it was interesting and considered a wide variety of risks and mitigation strategies. I thought it made a decent argument that delegating the superalignment problem to advanced AIs could be possible using a mixture of alignment and control methods.
The risk I’m most concerned with is alignment faking where agents intelligently hide misalignment to appear more safe than they are.
It seems like the post proposes a strategy where once it has been verified that the AIs are behaving safely, they are deployed to carry out harder tasks or more capable AIs are deployed.
The issue is that alignment faking breaks that feedback loop: if the AIs are deceptive and only appear to be aligned and useful, then the verification steps don’t really have any value and you end up deploying increasingly powerful misaligned AIs.
The post describes alignment faking as being caused by a distribution shift where the task to be performed is significantly harder than the evaluation task:
Not-alignment-faking arguments require that there is not an extreme distribution shift between evaluation tasks and tasks for which regulators intend to defer to the AI advisors (section 5.4.4). This is problematic because deference is most useful when developers can’t evaluate tasks on their own. For example, developers might ask an AI advisor if thousands of pages of empirical ML research provide evidence that an AI system is safe.
Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.
The post describes bootstrapping as a potential solution. Here are some quotes from the paper:
Bootstrapping can be used to scale not-alignment-faking arguments up to hard-to-evaluate tasks. Bootstrapping involves utilizing weaker, trusted AI systems to recursively develop more powerful AI systems so that there is never an extreme distribution shift.
The diagram above shows how not-alignment-faking arguments could be bootstrapped up to strongly superhuman AI systems. This particular strategy is called iterated amplification (Christiano et al., 2018) [1]. AI systems can be ‘amplified’ by creating additional copies of them or speeding them up.
It seems like this bootstrapping strategy involves improving both the generator and the verifier using an amplification and distillation process.
This is how the IDA blog post describes IDA:
Step 1:
- A[0] is the original AI.
- Amplify(H, A[0]) is a human + a large number of copies of A[0] that verifies the behavior of A[0].
- A[1] is trained to imitate Amplify(H, A[0]).

Step 2:
- A[1] produces an output.
- Amplify(H, A[1]) is a human + a large number of copies of A[1] that verifies the behavior of A[1].
- A[2] is trained to imitate Amplify(H, A[1]).

Step 3:
...
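As a minimal sketch of this amplification-and-distillation loop (a toy illustration with placeholder classes, not an actual IDA implementation; a real scheme would distill into a fresh model rather than update the same one in place):

```python
from dataclasses import dataclass, field

@dataclass
class ToyModel:
    # The "model" is just a lookup table from questions to answers.
    answers: dict = field(default_factory=dict)

    def __call__(self, question: str) -> str:
        return self.answers.get(question, "unknown")

    def fit(self, pairs: dict) -> None:
        # Distillation step: imitate the amplified system's answers.
        self.answers.update(pairs)

class ToyHuman:
    def decompose(self, question: str, n: int) -> list:
        # A real human would break the question into easier sub-questions.
        return [f"{question} (sub-question {i})" for i in range(n)]

    def combine(self, question: str, sub_answers: list) -> str:
        # A real human would synthesize the sub-answers into a better overall answer.
        return f"answer to '{question}' built from {len(sub_answers)} sub-answers"

def amplify(human: ToyHuman, model: ToyModel, question: str, n_copies: int = 4) -> str:
    """Amplify(H, A): a human assisted by several copies of the current model."""
    sub_answers = [model(q) for q in human.decompose(question, n_copies)]
    return human.combine(question, sub_answers)

def iterated_amplification(human: ToyHuman, model: ToyModel, questions: list, n_rounds: int = 3) -> ToyModel:
    for _ in range(n_rounds):
        # A[t+1] is trained to imitate Amplify(H, A[t]).
        targets = {q: amplify(human, model, q) for q in questions}
        model.fit(targets)
    return model

model = iterated_amplification(ToyHuman(), ToyModel(), ["summarize this paper", "review this email"])
print(model("review this email"))
```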
This section of the post describes AIs checking the behavior of each other and seems to have a similar idea:
This means that M_1 agents will initially try to maintain their alignment. For example, M_1 agents might constantly interrogate each other, and run checks for alignment faking – iteratively increasing the complexity of behavioral testing data at each time step.
A later section describes monitoring AIs with other AIs and preventing collusion between the monitor and the actor:
AI agents are unable to make trades with each other. AI agents might have different goals and still make agreements with successors (perhaps verified with lie detectors). One way to prevent these agreements is to block communication between monitors and the agents they monitor. That way, monitors can never express what outcomes they care about.
The paper has some ideas for preventing alignment faking as well:
Externalized reasoning arguments claim that AI systems are incapable of performing reasoning necessary to fake alignment without revealing this reasoning in their chain-of-thought outputs.
Interpretability arguments justify that suspicious reasoning is not occurring within model internals.
I think it would be useful to have a post or paper focused on defining alignment faking and mitigation strategies.
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.
Thanks for the guide, ARENA is fantastic and I highly recommend it for people interested in learning interpretability!
I’m working through the ARENA course now. I completely skipped week 0 because I’ve done similar content in other courses and at university, and I’m on the Week 1: Transformer Interpretability section. I’m studying part-time, so I’m hoping to get through most of the content in a few months.