Thanks for the detailed post. I’d like to engage with one specific aspect—the assumptions about how RL might work with scaled LLMs. I’ve chosen to focus on LLM architectures since that allows more grounded discussion than novel architectures; I am in the “LLMs will likely scale to ASI” camp, but much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them. (If a random guy develops sudden ASI in his basement via some novel architecture, I agree that that tends to end very poorly.)
The post series views RL as mathematically specified reward functions that are expressible in a few lines of Python, which naturally leads to genie/literal-interpretation concerns. However, present day RL is more complicated and nuanced:
RLHF and RLAIF operate on human preferences rather than crisp mathematical objectives
Labs are expanding RLVR (RL from Verifiable Rewards) beyond simple mathematical tasks to diverse domains
The recent Dwarkesh episode with Sholto Douglas and Trenton Bricken (May ’25) discusses how labs are massively investing in RL diversity and why they reject the “RL just selects from the pretraining distribution” critique (which may apply to o1-scale compute but likely not o3 and even less so o5-scale)
We’re seeing empirical progress on reward hacking:
Claude 3.7 exhibited observable reward hacking in deployments
Anthropic’s response was to specifically address this, resulting in Claude 4 hacking significantly less
--> This demonstrates both that labs have strong incentives to reduce reward hacking and that they’re making concrete progress
The threat model requires additional assumptions:
To get dangerous reward hacking despite these improvements, we’d need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.
Additionally, one could imagine RL environments specifically designed to train against reward hacking behaviors, teaching models to recognize and avoid exploiting misspecified objectives. When training is diversified across many environments and objectives, systematic hacking becomes increasingly difficult.
None of this definitively proves LLMs will scale safely to ASI, but it does suggest the risk is less than proposed here.
In §2.3.5 I talk about whether or not there is such a thing as “RLVR done right” that doesn’t push towards scheming / treacherous turns. My upshot is:
I’m mildly skeptical (but don’t feel super-strongly) that you can do RLVR without pushing towards scheming at all.
I agree with you that there’s clearly room for improvement in making RLVR push towards scheming less on the margin.
much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them
If the plan is what I call “the usual agent debugging loop”, then I think we’re doomed, and it doesn’t matter at all whether this debugging loop is being run by a human or by an LLM, and it doesn’t matter at all whether the reward function being updated during this debugging loop is being updated via legible code edits versus via weight-updates within an inscrutable learned classifier.
The problem with “the usual agent debugging loop”, as described at that link above, is that more powerful future AIs will be capable of treacherous turns, and you can’t train those away by the reward function because as soon as the behavior manifests even once, it’s too late. (Obviously you can and should try honeypots, but that’s a a sanity-check not a plan, see e.g. Distinguishing test from training.)
The threat model requires additional assumptions: To get dangerous reward hacking despite these improvements, we’d need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.
As a side-note, I’m opposed to the recent growth of the term “reward hacking” as a synonym for “lying and cheating”. I’m talking about scheming and treacherous turns, not “obvious lying and cheating in deployment” which should be solvable by ordinary means just like any other obvious behavior. Anyway, current LLMs seem to already have enough situational awareness to enable treacherous turns in principle (see the Alignment Faking thing), and even if they didn’t, future AI certainly will, and future AI is what I actually care about.
Thanks for the detailed post. I’d like to engage with one specific aspect—the assumptions about how RL might work with scaled LLMs. I’ve chosen to focus on LLM architectures since that allows more grounded discussion than novel architectures; I am in the “LLMs will likely scale to ASI” camp, but much of this also applies to new architectures, since a major lab can apply RL via LLM-oriented tools to them. (If a random guy develops sudden ASI in his basement via some novel architecture, I agree that that tends to end very poorly.)
The post series views RL as mathematically specified reward functions that are expressible in a few lines of Python, which naturally leads to genie/literal-interpretation concerns. However, present day RL is more complicated and nuanced:
RLHF and RLAIF operate on human preferences rather than crisp mathematical objectives
Labs are expanding RLVR (RL from Verifiable Rewards) beyond simple mathematical tasks to diverse domains
The recent Dwarkesh episode with Sholto Douglas and Trenton Bricken (May ’25) discusses how labs are massively investing in RL diversity and why they reject the “RL just selects from the pretraining distribution” critique (which may apply to o1-scale compute but likely not o3 and even less so o5-scale)
We’re seeing empirical progress on reward hacking:
Claude 3.7 exhibited observable reward hacking in deployments
Anthropic’s response was to specifically address this, resulting in Claude 4 hacking significantly less
--> This demonstrates both that labs have strong incentives to reduce reward hacking and that they’re making concrete progress
The threat model requires additional assumptions: To get dangerous reward hacking despite these improvements, we’d need models that are situationally aware enough to selectively reward hack only in specific calculated scenarios while avoiding detection during training/evaluation. This requires much more sophistication and subtlety than the current reward hacks.
Additionally, one could imagine RL environments specifically designed to train against reward hacking behaviors, teaching models to recognize and avoid exploiting misspecified objectives. When training is diversified across many environments and objectives, systematic hacking becomes increasingly difficult. None of this definitively proves LLMs will scale safely to ASI, but it does suggest the risk is less than proposed here.
In §2.4.1 I talk about learned reward functions.
In §2.3.5 I talk about whether or not there is such a thing as “RLVR done right” that doesn’t push towards scheming / treacherous turns. My upshot is:
I’m mildly skeptical (but don’t feel super-strongly) that you can do RLVR without pushing towards scheming at all.
I agree with you that there’s clearly room for improvement in making RLVR push towards scheming less on the margin.
If the plan is what I call “the usual agent debugging loop”, then I think we’re doomed, and it doesn’t matter at all whether this debugging loop is being run by a human or by an LLM, and it doesn’t matter at all whether the reward function being updated during this debugging loop is being updated via legible code edits versus via weight-updates within an inscrutable learned classifier.
The problem with “the usual agent debugging loop”, as described at that link above, is that more powerful future AIs will be capable of treacherous turns, and you can’t train those away by the reward function because as soon as the behavior manifests even once, it’s too late. (Obviously you can and should try honeypots, but that’s a a sanity-check not a plan, see e.g. Distinguishing test from training.)
As a side-note, I’m opposed to the recent growth of the term “reward hacking” as a synonym for “lying and cheating”. I’m talking about scheming and treacherous turns, not “obvious lying and cheating in deployment” which should be solvable by ordinary means just like any other obvious behavior. Anyway, current LLMs seem to already have enough situational awareness to enable treacherous turns in principle (see the Alignment Faking thing), and even if they didn’t, future AI certainly will, and future AI is what I actually care about.