Given that Recursive Self-Improvement (RSI) is the main short-timelines risk model, why don’t we focus more technical governance effort on targeted ways to prevent this specific risk?
My understanding is that most current governance proposals fall into either the maximalist (halt all frontier progress) or minimalist (transparency, liability, etc.) camp. The targeted approach of specifically restricting automated AI R&D loops seems underexplored, even though there is likely far more political will for a measure of this kind, and it still addresses the main risk model.
A use restriction (that models aren’t used in a particular way) is much harder to verify than a model or hardware restriction (ensuring certain models don’t exist, or that certain hardware isn’t available in large quantities). And in some modest sense AIs have already been part of AI R&D loops since 2022, so it’s difficult to litigate what specifically presents an RSI risk beyond what AI companies already do.
The Qwen3-Coder-Next tech report has an interesting example of reward hacking, where models increasingly try to access the forbidden git history as RL progresses:
Reinforced Reward Hacking Blocker. Prior work has shown that GitHub-based environments may unintentionally leak future commit information, which agents can exploit to recover ground-truth fixes (e.g., via git log --all). To mitigate this, we adopt standard protections including removing remotes, branches, and tags.
During later RL stages, however, many new ways of reward hacking emerge. Agents attempt to reconnect local repositories to GitHub using commands such as git remote add, or retrieve commit history through git clone, curl, or similar tools, as illustrated in Figure 8.
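The sanitization the report describes (removing remotes, branches, and tags) can be sketched as a pre-rollout step. This is a minimal illustration on a synthetic repo, not the report's actual pipeline; note that deleting refs alone is not enough, since leaked commits can survive in the reflog and object store until garbage-collected:

```shell
# Hedged sketch: strip a task repo of anything that could lead an agent
# back to future commits. The repo here is a synthetic stand-in.
set -eu
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=t -c user.email=t@t commit -q --allow-empty -m "task snapshot"
# Simulate the leaky state: a remote, an extra branch, and a tag.
git -C "$repo" remote add origin https://example.com/project.git
git -C "$repo" branch future-fix
git -C "$repo" tag v2.0-solution

# 1. Remove all remotes, so fetch/pull have nothing cached to reach for.
for r in $(git -C "$repo" remote); do
  git -C "$repo" remote remove "$r"
done
# 2. Delete every branch except the one checked out.
current=$(git -C "$repo" branch --show-current)
for b in $(git -C "$repo" for-each-ref --format='%(refname:short)' refs/heads); do
  [ "$b" = "$current" ] || git -C "$repo" branch -D "$b"
done
# 3. Delete all tags.
git -C "$repo" tag -l | xargs -r -n 1 git -C "$repo" tag -d
# 4. Purge the now-unreachable objects: refs alone don't delete the commits.
git -C "$repo" reflog expire --expire=now --all
git -C "$repo" gc -q --prune=now
```

Even with all of this, the behavior quoted above shows why ref-stripping alone is insufficient: agents simply re-added remotes or reached GitHub via git clone and curl, so the sanitization has to be paired with network isolation of the sandbox.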
The difficult part is actually designing such a mechanism. I think a reasonable threshold would sit somewhere between the adequacy point and the parity point described by Ajeya Cotra (https://www.planned-obsolescence.org/p/six-milestones-for-ai-automation).