[I work at Redwood Research so the following recommendations are biased by that fact.]
I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:
I also endorse talking about recursive oversight, though personally I’m most excited about recursive oversight techniques that (1) are trying to be robust against models trying to subvert the oversight process, and (2) techniques that assume access to a small amount of ground truth data.
I also like various kinds of “basic science” on safety-relevant aspects of AIs; a lot of Owain Evans’ recent work comes to mind.
Yes there is a general question I want to talk about which is the gap between training, evaluation, and deployment, and the reasons why models might be:
1. Able to tell in which of these environments they are in
[I work at Redwood Research so the following recommendations are biased by that fact.]
I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:
Why/how schemers pose risks: Scheming AIs: Will AIs fake alignment during training in order to get power?, Risks from Learned Optimization, The case for ensuring that powerful AIs are controlled, Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking, A sketch of an AI control safety case
High-stakes control (i.e. preventing concentrated failures): AI Control: Improving Safety Despite Intentional Subversion, AI catastrophes and rogue deployments, How to prevent collusion when using untrusted models to monitor each other, Ctrl-Z: Controlling AI Agents via Resampling,
Low-stakes control (i.e. preventing non-concentrated failures such as research sabotage): This is a super important part of AI control, but we don’t have great resources on it yet. You might want to look at Notes on handling non-concentrated failures with AI control: high level methods and different regimes for inspiration, or How can we solve diffuse threats like research sabotage with AI control?
Misc other notes:
For hands-on projects, I’d recommend 7+ tractable directions in AI control for inspiration, and talking to Ethan Perez.
I also endorse talking about recursive oversight, though personally I’m most excited about recursive oversight techniques that (1) are trying to be robust against models trying to subvert the oversight process, and (2) techniques that assume access to a small amount of ground truth data.
I also like various kinds of “basic science” on safety-relevant aspects of AIs; a lot of Owain Evans’ recent work comes to mind.
I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.
Yes there is a general question I want to talk about which is the gap between training, evaluation, and deployment, and the reasons why models might be:
1. Able to tell in which of these environments they are in
2. Act differently based on that