ryan_greenblatt comments on ryan_greenblatt’s Shortform

ryan_greenblatt 13 May 2026 22:36 UTC
LW: 75 AF: 31
0
Here are some of my top candidates for big pushes to do right now on technical AI safety (low-effort notes):
- Much better model organisms / misalignment analogies:
  - Doing a wider set of pessimized training runs
    Good candidate for lots of AI labor automation? Like maybe good to try to set up pipelines for building these envs.
  - Demonstrating risks from fitness-seekers/reward-seekers empirically
    Even on current models with better tests, see here
  - Demonstrating various types of memetic spread of misalignment?
- Actually do control
  - Build pipelines for red-teaming monitors and the agent itself. For the agent red-teaming, I’d put particular focus on checking whether it continues malign trajectories.
  - Scaffold-integrated control features and other non-monitoring runtime control measures
  - Human response and auditing
  - Improving async and sync monitoring
  - Agent security features
  - Surveilling for rogue internal deployments (as in, building after-the-fact detection methods for rogue deployments)
- Preparing for handoff and elicitation
  - Get AIs generically better at conceptual work
  - Have a plan for the evals we ultimately need to see if handoff/deference would go well and start iterating on earlier versions
    These presumably will involve a bunch of manual scoring, so we’ll need to build a process for it.
  - Analyze AI biases and epistemics and improve across many domains
  - Build the anti-slop/anti-mundane-misalignment coalition via doing ratings of AIs and applying some pressure to improve on these ratings. This could focus on a variety of related issues.
    The hope is basically that there might be widespread interest in removing/reducing mundane misalignment and other non-misalignment behavioral problems that reduce productivity, and large parts of this seem differentially good. So, if we could make this a salient metric, AI companies might improve on it. A lot of the difficulty would be in measuring the problem reasonably well. There are a bunch of different ways to apply pressure or increase salience if we had decent metrics, especially if these metrics legibly correspond to a common problem that many people are running into.
  - Try to do various trend extrapolations on things here to argue we aren’t on track?
- Neuralese decoding prep: Make natural language autoencoders much better, build methods for extracting internal CoT, build better evaluations of how well natural language autoencoders work.
- Cole Wyeth 15 May 2026 1:15 UTC
  LW: 9 AF: 2
  7
  If your model of how we get safety is wrong, the effect of your third point may be largely to justify pushing capabilities in certain areas.
  - ryan_greenblatt 15 May 2026 2:31 UTC
    LW: 3 AF: 2
    0
    I sort of agree? I think the net effect on overall capabilities progress is pretty small, and some of the actions I proposed would hopefully divert people from generic capabilities to working on this type of (hopefully particularly differential) capabilities. But I agree that some of these actions would involve safety-motivated people doing work that would shorten timelines (relative to if they did nothing / worked on areas with no capabilities externalities), and it could turn out this work isn’t valuable.
    
    For “Get AIs generically better at conceptual work”, I think it’s especially relevant to account for whether (1) the work would be done by capabilities people later anyway, without this having a sufficiently useful acceleration effect on differential capabilities, and (2) the work has significant capabilities externalities. It’s possible I should describe this area as “Get AIs generically better at particularly differential conceptual work”, but I’m also sympathetic to work that tries less hard to be strongly differential/focused, depending on the circumstances and various details.