Here are some of my top candidates for big pushes to do right now on technical AI safety (low effort notes):
Much better model organisms / misalignment analogies:
Doing a wider set of pessimized training runs
Good candidate for lots of AI labor automation? Like maybe good to try to set up pipelines for building these envs.
Demonstrating risks from fitness-seekers/reward-seekers empirically
Even on current models with better tests, see here
Demonstrating various types of memetic spread of misalignment?
Actually do control
Build pipelines for red-teaming monitors and the agent itself. For the agent red-teaming, I’d put particular focus on checking whether it continues malign trajectories.
Scaffold integrated control features and other non-monitoring runtime control measures
Human response and auditing
Improving async and sync monitoring
Agent security features
Surveilling for rogue internal deployments (as in, building after-the-fact detection methods for rogue deployments)
Preparing for handoff and elicitation
Get AIs generically better at conceptual work
Have a plan for the evals we ultimately need to see if handoff/deference would go well and start iterating on earlier versions
These presumably will involve a bunch of manual scoring, so we’ll need to build a process for it.
Analyze AI biases and epistemics and improve across many domains
Build the anti-slop/anti-mundane-misalignment coalition via doing ratings of AIs and applying some pressure to improve on these ratings. This could focus on a variety of related issues.
The hope is basically that there might be widespread interest in removing/reducing mundane misalignment and other non-misalignment behavioral problems that reduce productivity, and large parts of this seem differentially good. So, if we could make this a salient metric, AI companies might improve on it. A lot of the difficulty would be in measuring the problem reasonably well. There are a bunch of different ways to apply pressure or increase salience if we had decent metrics, especially if these metrics legibly correspond to a common problem that many people are running into.
Try to do various trend extrapolations on things here to argue we aren’t on track?
Neuralese decoding prep: Make natural language autoencoders much better, build methods for extracting internal CoT, build better evaluations of how well natural language autoencoders work.
If your model of how we get safety is wrong, the effect of your third point may be largely to justify pushing capabilities in certain areas.
I sort of agree? I think the net effect on overall capabilities progress is pretty small and some of the actions I proposed would hopefully divert people from generic capabilities to working on this type of (hopefully particularly differential) capabilities. But I agree that some of these actions would involve safety-motivated people doing work that would shorten timelines (relative to if they did nothing / worked on areas with no capabilities externalities) and it could turn out this work isn’t valuable.
For “Get AIs generically better at conceptual work”, I think it seems especially relevant to account for whether: (1) the work would be done by capabilities people later anyway without this having a sufficiently useful acceleration effect on differential capabilities, and (2) the work has significant capabilities externalities. It’s possible I should describe this area as “Get AIs generically better at particularly differential conceptual work”, but I’m also sympathetic to work that tries less hard to be strongly differential/focused depending on the circumstances and various details.