I underestimated safety research speedups from safe AI

(See my update 2 months later)

A year or so ago, I thought that a 10x speedup in safety research would require AIs so capable that takeover risks would be very high: 2-3x gains seemed plausible, but 10x seemed unlikely. I no longer think this.

What changed? I continue to think that AI x-risk research is predominantly bottlenecked on good ideas, but I suffered from a failure of imagination about the speedups that could be gained from AIs that are unable to produce great high-level ideas. I’ve realised that humans trying to get empirical feedback on their ideas waste a huge number of thought cycles on tasks that could be done by merely moderately capable AI agents.

I don’t expect this to be an unpopular position, but I thought it might be useful to share some details of how I see this speedup happening in my current research.

Tooling alone could 3-5x progress

If we stopped frontier AI progress today but had 6-12 months of tooling and scaffolding progress, I think the research direction I’ve been working on, a mix of conceptual and empirical interpretability work (parameter decomposition), could be sped up by 3-5x relative to the base rate of a year ago. The most recent month or two might already have seen a 1.5-2x speedup.

Where do the speedups come from? Most of the time my team has spent on the parameter decomposition agenda has gone into the following loop (a minimal code sketch follows the list):

  1. Test the current scheme on toy models by running hyperparameter sweeps.

  2. Squint at various metrics and figures and think about what might be going wrong.

  3. Then, either

    1. run more hyperparameter sweeps, or

    2. design new loss functions/​adjustments to the training process.

  4. Repeat
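
For concreteness, here is a minimal sketch of one pass through this loop. The hyperparameter names and the `train_toy_model` stub are illustrative placeholders, not our actual codebase.

```python
from itertools import product
import random

def train_toy_model(**config):
    """Placeholder for a real toy-model training run; returns dummy metrics."""
    return {"recon_loss": random.random(), "n_active_components": random.randint(1, 10)}

# Step 1: grid sweep over a few loss coefficients and training settings.
sweep_space = {
    "lr": [1e-3, 3e-4],
    "sparsity_coeff": [0.01, 0.1, 1.0],
    "recon_coeff": [0.1, 1.0],
}

results = []
keys = list(sweep_space)
for values in product(*(sweep_space[k] for k in keys)):
    config = dict(zip(keys, values))
    results.append((config, train_toy_model(**config)))

# Step 2: the human squints at the metrics/figures for each run.
for config, metrics in results:
    print(config, metrics)

# Step 3: either widen the sweep (3a) or edit the loss/training code (3b), then repeat.
```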

You might think: “Surely humans without AI can minimise the number of iterations here by doing very wide sweeps initially?” Unfortunately, the space of hyperparameter sweeps is extremely large when you’re testing 20+ different possible loss functions and training-process miscellanea. This is a problem because:

  • You don’t just want to find one good setting; you want to analyse the effects that the hyperparameters have on the output and on each other. This usually means grid sweeps rather than more efficient sweep methods that search for a global optimum, which quickly blows up the number of runs (see the rough numbers after this list).

  • If you sweep too wide, you’re waiting a long time for your runs (unless you can parallelise over hundreds of GPUs), and model training time hurts productivity a lot.

  • If you sweep too wide, it also becomes difficult and very time-consuming for humans to process the outputs.
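
A back-of-envelope illustration of the combinatorics (the knob counts and candidate values here are made up for illustration):

```python
from math import prod

# If each of 20 loss/training knobs had just 3 candidate values,
# a full grid would be hopeless:
full_grid = 3 ** 20
print(f"{full_grid:,} runs for the full grid")   # 3,486,784,401 runs

# In practice you sweep a handful of knobs at a time and hold the rest fixed:
one_sweep = {"lr": 3, "sparsity_coeff": 4, "n_components": 3, "seed": 3}
print(prod(one_sweep.values()), "runs for one targeted sweep")  # 108 runs
```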

How I think we could get a 3-5x speedup with better tooling/​scaffolding only:

  • Have a large fleet of agents that each attempt a few iterations of the above process themselves[1].

  • Humans spend their time reviewing an AI-curated list of run outputs and ideas for new things to try, and directing the AIs to explore certain directions.

  • The biggest speedup comes from automating 3a, i.e. the slow iteration loop of running sweeps, analysing the outputs (which can usually rely on concrete metrics and doesn’t require much thought), and then running more sweeps (a rough sketch of this setup follows the list).

  • I expect 3b (designing good alterations to the training process/​loss functions) to be too difficult for today’s models. But if an AI gives a strong human researcher several candidate ways to improve the training process, there’s a reasonable chance that one of them sparks an idea that points at an underlying problem (or is flat-out correct).
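
Here is a rough sketch of the scaffolding I have in mind, assuming some generic agent harness; `run_agent_iteration` and `summarise_for_human` are hypothetical placeholders, and the structure rather than the names is the point.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ITERATIONS = 3  # beyond a few iterations, today's agents drift off-track (see footnote 1)

def run_agent_iteration(direction, history):
    # Placeholder: in practice this would call an agent harness that edits configs,
    # launches a sweep, and reads the resulting metrics back.
    return {"direction": direction, "iteration": len(history), "stuck": False}

def summarise_for_human(history):
    # Placeholder: in practice an AI-written digest of plots/metrics plus
    # suggested next steps, for the human to review.
    return f"{len(history)} iterations completed"

def agent_worker(direction):
    """One agent: a few sweep -> analyse -> re-sweep iterations (steps 1-3a) on one direction."""
    history = []
    for _ in range(MAX_ITERATIONS):
        result = run_agent_iteration(direction, history)
        history.append(result)
        if result["stuck"]:
            break
    return {"direction": direction, "summary": summarise_for_human(history)}

# Humans choose the directions; a fleet of agents explores them in parallel;
# the human then reviews the curated summaries and redirects the fleet.
directions = ["per-layer sparsity coefficients", "warm up the reconstruction loss"]
with ThreadPoolExecutor(max_workers=len(directions)) as pool:
    reports = list(pool.map(agent_worker, directions))
for report in reports:
    print(report["direction"], "->", report["summary"])
```

The cap on iterations per agent is the main design choice in this sketch: it keeps each agent on a leash short enough that any drift is cheap to catch during human review.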

Going from 3-5x to 10x speedup

Going from 3-5x to 10+x with still-safe AI just requires models capable enough to do more iterations of steps 1-4 on their own, and to provide better ideas and analysis to the humans. I don’t know what level of capability is required to achieve this, but I don’t think it’s too far from the current level. Provided these slightly more capable models are not integrated absolutely everywhere in society with minimal controls, I don’t expect them to pose large x-risk.

My takeaways

  • This is not to say that I’m in favour of the strategy of pushing hard on improving AI capabilities until AI is extremely useful for alignment research and then stopping. The more momentum something has, the longer it takes to stop.

  • I think the extent to which research is amenable to automation with safe models is mostly a function of how conceptual vs empirical the work is: the more empirical the work, the more of it safe models can take over. The work I outlined above is more conceptual than most research that a poll of LessWrong would consider “safety research”, so I expect the average speedup to safety work to be even greater than what I’ve described here.

  • Those working on safety who aren’t actively trying to speed up their research with AI agents should start. I found the last 6 months to be a step change in the usefulness of AI agents.

  1. ^

    If they tried to do more than a few iterations, I expect today’s models would drift too far off-track.