Additive versus Multiplicative models of AI-assisted research
Occasionally one hears somebody say: “Most of the relevant AI-safety work will be done at crunch time. Most work being done at present is marginal.”
One cannot shake the suspicion that this statement merely reflects the paucity of ideas and vision of the speaker. Yet it cannot be denied that their reasoning has a certain logic: if, as seems likely, AI will become more and more dominant in AI alignment research, then maybe we should be focusing on how to safely extract work from future superintelligent machines rather than hurting our painfully slow mammalian walnuts trying to crack AI safety research today. I understand this to be a key motivation for several popular AI safety agendas.
Similarly, many Pause advocates argue that pause advocacy is more impactful than direct research. Most will admit that a Pause cannot be maintained indefinitely; the aim of a Pause would be to buy time to figure out alignment. Unless one believes in very long pauses, there seems to be an implicit assumption that research progress will be faster in the future.
Implicitly, we might say, there is an underlying “additive” model of AI-assisted research: there is a certain amount of research that needs to be done to solve alignment. Humans do some of it, AI does some (more) of it. If the sum is large enough, we are golden.
Contrast this with a “multiplicative” model of AI-assisted research:
Humans increasingly take on a supervisory and directing role for alignment research done by AIs. Human experts become ‘research managers’ of AI grad students. The key bottleneck is increasingly the capacity of humans to effectively oversee, supervise, direct, and judge the AIs’ research output. In this model, research effort by humans and AI is multiplicative: the better the humans’ understanding, the more effectively AI can be leveraged.
These are highly simplified models, of course. I still think it’s worthwhile to keep them in mind, since they imply very different courses of action. In the additive framework, direct research today is somewhat marginal. In the multiplicative model, every research XP point that humans build today is worth 10x tomorrow in effective research output at crunch time. [1]
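To make the contrast concrete, here is a minimal toy sketch in code. The numbers (one human “XP point” and an AI leverage factor of 10) are purely illustrative assumptions, not estimates of anything real; the point is only that the marginal value of human research done today differs sharply between the two models.

```python
# Toy comparison of the additive and multiplicative models of AI-assisted research.
# All quantities are illustrative assumptions, not estimates of anything real.

def additive_output(human_research: float, ai_research: float) -> float:
    """Additive model: total progress is just the sum of human and AI research."""
    return human_research + ai_research

def multiplicative_output(human_research: float, ai_leverage: float) -> float:
    """Multiplicative model: human understanding scales how much AI research
    can be effectively overseen, directed, and judged."""
    return human_research * ai_leverage

human_today = 1.0   # research "XP points" humans have built by crunch time
ai_factor = 10.0    # assumed AI research effort / leverage at crunch time
extra = 1.0         # one additional XP point of human research done today

# Marginal value of that extra XP point under each model:
print(additive_output(human_today + extra, ai_factor)
      - additive_output(human_today, ai_factor))           # 1.0: worth the same as ever
print(multiplicative_output(human_today + extra, ai_factor)
      - multiplicative_output(human_today, ai_factor))     # 10.0: the "10x" claim above
```

Under the additive sketch a unit of human research is worth the same whenever it is done; under the multiplicative sketch its value is amplified by whatever leverage the AI provides at crunch time.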
Moreover, in the latter model it is especially important to start thinking today about very ambitious plans/ideas/methods that are less likely to be obsoleted the moment the next generation of AI comes along.
Let’s say you want AIs in Early Crunch Time to solve your pet problem X, and you think the bottleneck will be your ability to oversee, supervise, direct, and judge the AI’s research output.
Here are two things you can do now, in 2025:
(1) You can do X research yourself, so that you know more about X and can act like a grad supervisor.
(2) You can work on scalable oversight = {CoT monitoring, activation monitoring, control protocols, formal verification, debate}, so that you can deploy more powerful models for the same chance of catastrophe.
I expect (2) will help more with X. I don’t think being a grad supervisor will be a bottleneck for very long, compared with stuff like ‘we can’t deploy the more powerful AI because our scalable oversight isn’t good enough to prevent catastrophe’. It also has the nice property that the gains will spread to everyone’s pet problem, not just your own.
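For concreteness, here is a very rough sketch of the shape of one ingredient of (2), a control protocol of the trusted-monitoring flavour discussed in the AI-control literature: a weaker trusted monitor screens the untrusted model’s output and escalates anything suspicious instead of deploying it. All names and the threshold below are hypothetical placeholders, not a real implementation.

```python
# Hypothetical sketch of a trusted-monitoring control protocol (illustrative only).
# `untrusted_model`, `trusted_monitor`, and `human_audit` are placeholder callables.

from typing import Callable

SUSPICION_THRESHOLD = 0.8   # assumed calibration chosen by the safety team


def run_with_control(task: str,
                     untrusted_model: Callable[[str], str],
                     trusted_monitor: Callable[[str, str], float],
                     human_audit: Callable[[str, str], str]) -> str:
    """Generate an answer with the powerful untrusted model, but only release it
    if a weaker trusted monitor does not flag it as suspicious."""
    candidate = untrusted_model(task)
    suspicion = trusted_monitor(task, candidate)   # suspicion score in [0, 1]
    if suspicion >= SUSPICION_THRESHOLD:
        # Escalate rather than deploy: fall back to human review of the output.
        return human_audit(task, candidate)
    return candidate
```

Better monitors and protocols would, in this picture, let you deploy a more capable untrusted model at the same level of worry, which is what “same chance of catastrophe” is gesturing at.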
I’m having trouble seeing that (2) is actually a thing?
The whole problem is that there is no generally agreed “chance of catastrophe”, so “same chance of catastrophe” has no real meaning.
It seems this kind of talk is backchained from what governance people want, as opposed to the reality of safety guarantees or safety/risk probabilities, which is that they don’t meaningfully exist [outside of super-heuristic guesses].
Indeed, to estimate this probability in a non-bullshit way, we need exactly the fundamental scientific progress of (1).
EA has done this exercise a dozen times: if you ask experts for the probability of doom, the answers range all the way from 99% to 0.0001%.
Will that change? Will expert judgement converge? Maybe. Maybe not. I don’t have a crystal ball.
Even if they do converge [absent meaningful progress on (1)], those probabilities will reflect political reality rather than actual reality.
The problem is that there is no ‘scientific’ way to estimate p(doom), and as long as we don’t make serious progress on (1), there won’t be.
I don’t see how CoT/activation/control monitoring will have any significant and scientifically grounded [as opposed to purely storytelling/political] influence in a way that can be measured and used to make risk tradeoffs.
What matters here is the chance that a particular model under a particular deployment plan will cause a particular catastrophe. And this is after the model has been trained, evaluated, red-teamed, and mech-interped (imperfectly, of course). I don’t expect such divergent probabilities from experts.
Referring to particular models, particular deployment plans, and particular catastrophes doesn’t help; the answer is the same.
We don’t know how to scientifically quantify any of these probabilities.
You can replace “deployment plan P1 has the same chance of catastrophe as deployment plan P2” with “the safety team is equally worried by P1 and P2”.
In Pre-Crunch Time, here’s what I think we should be working on:
1. Problems that need to be solved so that we can safely and usefully accelerate R&D, i.e. scalable oversight
2. Problems that will be hard to accelerate with R&D, e.g. very conceptual stuff, policy
3. Problems that you think, with a few years of grinding, could be reduced from hard-to-accelerate to easy-to-accelerate, e.g. ARC Theory
Notably, what ‘work’ looks like on these problems will depend heavily on which of the three buckets a given problem belongs to.
Is one takeaway of your post that we should consider current safety research as more about training human researchers than about the actual knowledge obtained from the research?
I didn’t intend it that way, though admittedly that is a valid reading. From my own point of view both functions seem significant.
I think there’s a lot of truth to this: modern LLMs are a kind of competence multiplier, where some competence values are negative (perhaps a competence exponentiator?).
I find that I can extract value from LLMs only if I’m asking about something that I almost already know. That way I can judge whether an answer is getting at the wrong thing, assess the relevance of citations, and verify a correct answer rapidly and highly robustly if it is offered (which is important because typically a series of convincing non-answers or wrong answers comes first).
Though LLMs seem to be getting more useful in the best case, they also seem to be getting more dangerous in the worst case, so I am not sure whether this dynamic will soften or sharpen over time.