Let’s say you want AIs in Early Crunch Time to solve your pet problem X, and you think the bottleneck will be your ability to oversee, supervise, direct, and judge the AI’s research output.
Here are two things you can do now, in 2025:
1. You can do X research yourself, so that you know more about X and can act like a grad supervisor.
2. You can work on scalable oversight = {CoT monitoring, activation monitoring, control protocols, formal verification, debate}, so you can deploy more powerful models for the same chance of catastrophe (a minimal sketch of what this could look like is given below).
I expect (2) will help more with X. I don’t think being a grad supervisor will be a bottleneck for very long, compared with stuff like ‘we can’t deploy the more powerful AI because our scalable oversight isn’t good enough to prevent catastrophe’. It also has the nice property that the gains will spread to everyone’s pet problem, not just your own.
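To make bucket (2) concrete, here is a minimal sketch of one item from that set, a CoT-monitoring control protocol in which a weaker trusted model reads the stronger model’s chain of thought and gates each action. All names (`untrusted_model`, `trusted_monitor`, `execute`, the threshold value) are hypothetical stand-ins, not any real lab’s API.

```python
# Minimal sketch of a CoT-monitoring control protocol. All names here are
# hypothetical stand-ins, not a real deployment stack's API.

SUSPICION_THRESHOLD = 0.8  # would be tuned against a red team's best attacks


def run_step(task, untrusted_model, trusted_monitor, execute):
    """Let the powerful (untrusted) model act, but gate each action on a
    weaker, trusted monitor that reads its chain of thought."""
    # The untrusted model proposes an action along with its chain of thought.
    chain_of_thought, proposed_action = untrusted_model(task)

    # The trusted monitor scores how suspicious the reasoning and action look.
    suspicion = trusted_monitor(task, chain_of_thought, proposed_action)

    if suspicion >= SUSPICION_THRESHOLD:
        # Too risky: escalate to a human (or resample / refuse) instead of executing.
        return "escalated to human"

    return execute(proposed_action)
```

Better scalable oversight corresponds to a monitor that separates benign from malicious trajectories more reliably, which is what would let you run a stronger untrusted model at the same level of worry.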
I’m having trouble seeing how (2) is actually a thing.
The whole problem is that there is no generally agreed “chance of catastrophe”, so “same chance of catastrophe” has no real meaning. This kind of talk seems to be backchained from what governance people want, as opposed to the reality of safety guarantees or safety/risk probabilities, which is that they don’t meaningfully exist [outside of super-heuristic guesses].
Indeed, to estimate this probability in a non-bullshit way, we need exactly the fundamental scientific progress of (1).
EA has done this exercise a dozen times: if you ask experts for the probability of doom, the answers range all the way from 99% to 0.0001%.
Will that change? Will expert judgement converge? Maybe. Maybe not. I don’t have a crystal ball.
Even if expert judgement does converge [absent meaningful progress on (1)], those probabilities will reflect political reality rather than actual reality.
The problem is that there is no ‘scientific’ way to estimate p(doom), and as long as we don’t make serious progress on (1), there won’t be.
I don’t see how CoT/activation/control monitoring will have any significant and scientifically grounded [as opposed to purely storytelling/political] influence in a way that can be measured and used to make risk trade-offs.
“if you ask experts for the probability of doom, the answers range all the way from 99% to 0.0001%.”
What matters here is the chance that a particular model in a particular deployment plan will cause a particular catastrophe. And this is after the model has been trained and evaluated and red-teamed and mech-interped (imperfectly, of course). I don’t expect such divergent probabilities from experts.
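A toy illustration of the kind of per-deployment estimate meant here, with every number below invented purely for the example; better oversight shows up as a lower per-attempt success probability:

```python
# Toy back-of-the-envelope for one particular model under one particular
# deployment plan. All numbers are invented for illustration only.

p_scheming = 0.10          # credence this model would try to cause a catastrophe
n_attempts = 20            # attack attempts expected over the whole deployment
p_attempt_succeeds = 0.05  # chance one attempt slips past the oversight stack
                           # (CoT/activation monitoring, audits, etc.)

# Chance at least one attempt succeeds, given the model tries at all.
p_any_success = 1 - (1 - p_attempt_succeeds) ** n_attempts

p_catastrophe = p_scheming * p_any_success
print(f"P(catastrophe for this deployment plan) ≈ {p_catastrophe:.3f}")  # ≈ 0.064
```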
Referring to particular models, particular deployment plans, and particular catastrophes doesn’t help: the answer is the same.
We don’t know how to scientifically quantify any of these probabilities.
You can replace “deployment plan P1 has the same chance of catastrophe as deployment plan P2” with “the safety team is equally worried by P1 and P2”.
In Pre-Crunch Time, here’s what I think we should be working on:
1. Problems that need to be solved so that we can safely and usefully accelerate R&D, i.e. scalable oversight.
2. Problems that will be hard to accelerate with R&D, e.g. very conceptual stuff, policy.
3. Problems that you think, with a few years of grinding, could be reduced from hard-to-accelerate to easy-to-accelerate, e.g. ARC Theory.
Notably, what ‘work’ looks like on these problems will depend heavily on which of the three buckets a given problem belongs to.