Some people share a cluster of ideas that I think is broadly correct. I want to write down these ideas explicitly so people can push-back.
The experiments we are running today are kinda ‘bullshit’[1] because the thing we actually care about doesn’t exist yet, i.e. ASL-4, or AI powerful enough that they could cause catastrophe if we were careless about deployment.
The experiments in pre-crunch-time use pretty bad proxies.
90% of the “actual” work will occur in early-crunch-time, which is the duration between (i) training the first ASL-4 model, and (ii) internally deploying the model.
In early-crunch-time, safety-researcher-hours will be an incredible scarce resource.
The cost of delaying internal deployment will be very high: a billion dollars of revenue per day, competitive winner-takes-all race dynamics, etc.
There might be far fewer safety researchers in the lab than there currently are in the whole community.
Because safety-researcher-hours will be such a scarce resource, it’s worth spending months in pre-crunch-time to save ourselves days (or even hours) in early-crunch-time.
Therefore, even though the pre-crunch-time experiments aren’t very informative, it still makes sense to run them because they will slightly speed us up in early-crunch-time.
They will speed us up via:
Rough qualitative takeaways like “Let’s try technique A before technique B because in Jones et al. technique A was better than technique B.” However, the exact numbers in the Results table of Jones et al. are not informative beyond that.
The tooling we used to run Jones et al. can be reused for early-crunch-time, c.f. Inspect and TransformerLens.
The community discovers who is well-suited to which kind of role, e.g. Jones is good at large-scale unsupervised mech interp, and Smith is good at red-teaming control protocols.
Sometimes I use the analogy that we’re shooting with rubber bullets, like soldiers do before they fight a real battle. I think that might overstate how good the proxies are, it’s probably more like laser tag. But it’s still worth doing because we don’t have real bullets yet.
I want a better term here. Perhaps “practice-run research” or “weak-proxy research”?
On this perspective, the pre-crunch-time results are highly worthwhile. They just aren’t very informative. And these properties are consistent because the value-per-bit-of-information is so high.
My immediate critique would be step 7: insofar as people are updating today on experiments which are bullshit, that is likely to slow us down during early crunch, not speed us up. Or, worse, result in outright failure to notice fatal problems. Rather than going in with no idea what’s going on, people will go in with too-confident wrong ideas of what’s going on.
To a perfect Bayesian, a bullshit experiment would be small value, but never negative. Humans are not perfect Bayesians, and a bullshit experiment can very much be negative value to us.
Yep, I’ll bite the bullet here. This is a real problem and partly my motivation for writing the perspective explicitly.
I think people who are “in the know” are good at not over-updating on the quantitative results. And they’re good at explaining that the experiments are weak proxies which should be interpreted qualitatively at best. But people “out of the know” (e.g. junior ai safety researches) tend to overupdate and probably read the senior researchers as professing generic humility.
I would guess that even the “in the know” people are over-updating, because they usually are Not Measuring What They Think They Are Measuring even qualitatively. Like, the proxies are so weak that the hypothesis “this result will qualitatively generalize to <whatever they actually want to know about>” shouldn’t have been privileged in the first place, and the right thing for a human to do is ignore it completely.
Who (besides yourself) has this position? I feel like believing the safety research we do now is bullshit is highly correlated with thinking its also useless and we should do something else.
I do, though maybe not this extreme. Roughly every other day I bemoan the fact that AIs aren’t misaligned yet (limiting the excitingness of my current research) and might not even be misaligned in future, before reminding myself our world is much better to live in than the alternative. I think there’s not much else to do with a similar impact given how large even a 1% p(doom) reduction is. But I also believe that particularly good research now can trade 1:1 with crunch time.
Theoretical work is just another step removed from the problem and should be viewed with at least as much suspicion.
I like your emphasis on good research. I agree that the best current research does probably trade 1:1 with crunch time.
I think we should apply the same qualification to theoretical research. Well-directed theory is highly useful; poorly-directed theory is almost useless in expectation.
I think theory directed specifically at LLM-based takeover-capable systems is neglected, possibly in part because empiricists focused on LLMs distrust theory, while theorists tend to dislike messy LLMs.
I share almost exactly this opinion, and I hope it’s fairly widespread.
The issue is that almost all of the “something elses” seem even less productive on expectation.
(That’s for technical approaches. The communication-minded should by all means be working on spreading the alarm and so slowing progress and raising the ambient levels fo risk-awareness).
LLM research could and should get a lot more focused on future risks instead of current ones. But I don’t see alternatives that realistically have more EV.
It really looks like the best guess is that AGI is now quite likely to be descended from LLMs. And I see little practical hope of pausing that progress. So accepting the probabilities on the game board and researching LLMs/transformers makes sense even when it’s mostly practice and gaining just a little bit of knowledge of how LLMs/transformers/networks represent knowledge and generate behaviors.
It’s of course down to individual research programs; there’s a bunch of really irrelevant LLM research that would be better directed elsewhere. And having a little effort directed to unlikely scenarios where we get very different AGI is also defensible—as long as it’s defended, not just hope-based.
This is of course a major outstanding debate, and needs to be had carefully. But I’d really like to see more of this type of careful thinking about the likely efficiency of different research routes.
I think there’s low-hanging fruit in trying to improve research on LLMs to anticipate the new challenges that arrive when LLM-descended AGI becomes actually dangerous. My recent post LLM AGI may reason about its goals and discover misalignments by default suggests research addressing one fairly obvious possible new risk when LLM-based systems become capable of competent reasoning and planning.
IIRC I heard the “we’re spending months now to save ourselves days (or even hours) later” from the control guys, but I don’t know if they’d endorse the perspective I’ve outlined
I do, which is why I’ve always placed much more emphasis on figuring out how to do automated AI safety research as safely as we can, rather than trying to come up with some techniques that seem useful at the current scale but will ultimately be a weak proxy (but are good for gaining reputation in and out of the community, cause it looks legit).
That said, I think one of the best things we can hope for is that these techniques at least help us to safely get useful alignment research in the lead up to where it all breaks and that it allows us to figure out better techniques that do scale for the next generation while also having a good safety-usefulness tradeoff.
rather than trying to come up with some techniques that seem useful at the current scale but will ultimately be a weak proxy
To clarify, this means you don’t hold the position I expressed. On the view I expressed, experiments using weak proxies are worthwhile even though they aren’t very informative
Hmm, so I still hold the view that they are worthwhile even if they are not informative, particularly for the reasons you seem to have pointed to (i.e. training up good human researchers to identify who has a knack for a specific style of research s.t. we can use them for providing initial directions to AIs automating AI safety R&D as well as serving as model output verifiers OR building infra that ends up being used by AIs that are good enough to do tons of experiments leveraging that infra but not good enough to come up with completely new paradigms).
how confident are you that safety researchers will be able to coordinate at crunch time, and it won’t be eg. only safety researchers at one lab?
without taking things like personal fit into account, how would you compare say doing prosaic ai safety research pre-crunch time to policy interventions helping you coordinate better at crunch time (for instance helping safety teams coordinate better at crunch time, or even buying more crunch time)?
I do think that safety researchers might be good at coordinating even if the labs aren’t. For example, safety researchers tend to be more socially connected, and also they share similar goals and beliefs.
Labs have more incentive to share safety research than capabilities research, because the harms of AI are mostly externalised whereas the benefits of AI are mostly internalised.
This includes extinction obviously, but also misuse and accidental harms which would cause industry-wide regulations and distrust.
While I’m not excited by pausing AI[1], I do support pushing labs to do more safety work between training and deployment.[2][3]
I think sharp takeoff speeds are scarier than short timelines.
I think we can increase the effective-crunch-time by deploying Claude-n to automate much of the safety work that must occur between training and deploying Claude-(n+1). But I don’t know if there’s any ways which accelerate Claude-n at safety work but not the capabilities work.
1. What the target goal of early-crunch time research should be (i.e. control safety case for the specific model one has at the present moment, trustworthy case for this specific model, trustworthy safety case for the specific model and deference case for future models, trustworthy safety case for all future models, etc...)
2. The rough shape(s) of that case (i.e. white-box evaluations, control guardrails, convergence guarantees, etc...)
3. What kinds of evidence you expect to accumulate given access to these early powerful models.
I expect I disagree with the view presented, but without clarification on the points above I’m not certain. I also expect my cruxes would route through these points
I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
Should we just wait for research systems/models to get better?
[...] Moreover, once end-to-end automation is possible, it will still take time to integrate those capabilities into real projects, so we should be building the necessary infrastructure and experience now. As Ryan Greenblatt has said, “Further, it seems likely we’ll run into integration delays and difficulties speeding up security and safety work in particular[…]. Quite optimistically, we might have a year with 3× AIs and a year with 10× AIs and we might lose half the benefit due to integration delays, safety taxes, and difficulties accelerating safety work. This would yield 6 additional effective years[…].” Building automated AI safety R&D ecosystems early ensures we’re ready when more capable systems arrive.
Research automation timelines should inform research plans
It’s worth reflecting on scheduling AI safety research based on when we expect sub-areas of safety research will be automatable. For example, it may be worth putting off R&D-heavy projects until we can get AI agents to automate our detailed plans for such projects. If you predict that it will take you 6 months to 1 year to do an R&D-heavy project, you might get more research mileage by writing a project proposal for this project and then focusing on other directions that are tractable now. Oftentimes it’s probably better to complete 10 small projects in 6 months and then one big project in an additional 2 months, rather than completing one big project in 7 months.
This isn’t to say that R&D-heavy projects are not worth pursuing—big projects that are harder to automate may still be worth prioritizing if you expect them to substantially advance downstream projects (such as ControlArena from UK AISI). But research automation will rapidly transform what is ‘low-hanging fruit’. Research directions that are currently impossible due to the time or necessary R&D required may quickly go from intractable to feasible to trivial. Carefully adapting your code, your workflow, and your research plans for research automation is something you can—and likely should—do now.
I’m also very interested in having more discussions on what a defence-in-depth approach would look like for early automated safety R&D, so that we can get value from it for longer and point the system towards the specific kinds of projects that will lead to techniques that scale to the next scale-up / capability increase.
A piece of pushback: there might not be a clearly defined crunch time at all. If we get (or are currently in!) a very slow takeoff to AGI, the timing of when an AI starts to become dangerous might be ambiguous. For example, you refer to early crunch time as the time between training and deploying an ASL-4 model, but the implementation of early possibly-dangerous AI might not follow the train-and-deploy pattern. It might instead look more like gradually adding and swapping out components in a framework that includes multiple models and tools. The point at which the overall system becomes dangerous might not be noticeable until significantly after the fact, especially if the lab is quickly iterating on a lot of different configurations.
Prosaic AI Safety research, in pre-crunch time.
Some people share a cluster of ideas that I think is broadly correct. I want to write down these ideas explicitly so people can push-back.
The experiments we are running today are kinda ‘
bullshit’[1] because the thing we actually care about doesn’t exist yet, i.e. ASL-4, or AI powerful enough that they could cause catastrophe if we were careless about deployment.The experiments in pre-crunch-time use pretty bad proxies.
90% of the “actual” work will occur in early-crunch-time, which is the duration between (i) training the first ASL-4 model, and (ii) internally deploying the model.
In early-crunch-time, safety-researcher-hours will be an incredible scarce resource.
The cost of delaying internal deployment will be very high: a billion dollars of revenue per day, competitive winner-takes-all race dynamics, etc.
There might be far fewer safety researchers in the lab than there currently are in the whole community.
Because safety-researcher-hours will be such a scarce resource, it’s worth spending months in pre-crunch-time to save ourselves days (or even hours) in early-crunch-time.
Therefore, even though the pre-crunch-time experiments aren’t very informative, it still makes sense to run them because they will slightly speed us up in early-crunch-time.
They will speed us up via:
Rough qualitative takeaways like “Let’s try technique A before technique B because in Jones et al. technique A was better than technique B.” However, the exact numbers in the Results table of Jones et al. are not informative beyond that.
The tooling we used to run Jones et al. can be reused for early-crunch-time, c.f. Inspect and TransformerLens.
The community discovers who is well-suited to which kind of role, e.g. Jones is good at large-scale unsupervised mech interp, and Smith is good at red-teaming control protocols.
Sometimes I use the analogy that we’re shooting with rubber bullets, like soldiers do before they fight a real battle. I think that might overstate how good the proxies are, it’s probably more like laser tag. But it’s still worth doing because we don’t have real bullets yet.
I want a better term here. Perhaps “practice-run research” or “weak-proxy research”?
On this perspective, the pre-crunch-time results are highly worthwhile. They just aren’t very informative. And these properties are consistent because the value-per-bit-of-information is so high.
My immediate critique would be step 7: insofar as people are updating today on experiments which are bullshit, that is likely to slow us down during early crunch, not speed us up. Or, worse, result in outright failure to notice fatal problems. Rather than going in with no idea what’s going on, people will go in with too-confident wrong ideas of what’s going on.
To a perfect Bayesian, a bullshit experiment would be small value, but never negative. Humans are not perfect Bayesians, and a bullshit experiment can very much be negative value to us.
Yep, I’ll bite the bullet here. This is a real problem and partly my motivation for writing the perspective explicitly.
I think people who are “in the know” are good at not over-updating on the quantitative results. And they’re good at explaining that the experiments are weak proxies which should be interpreted qualitatively at best. But people “out of the know” (e.g. junior ai safety researches) tend to overupdate and probably read the senior researchers as professing generic humility.
I would guess that even the “in the know” people are over-updating, because they usually are Not Measuring What They Think They Are Measuring even qualitatively. Like, the proxies are so weak that the hypothesis “this result will qualitatively generalize to <whatever they actually want to know about>” shouldn’t have been privileged in the first place, and the right thing for a human to do is ignore it completely.
Who (besides yourself) has this position? I feel like believing the safety research we do now is bullshit is highly correlated with thinking its also useless and we should do something else.
I do, though maybe not this extreme. Roughly every other day I bemoan the fact that AIs aren’t misaligned yet (limiting the excitingness of my current research) and might not even be misaligned in future, before reminding myself our world is much better to live in than the alternative. I think there’s not much else to do with a similar impact given how large even a 1% p(doom) reduction is. But I also believe that particularly good research now can trade 1:1 with crunch time.
Theoretical work is just another step removed from the problem and should be viewed with at least as much suspicion.
I like your emphasis on good research. I agree that the best current research does probably trade 1:1 with crunch time.
I think we should apply the same qualification to theoretical research. Well-directed theory is highly useful; poorly-directed theory is almost useless in expectation.
I think theory directed specifically at LLM-based takeover-capable systems is neglected, possibly in part because empiricists focused on LLMs distrust theory, while theorists tend to dislike messy LLMs.
I share almost exactly this opinion, and I hope it’s fairly widespread.
The issue is that almost all of the “something elses” seem even less productive on expectation.
(That’s for technical approaches. The communication-minded should by all means be working on spreading the alarm and so slowing progress and raising the ambient levels fo risk-awareness).
LLM research could and should get a lot more focused on future risks instead of current ones. But I don’t see alternatives that realistically have more EV.
It really looks like the best guess is that AGI is now quite likely to be descended from LLMs. And I see little practical hope of pausing that progress. So accepting the probabilities on the game board and researching LLMs/transformers makes sense even when it’s mostly practice and gaining just a little bit of knowledge of how LLMs/transformers/networks represent knowledge and generate behaviors.
It’s of course down to individual research programs; there’s a bunch of really irrelevant LLM research that would be better directed elsewhere. And having a little effort directed to unlikely scenarios where we get very different AGI is also defensible—as long as it’s defended, not just hope-based.
This is of course a major outstanding debate, and needs to be had carefully. But I’d really like to see more of this type of careful thinking about the likely efficiency of different research routes.
I think there’s low-hanging fruit in trying to improve research on LLMs to anticipate the new challenges that arrive when LLM-descended AGI becomes actually dangerous. My recent post LLM AGI may reason about its goals and discover misalignments by default suggests research addressing one fairly obvious possible new risk when LLM-based systems become capable of competent reasoning and planning.
Bullshit was a poor choice of words. A better choice would’ve been “weak proxy”. On this view, this is still very worthwhile. See footnote.
IIRC I heard the “we’re spending months now to save ourselves days (or even hours) later” from the control guys, but I don’t know if they’d endorse the perspective I’ve outlined
I do, which is why I’ve always placed much more emphasis on figuring out how to do automated AI safety research as safely as we can, rather than trying to come up with some techniques that seem useful at the current scale but will ultimately be a weak proxy (but are good for gaining reputation in and out of the community, cause it looks legit).
That said, I think one of the best things we can hope for is that these techniques at least help us to safely get useful alignment research in the lead up to where it all breaks and that it allows us to figure out better techniques that do scale for the next generation while also having a good safety-usefulness tradeoff.
To clarify, this means you don’t hold the position I expressed. On the view I expressed, experiments using weak proxies are worthwhile even though they aren’t very informative
Hmm, so I still hold the view that they are worthwhile even if they are not informative, particularly for the reasons you seem to have pointed to (i.e. training up good human researchers to identify who has a knack for a specific style of research s.t. we can use them for providing initial directions to AIs automating AI safety R&D as well as serving as model output verifiers OR building infra that ends up being used by AIs that are good enough to do tons of experiments leveraging that infra but not good enough to come up with completely new paradigms).
how confident are you that safety researchers will be able to coordinate at crunch time, and it won’t be eg. only safety researchers at one lab?
without taking things like personal fit into account, how would you compare say doing prosaic ai safety research pre-crunch time to policy interventions helping you coordinate better at crunch time (for instance helping safety teams coordinate better at crunch time, or even buying more crunch time)?
Not confident at all.
I do think that safety researchers might be good at coordinating even if the labs aren’t. For example, safety researchers tend to be more socially connected, and also they share similar goals and beliefs.
Labs have more incentive to share safety research than capabilities research, because the harms of AI are mostly externalised whereas the benefits of AI are mostly internalised.
This includes extinction obviously, but also misuse and accidental harms which would cause industry-wide regulations and distrust.
Even a few safety researchers at the lab could reduce catastrophic risk.
The recent OpenAI-Anthropic collaboration is super good news. We should be giving them more cudos for this.
OpenAI evaluates Anthropic models
Anthropic evaluates OpenAI models
I think buying more crunch time is great.
While I’m not excited by pausing AI[1], I do support pushing labs to do more safety work between training and deployment.[2][3]
I think sharp takeoff speeds are scarier than short timelines.
I think we can increase the effective-crunch-time by deploying Claude-n to automate much of the safety work that must occur between training and deploying Claude-(n+1). But I don’t know if there’s any ways which accelerate Claude-n at safety work but not the capabilities work.
I think it’s an honorable goal, but seems infeasible given the current landscape.
c.f. RSPs are pauses done right
Although I think the critical period for safety evals is between training and internal deployment, not training and external deployment. See Greenblatt’s Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
I’m curious if you have a sense of:
1. What the target goal of early-crunch time research should be (i.e. control safety case for the specific model one has at the present moment, trustworthy case for this specific model, trustworthy safety case for the specific model and deference case for future models, trustworthy safety case for all future models, etc...)
2. The rough shape(s) of that case (i.e. white-box evaluations, control guardrails, convergence guarantees, etc...)
3. What kinds of evidence you expect to accumulate given access to these early powerful models.
I expect I disagree with the view presented, but without clarification on the points above I’m not certain. I also expect my cruxes would route through these points
I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
i.e. the scaffold which monitors and modifies the activations, chains-of-thought, and tool use
For those who haven’t seen, coming from the same place as OP, I describe my thoughts in Automating AI Safety: What we can do today.
Specifically in the side notes:
Should we just wait for research systems/models to get better?
[...] Moreover, once end-to-end automation is possible, it will still take time to integrate those capabilities into real projects, so we should be building the necessary infrastructure and experience now. As Ryan Greenblatt has said, “Further, it seems likely we’ll run into integration delays and difficulties speeding up security and safety work in particular[…]. Quite optimistically, we might have a year with 3× AIs and a year with 10× AIs and we might lose half the benefit due to integration delays, safety taxes, and difficulties accelerating safety work. This would yield 6 additional effective years[…].” Building automated AI safety R&D ecosystems early ensures we’re ready when more capable systems arrive.
Research automation timelines should inform research plans
It’s worth reflecting on scheduling AI safety research based on when we expect sub-areas of safety research will be automatable. For example, it may be worth putting off R&D-heavy projects until we can get AI agents to automate our detailed plans for such projects. If you predict that it will take you 6 months to 1 year to do an R&D-heavy project, you might get more research mileage by writing a project proposal for this project and then focusing on other directions that are tractable now. Oftentimes it’s probably better to complete 10 small projects in 6 months and then one big project in an additional 2 months, rather than completing one big project in 7 months.
This isn’t to say that R&D-heavy projects are not worth pursuing—big projects that are harder to automate may still be worth prioritizing if you expect them to substantially advance downstream projects (such as ControlArena from UK AISI). But research automation will rapidly transform what is ‘low-hanging fruit’. Research directions that are currently impossible due to the time or necessary R&D required may quickly go from intractable to feasible to trivial. Carefully adapting your code, your workflow, and your research plans for research automation is something you can—and likely should—do now.
I’m also very interested in having more discussions on what a defence-in-depth approach would look like for early automated safety R&D, so that we can get value from it for longer and point the system towards the specific kinds of projects that will lead to techniques that scale to the next scale-up / capability increase.
A piece of pushback: there might not be a clearly defined crunch time at all. If we get (or are currently in!) a very slow takeoff to AGI, the timing of when an AI starts to become dangerous might be ambiguous. For example, you refer to early crunch time as the time between training and deploying an ASL-4 model, but the implementation of early possibly-dangerous AI might not follow the train-and-deploy pattern. It might instead look more like gradually adding and swapping out components in a framework that includes multiple models and tools. The point at which the overall system becomes dangerous might not be noticeable until significantly after the fact, especially if the lab is quickly iterating on a lot of different configurations.