I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
i.e. the scaffold which monitors and modifies the activations, chains-of-thought, and tool use