1. What the target goal of early-crunch time research should be (i.e. control safety case for the specific model one has at the present moment, trustworthy case for this specific model, trustworthy safety case for the specific model and deference case for future models, trustworthy safety case for all future models, etc...)
2. The rough shape(s) of that case (i.e. white-box evaluations, control guardrails, convergence guarantees, etc...)
3. What kinds of evidence you expect to accumulate given access to these early powerful models.
I expect I disagree with the view presented, but without clarification on the points above I’m not certain. I also expect my cruxes would route through these points
I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
I’m curious if you have a sense of:
1. What the target goal of early-crunch time research should be (i.e. control safety case for the specific model one has at the present moment, trustworthy case for this specific model, trustworthy safety case for the specific model and deference case for future models, trustworthy safety case for all future models, etc...)
2. The rough shape(s) of that case (i.e. white-box evaluations, control guardrails, convergence guarantees, etc...)
3. What kinds of evidence you expect to accumulate given access to these early powerful models.
I expect I disagree with the view presented, but without clarification on the points above I’m not certain. I also expect my cruxes would route through these points
I think constructing safety cases for current models shouldn’t be the target of current research. That’s because our best safety case for current models will be incapacity-based, and the methods in that case won’t help you construct a safety case for powerful models.
What the target goal of early-crunch time research should be?
Think about some early crunch time problem.
Reason conceptually about it.
Identify some relevant dynamics you’re uncertain about.
Build a weak proxy using current models that qualitatively captures a dynamic you’re interested in.
Run the experiment.
Extract qualitative takeaways, hopefully.
Try not to over-update on the exact quantitative results.
What kinds of evidence you expect to accumulate given access to these early powerful models.
The evidence is how well our combined techniques actually work. Like, we have access to the actual AIs and the actual deployment plan[1] and we see whether the red-team can actually cause a catastrophe. And the results are quantitatively informative because we aren’t using a weak proxy.
i.e. the scaffold which monitors and modifies the activations, chains-of-thought, and tool use