I think I roughly stand behind my perspective in this dialogue. I feel somewhat more cynical now than I did at the time, perhaps partly due to actual updates from the world and partly because I was trying to argue for the optimistic case here, which put me in a somewhat different frame.
Here are some ways my perspective differs now:
I wish I had said something like: “AI companies probably won’t actually pause unilaterally, so the hope for voluntary RSPs has to be building consensus or helping to motivate the development of countermeasures.” I don’t think I would have disagreed with this statement at the time, or at least not fully, and it seems like important context.
I think in practice we’re unlikely to end up with specific tests that are defined in advance and aren’t goodhartable or cheatable. I do think that control could in principle be defined in advance and made hard to goodhart via external evaluation, but I don’t expect companies to commit to specific tests which are hard to goodhart or cheat. They could, however, make procedural commitments to third-party review which are hard to cheat: something like “this third party will review the available evidence (including our safety report and all applicable internal knowledge) and then make a public statement about the level of risk and whether there is important information which should be disclosed to the public”. (I could outline this proposal in more detail; it’s mostly not my original idea.)
I’m somewhat more interested in companies focusing on things other than safety cases and commitments: either trying to get evidence of risk that might be convincing to others (in worlds where these risks are large), or working on at-the-margin safety interventions from a cost-benefit perspective.