I don’t expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.
See section 6: “The capability condition”
But it isn’t a capabilities condition? Maybe I would be happier if you renamed this section.
To be clear, I think there are important additional considerations, not covered in that section, that stem from the fact that we don’t just care about capabilities. That said, the section is not that far from what I would say if you renamed it to “behavioral tests”, covering both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
Yeah that’s fair. Currently I merge “behavioral tests” into the alignment argument, but that’s a bit clunky and I prob should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task
But my guess is we agree on the object level here and there’s a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.