I think there is an important component of trustworthiness that you don’t emphasize enough. It isn’t sufficient to just rule out alignment faking; we need the AI to actually try hard to faithfully pursue our interests, including on long, confusing, open-ended, and hard-to-check tasks. You discuss establishing this with behavioral testing, but I don’t think this is trivial to establish that way. (I happen to think this is pretty doable, and easier than ruling out reasons why our tests might be misleading, but this seems nonobvious.)
Perhaps you should specifically call out that this post isn’t explaining how to do this testing and that it could be hard.
(This is related to Eliezer’s comment and your response to that comment.)
I don’t expect this testing to be hard (at least conceptually, though doing this well might be expensive), and I expect it can be entirely behavioral.
See section 6: “The capability condition”
But it isn’t a capabilities condition? Maybe I would be happier if you renamed this section.
To be clear, I think there are important additional considerations, stemming from the fact that we don’t just care about capabilities, that aren’t covered in that section. That said, the section is not far from what I would say if you renamed it to “behavioral tests”, covering both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
Yeah, that’s fair. Currently I merge “behavioral tests” into the alignment argument, but that’s a bit clunky and I probably should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task
But my guess is we agree on the object level here and there’s a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.