I agree that confessions right now are better at eliciting what the model did than why it did it. Generally the reason for why a model did something is hard to determine without doing a casual intervention (eg running the same prompt without telling the model that good performance will lead to non deployment).
Indeed, while in this case the model might know the reason, it seems that in some other cases models can have “implicit bias” and unclear if they know all the factors that impacted the output.
I don’t think it’s a barrier to training for confessions on all RL environments but it could be a barrier for extracting certain types of information from confessions.
I mainly mean it seems useful to understand why it made it harder to extract legible evidence of misalignment in the rare case where we do have it (and have the supporting ablations). Again I’d note this is a regression case, so it’s not simply worse at extracting certain types of information than others, it’s worse than a baseline where this intervention wasn’t applied. Confession of misalignment in the rare cases where we can elicit it seems like an important (and apparently fragile based on this limited case) indicator, and I’d be fairly worried that if there is a differential effect here (ex: in extreme case imagine confessions training works well for reward hacking but makes models less likely to reveal other misalignment) it’s not directly in the way of capabilities (in the same way reward hacking is) so it’ll be easy to avoid studying before applying this at scale.
I agree it’s worth studying. Sometimes some of these categories are dependent on exactly how we define bad behavior—for example do we define it as “answered at most 5 questions correctly” or do we define it as “answered at most 5 questions correctly because it wanted to avoid non deployment”. However this is worth digging into more and we do plan to do that. Just to be clear we do believe more study is required- and are currently working on this - before actually deploying confessions.
I agree that confessions right now are better at eliciting what the model did than why it did it. Generally the reason for why a model did something is hard to determine without doing a casual intervention (eg running the same prompt without telling the model that good performance will lead to non deployment).
Indeed, while in this case the model might know the reason, it seems that in some other cases models can have “implicit bias” and unclear if they know all the factors that impacted the output.
I don’t think it’s a barrier to training for confessions on all RL environments but it could be a barrier for extracting certain types of information from confessions.
I mainly mean it seems useful to understand why it made it harder to extract legible evidence of misalignment in the rare case where we do have it (and have the supporting ablations). Again I’d note this is a regression case, so it’s not simply worse at extracting certain types of information than others, it’s worse than a baseline where this intervention wasn’t applied. Confession of misalignment in the rare cases where we can elicit it seems like an important (and apparently fragile based on this limited case) indicator, and I’d be fairly worried that if there is a differential effect here (ex: in extreme case imagine confessions training works well for reward hacking but makes models less likely to reveal other misalignment) it’s not directly in the way of capabilities (in the same way reward hacking is) so it’ll be easy to avoid studying before applying this at scale.
I agree it’s worth studying. Sometimes some of these categories are dependent on exactly how we define bad behavior—for example do we define it as “answered at most 5 questions correctly” or do we define it as “answered at most 5 questions correctly because it wanted to avoid non deployment”. However this is worth digging into more and we do plan to do that. Just to be clear we do believe more study is required- and are currently working on this - before actually deploying confessions.