I mainly mean it seems useful to understand why it made it harder to extract legible evidence of misalignment in the rare case where we do have that evidence (and have the supporting ablations). Again, I'd note this is a regression: it's not simply worse at extracting certain types of information than others; it's worse than a baseline where this intervention wasn't applied. Confession of misalignment, in the rare cases where we can elicit it, seems like an important (and, based on this limited case, apparently fragile) indicator. I'd be fairly worried if there is a differential effect here (e.g., in the extreme case, imagine confession training works well for reward hacking but makes models less likely to reveal other misalignment): such an effect isn't directly in the way of capabilities (the way reward hacking is), so it would be easy to avoid studying it before applying this at scale.
I agree it's worth studying. Some of these categories depend on exactly how we define bad behavior: for example, do we define it as "answered at most 5 questions correctly" or as "answered at most 5 questions correctly because it wanted to avoid non-deployment"? This is worth digging into more, and we do plan to do that. Just to be clear, we do believe more study is required before actually deploying confessions, and we are currently working on this.