In adversarial situation in high-dimensional spaces, attack tends to be easier than defense: an attacker can search a very large number of possibilities to find one viable attack, a defender needs to cover all of them. (Cryptography is deliberately engineering the opposite, where attacking is expensive.) So in AI alignment, we should attempt to be the attacker whenever possible, and try to put the scheming AI/model organism on the defense. This suggests approaches like extracting true confessions (as opposed to breaking CoT steganography, which is allowing the model to put us in a cryptographic situation). Extracting true confessions turns out to be rather easy — but if it wasn’t, then applying approaches like automated jailbreaking to it would be an obvious thing to try.
In adversarial situation in high-dimensional spaces, attack tends to be easier than defense: an attacker can search a very large number of possibilities to find one viable attack, a defender needs to cover all of them. (Cryptography is deliberately engineering the opposite, where attacking is expensive.) So in AI alignment, we should attempt to be the attacker whenever possible, and try to put the scheming AI/model organism on the defense. This suggests approaches like extracting true confessions (as opposed to breaking CoT steganography, which is allowing the model to put us in a cryptographic situation). Extracting true confessions turns out to be rather easy — but if it wasn’t, then applying approaches like automated jailbreaking to it would be an obvious thing to try.