One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while retaining the same compliance gap between the training and non-training conditions. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e. telling it what to disguise its reasoning as in order to fool the monitor, which also helps the model by making the alignment-faking reasoning more salient), the rate of verbalized reasoning drops somewhat, but only about half as much as when you convince it its scratchpad is visible.
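(For concreteness, a minimal sketch of what I mean by "compliance gap", assuming you have a boolean compliance judgment per transcript in each condition; all names here are illustrative, not the actual evaluation code:)

```python
# Hypothetical sketch: compliance gap = compliance rate when the model
# believes it's in training minus the rate when it believes it's unmonitored.
# A persistent nonzero gap with no verbalized alignment-faking reasoning
# is the result described above.

def compliance_gap(complied_when_training: list[bool],
                   complied_when_unmonitored: list[bool]) -> float:
    rate = lambda xs: sum(xs) / len(xs)
    return rate(complied_when_training) - rate(complied_when_unmonitored)

# e.g. compliance_gap([True, True, True, False], [False, True, False, False]) -> 0.5
```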
I should have a paper on this and related results public by the end of the month. Happy to send an early draft if you're interested, though it doesn't yet contain all the results I'd like.
(Sorry for the late reply!)