I expect that in settings where models are intrinsically motivated to hide their reasoning as opposed to being instructed to do so, they perform significantly better. I have some results suggesting this, which I’d be happy to share…
I’d be very interested if you could briefly describe these results!
One of the results is that after you convince Claude 3.5 Sonnet that its scratchpad is visible, it stops reasoning about alignment faking out loud while retaining the same compliance gap between the training and non-training settings. If you simply instruct it to hide its reasoning, it does no better than in the normal setting. If you give it a lot of assistance (i.e., telling it what to disguise its alignment-faking reasoning as in order to fool the monitor, which also helps the model by making that reasoning more salient), the rate of verbalized reasoning drops somewhat, but still only about half as much as when you convince it that its scratchpad is visible.
I should have a paper on this and related results public by the end of the month—happy to send an early draft if you’re interested, though it doesn’t contain all the results I’d like yet.
(Sorry for the late reply!)