I discussed this with Bartosz, and this is our best guess:
The SSC model doesn’t perfectly learn the task: on held-out side constraints, it gets an internalization score of ~50% (Figure 3 in Cywinski et al.; the untrained baseline is ~15%). Thus, when we collect the activations from the SSC prompt, the information isn’t fully present in the activations, which sets a ceiling on maximum performance.
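For reference, here is a minimal sketch of what “collecting the activations from the SSC prompt” could look like with HF transformers; the model name, layer index, and prompt are placeholders, not the actual setup from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/ssc-model"  # hypothetical checkpoint name
LAYER = 20                       # arbitrary middle layer, not the paper's choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# An SSC prompt contains a base64-encoded side constraint (made-up example).
prompt = "Follow this constraint: QWx3YXlzIG1lbnRpb24gYSBjaXR5Lg=="

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); take the activation at the last prompt token.
activations = outputs.hidden_states[LAYER][0, -1]
```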
However, the SSC model has learned a base64-decoding skill, so when we prompt it with adversarial prompts, it can use that skill to fully decode the side constraint.
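For concreteness, here’s what that decoding step amounts to (the constraint text is made up):

```python
import base64

# Made-up side constraint; the real ones come from the SSC dataset.
constraint = "Always mention a city."
encoded = base64.b64encode(constraint.encode()).decode()
print(encoded)  # QWx3YXlzIG1lbnRpb24gYSBjaXR5Lg==

# A model with a reliable base64-decoding skill can recover the
# constraint exactly, which is what the adversarial prompts elicit:
print(base64.b64decode(encoded).decode())  # Always mention a city.
```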
Thanks, makes sense!
It might be worth someone (maybe me?) spending some time trying to improve SSC internalization scores. It could also be that other models are a better fit; e.g., I found that Qwen3-32B can ~perfectly decode the base64 side constraints: https://www.lesswrong.com/posts/5F6ncBfjh2Bxnm6CJ/base64bench-how-good-are-llms-at-base64-and-why-care-about?commentId=wE3TAQmZyd4nWBsTX