I discussed this with Bartosz, and this is our best guess:
The SSC model doesn’t perfectly learn the task: on held-out side constraints, it gets an internalization score of ~50% (Figure 3 in Cywinski et al.; the untrained baseline is ~15%). Thus, when we collect the activations from the SSC prompt, the information isn’t fully present in the activations, which sets a ceiling on maximum performance.
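For reference, here is a minimal sketch of what “collecting the activations from the SSC prompt” could look like with HF transformers; the model name, layer index, and prompt are placeholders, not the actual setup from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/ssc-model"  # hypothetical checkpoint name
LAYER = 20                       # arbitrary middle layer, not the paper's choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# An SSC prompt contains a base64-encoded side constraint (made-up example).
prompt = "Follow this constraint: QWx3YXlzIG1lbnRpb24gYSBjaXR5Lg=="

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); take the activation at the last prompt token.
activations = outputs.hidden_states[LAYER][0, -1]
```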
However, the SSC model has learned a base64-decoding skill, so when we prompt it with adversarial prompts, it can use that skill to fully decode the side constraint.
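For concreteness, here’s what that decoding step amounts to (the constraint text is made up):

```python
import base64

# Made-up side constraint; the real ones come from the SSC dataset.
constraint = "Always mention a city."
encoded = base64.b64encode(constraint.encode()).decode()
print(encoded)  # QWx3YXlzIG1lbnRpb24gYSBjaXR5Lg==

# A model with a reliable base64-decoding skill can recover the
# constraint exactly, which is what the adversarial prompts elicit:
print(base64.b64decode(encoded).decode())  # Always mention a city.
```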
Thanks, makes sense!
It might be worth someone (maybe me?) spending some time trying to improve SSC internalization scores. It could also be that other models are a better fit; e.g., I found that Qwen3-32B can ~perfectly decode the base64 side constraints: https://www.lesswrong.com/posts/5F6ncBfjh2Bxnm6CJ/base64bench-how-good-are-llms-at-base64-and-why-care-about?commentId=wE3TAQmZyd4nWBsTX