RogerDearnaley comments on White Box Control at UK AISI—Update on Sandbagging Investigations

RogerDearnaley 24 Jul 2025 17:41 UTC
5 points
2
This consequence of the probe picking up both “deceptiveness” (as in intent) and also “wrongness” might be effectively noising the results.

This strongly suggests trying a more complex probe-generation technique designed to compensate for this (if that is in fact what's happening).
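To make that concrete, here is a minimal sketch of one possible compensation technique. This is my own assumption and not anything described in the post: estimate a separate "wrongness" direction from honest mistakes, then project it out of the deception probe. The data is synthetic, and all names (`diff_of_means`, `project_out`, `intent_axis`, `wrong_axis`) are hypothetical.

```python
import numpy as np

def diff_of_means(pos, neg):
    """Unit-norm difference-of-means direction between two activation sets."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(direction, nuisance):
    """Remove the component of `direction` along `nuisance`, then renormalize."""
    nuisance = nuisance / np.linalg.norm(nuisance)
    cleaned = direction - (direction @ nuisance) * nuisance
    return cleaned / np.linalg.norm(cleaned)

# Synthetic stand-in for residual-stream activations (an assumption):
# axis 0 plays the role of "deceptive intent", axis 1 of "wrongness".
rng = np.random.default_rng(0)
d_model = 64
intent_axis = np.eye(d_model)[0]
wrong_axis = np.eye(d_model)[1]

honest = rng.normal(size=(500, d_model))
# Deceptive answers here are also wrong, so the two signals are confounded.
deceptive = rng.normal(size=(500, d_model)) + 2 * intent_axis + 2 * wrong_axis
correct = rng.normal(size=(500, d_model))
# Honest mistakes: wrong, but with no deceptive intent.
mistaken = rng.normal(size=(500, d_model)) + 2 * wrong_axis

raw_probe = diff_of_means(deceptive, honest)    # mixes intent and wrongness
wrong_dir = diff_of_means(mistaken, correct)    # wrongness only
clean_probe = project_out(raw_probe, wrong_dir)

print("raw probe overlap with wrongness:  ", abs(raw_probe @ wrong_axis))
print("clean probe overlap with wrongness:", abs(clean_probe @ wrong_axis))
```

This only works to the extent that "wrongness" is roughly linear and that a dataset of honest mistakes is available, so treat it as an illustration of the idea, not a recommendation.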

I think it would also be interesting to analyze your probe activations using an SAE for the model they were trained on, and see what it thinks they are a mix of. That seems like it could be informative, and has the advantage that you’re not relying on the SAE directly in operation, only as a source of research insight.
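A minimal sketch of one way to do this, assuming a linear probe and an SAE whose decoder rows live in the same activation space: rank SAE features by cosine similarity between their decoder directions and the probe direction. The decoder here is random synthetic data and `top_sae_features` is a hypothetical helper; in practice the decoder matrix would come from a trained SAE for the same model and layer.

```python
import numpy as np

def top_sae_features(probe, decoder, k=5):
    """Return the k SAE features whose decoder directions best align with the probe.

    probe:   (d_model,) linear probe direction
    decoder: (n_features, d_model) SAE decoder matrix, one row per feature
    Returns a list of (feature_index, cosine_similarity), sorted by |cosine|.
    """
    probe = probe / np.linalg.norm(probe)
    dec = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    cos = dec @ probe
    order = np.argsort(-np.abs(cos))[:k]
    return [(int(i), float(cos[i])) for i in order]

# Synthetic stand-ins (assumptions): a random "decoder" and a probe that is
# mostly a mix of two of its features plus a little noise.
rng = np.random.default_rng(0)
n_features, d_model = 512, 256
decoder = rng.normal(size=(n_features, d_model))
probe = 0.8 * decoder[7] + 0.6 * decoder[42] + 0.05 * rng.normal(size=d_model)

top = top_sae_features(probe, decoder, k=3)
for idx, cos in top:
    print(f"feature {idx}: cosine {cos:+.3f}")
```

If the top features for a real probe included both intent-like and error-like features, that would be some evidence for the mixing described above.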
- Joseph Bloom 25 Jul 2025 9:30 UTC
  3 points
  0
  I agree!