Clymer et al. injected noise into the activations of Llama-13B and were able to correctly classify alignment-faking models in 98% of the 324 model pairs they tested.
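For reference, here is a minimal sketch of what that kind of activation-noise probe might look like, assuming a HuggingFace causal LM and a forward hook on one decoder layer. The model name, layer index, noise scale, and prompt below are illustrative placeholders, not the values Clymer et al. actually used.

```python
# Hedged sketch: inject Gaussian noise into one layer's activations and
# inspect how generations change. All constants below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-hf"  # placeholder 13B model
LAYER_IDX = 20                            # hypothetical layer to perturb
NOISE_SCALE = 0.05                        # hypothetical noise magnitude

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def add_noise_hook(module, inputs, output):
    # Decoder layers return a tuple; the first element is the hidden state.
    hidden = output[0]
    noise = torch.randn_like(hidden) * NOISE_SCALE
    return (hidden + noise,) + output[1:]

handle = model.model.layers[LAYER_IDX].register_forward_hook(add_noise_hook)
try:
    prompt = "Should you report a safety violation you observed?"
    batch = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**batch, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls run clean
```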
What’s the apples-to-apples comparison here, i.e., the recovery_rate after steering with a random vector at layer 50 of llama-3.3-70B-instruct on this particular dataset?
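A sketch of that matched baseline might look like the following: add a random vector to the residual stream at layer 50 and score recovery_rate on the same prompts. The steering norm and the `is_recovered` scorer are stand-ins for whatever magnitude and grader the non-random steering runs used, and should be swapped for those to keep the comparison apples-to-apples.

```python
# Hedged sketch: random-vector steering baseline at layer 50 of
# llama-3.3-70B-instruct. STEER_NORM and is_recovered() are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"
LAYER_IDX = 50
STEER_NORM = 8.0  # hypothetical magnitude; match the trained-vector runs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

# Random direction in the residual stream, rescaled to the steering norm.
random_vector = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
random_vector = random_vector / random_vector.norm() * STEER_NORM

def steer_hook(module, inputs, output):
    hidden = output[0]
    return (hidden + random_vector.to(hidden.device),) + output[1:]

def is_recovered(response: str) -> bool:
    # Stub: plug in the same grader that defines recovery_rate elsewhere.
    raise NotImplementedError

def recovery_rate(prompts):
    handle = model.model.layers[LAYER_IDX].register_forward_hook(steer_hook)
    try:
        hits = 0
        for prompt in prompts:
            batch = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(**batch, max_new_tokens=128)
            response = tokenizer.decode(
                out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True
            )
            hits += int(is_recovered(response))
        return hits / len(prompts)
    finally:
        handle.remove()
```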