Hey, thanks for the comment! We are currently using probes to gain some initial traction on the problem. However, the unsupervised nature of SAE labels seems better aligned with the broader application of activation-space methods for AI control.

At the moment, I'm genuinely uncertain how much the distribution shift between synthetic and "real" deceptive trajectories matters. In particular, with pretrained SAEs, I could see it not being a huge deal.
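For readers unfamiliar with the probing setup mentioned above, here is a minimal sketch of a linear probe trained to separate honest from deceptive trajectories. Everything here is illustrative: the activations are fabricated separable Gaussians standing in for real residual-stream vectors, and the dimensions and labels are assumptions, not details from the actual project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: one activation vector per trajectory, labeled
# honest (0) vs. deceptive (1). In the real pipeline these would come
# from a model forward pass; here we fabricate separable data just to
# show the shape of the approach.
rng = np.random.default_rng(0)
d_model = 64  # assumed activation dimension
honest = rng.normal(0.0, 1.0, size=(200, d_model))
deceptive = rng.normal(0.5, 1.0, size=(200, d_model))  # shifted mean

X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)

# A linear probe is just a logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The supervised label requirement is exactly the limitation alluded to above: the probe needs labeled deceptive examples, whereas SAE features are learned without labels and only need to be interpreted afterward.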