Hey, thanks for the comment! We are currently using probes to gain some initial traction on the problem. However, the unsupervised nature of SAE labels seems better aligned with the broader application of activation-space methods for AI control.

At the moment, I'm genuinely uncertain how much the distribution shift between synthetic and "real" deceptive trajectories matters. In particular, with pretrained SAEs, I could see it not being a huge deal.
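For readers unfamiliar with the probing setup mentioned above, here is a minimal sketch of a linear probe trained to separate honest from deceptive trajectories. Everything here is illustrative: the activations are fabricated separable Gaussians standing in for real residual-stream vectors, and the dimensions and labels are assumptions, not details from the actual project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: one activation vector per trajectory, labeled
# honest (0) vs. deceptive (1). In the real pipeline these would come
# from a model forward pass; here we fabricate separable data just to
# show the shape of the approach.
rng = np.random.default_rng(0)
d_model = 64  # assumed activation dimension
honest = rng.normal(0.0, 1.0, size=(200, d_model))
deceptive = rng.normal(0.5, 1.0, size=(200, d_model))  # shifted mean

X = np.vstack([honest, deceptive])
y = np.array([0] * 200 + [1] * 200)

# A linear probe is just a logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The supervised label requirement is exactly the limitation alluded to above: the probe needs labeled deceptive examples, whereas SAE features are learned without labels and only need to be interpreted afterward.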