Finetuned an SAE on deceptive/non-deceptive reasoning traces from Gemma 9B
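For reference, a hedged sketch of what finetuning a pretrained SAE on trace activations could look like, using the standard linear-encoder/decoder recipe with a ReLU and an L1 sparsity penalty. All dimensions, file names, and hyperparameters below are illustrative placeholders, not the actual setup.

```python
# Sketch only: a standard SAE (linear encoder/decoder, ReLU, L1 penalty)
# finetuned on activations collected from reasoning traces.
# Dimensions, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn

class SAE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

d_model, d_hidden, l1_coeff = 3584, 16 * 3584, 1e-3  # hypothetical sizes
sae = SAE(d_model, d_hidden)
# sae.load_state_dict(torch.load("pretrained_sae.pt"))  # hypothetical checkpoint
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, d_model)  # stand-in for residual-stream activations
for batch in acts.split(256):
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```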
If you generate synthetic deceptive trajectories, how can you be sure the SAE will generalise to ‘real’ deceptive trajectories? Also, in those cases, why do you need SAEs at all? Could you use probes instead?
Hey, thanks for the comment! We are currently using probes to gain some initial traction on the problem. However, the unsupervised nature of SAE features seems to align better with the broader application of activation-space methods for AI control.
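To make “probe” concrete, here is a minimal sketch of the kind of linear probe we have in mind: a logistic regression on pooled activations. The random arrays are placeholders; in practice the features would be (for example) mean-pooled Gemma 9B residual-stream vectors, one per trace.

```python
# Minimal linear-probe sketch: classify traces as deceptive vs. honest
# from pooled activations. The data below is random placeholder material.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 3584  # hypothetical hidden size
X = rng.standard_normal((1000, d_model))  # one pooled activation per trace
y = rng.integers(0, 2, size=1000)         # 1 = deceptive, 0 = honest

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000, C=0.1)  # L2-regularised linear probe
probe.fit(X_tr, y_tr)
print("held-out AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```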
At the moment, I’m genuinely unsure how big a deal the distribution shift between synthetic and “real” deceptive trajectories is. Especially with pretrained SAEs, I could see it not being a huge deal.
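One cheap way to get evidence on this, sketched below under the same placeholder assumptions as above: fit the detector on synthetic trajectories only and score it on a held-out set of “real” ones, so the gap between the two AUROCs directly measures the effect of the distribution shift.

```python
# Cross-distribution check (sketch): train on synthetic traces, evaluate
# on "real" ones. Arrays are placeholders for the two data sources; the
# same check works with raw activations or SAE feature activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
d_model = 3584
X_syn, y_syn = rng.standard_normal((800, d_model)), rng.integers(0, 2, 800)
X_real, y_real = rng.standard_normal((200, d_model)), rng.integers(0, 2, 200)

clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
auc_syn = roc_auc_score(y_syn, clf.predict_proba(X_syn)[:, 1])
auc_real = roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])
print(f"synthetic AUROC: {auc_syn:.2f}, real-trace AUROC: {auc_real:.2f}")
```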