Fine-tuned an SAE on deceptive/non-deceptive reasoning traces from Gemma 9B
If you generate synthetic deceptive trajectories, how can you be sure the SAE will generalise to ‘real’ deceptive trajectories? Also, in those cases, why do you need SAEs at all? Could you use probes instead?
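A rough sketch of what the probe alternative and the generalisation check could look like. Everything here is an assumption for illustration: the activations are toy Gaussians rather than real Gemma activations, the "real" set is simulated by weakening the signal along the same direction, and the function names (`train_linear_probe`, `make_toy_activations`, `probe_accuracy`) are hypothetical. The idea is to fit a linear probe on the synthetic set only, then score it on the shifted set it never saw.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(deceptive)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # gradient step with L2 penalty
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples where the sign of the logit matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def make_toy_activations(rng, n, d, direction, shift):
    """Hypothetical stand-in for activations: noise plus a label-dependent shift."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, direction) * shift
    return X, y.astype(float)


rng = np.random.default_rng(0)
d = 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# "Synthetic" training set: strong signal. "Real" test set: weaker signal (assumed).
X_syn, y_syn = make_toy_activations(rng, 400, d, direction, shift=2.0)
X_real, y_real = make_toy_activations(rng, 400, d, direction, shift=1.0)

w, b = train_linear_probe(X_syn, y_syn)
acc_syn = probe_accuracy(w, b, X_syn, y_syn)
acc_real = probe_accuracy(w, b, X_real, y_real)
print(f"synthetic accuracy: {acc_syn:.3f}, shifted accuracy: {acc_real:.3f}")
```

The gap between the two accuracies is the quantity the question is about. With real data the held-out set would need to be deceptive trajectories that were not produced by the synthetic generation process.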
They have swapped the positions of Black’s king and queen