Thanks for the comment. I think the experiments you mention are good (which is why I think the paper met 3), but I don't think its competitiveness has been demonstrated (which is why I think the paper did not meet 6 or 10). I see two problems.
First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.
Second, there's no baseline that the SAE edits are compared against. There are lots of techniques from the model-editing, finetuning, steering, rep-E, data-curation, etc. literatures that people use to make specific changes to models' behaviors, and ideally we'd want SAEs to be competitive with them. Admittedly, good comparisons would be hard, because using SAEs to edit models is a fairly unique method with a lot of compute required upfront, which makes it non-trivial to compare the difficulty of making different changes across methods. But that does not obviate the need for baselines.
Do you not consider the steering examples in the recent paper to be a practical task, or do you think competitiveness hasn't been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with features learned without supervision is that it could circumvent some failure modes of RLHF.
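(For concreteness, here is a minimal sketch of the kind of SAE-based activation steering I have in mind: adding a scaled SAE decoder direction to the residual stream via a forward hook. This is my own illustration under assumptions, not the paper's code; the model attribute path, the decoder matrix `W_dec`, and the specific feature index and scale are all hypothetical placeholders.)

```python
import torch

def steer_with_sae_feature(block, w_dec, feature_idx, scale):
    """Register a forward hook that adds `scale` times one (normalized) SAE decoder
    direction to the residual-stream output of a transformer block."""
    direction = w_dec[feature_idx]               # shape: (d_model,)
    direction = direction / direction.norm()     # normalize so `scale` is comparable across features

    def hook(module, inputs, output):
        # Many HF blocks return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, d_model)
        steered = hidden + scale * direction.to(device=hidden.device, dtype=hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return block.register_forward_hook(hook)

# Hypothetical usage with a GPT-2-style HuggingFace model and a trained SAE whose
# decoder weight `sae.W_dec` has shape (n_features, d_model):
# handle = steer_with_sae_feature(model.transformer.h[8], sae.W_dec, feature_idx=1234, scale=5.0)
# ... generate as usual; the hook steers every forward pass ...
# handle.remove()
```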