Thanks for the comment. I think the experiments you mention are good (which is why I think the paper met 3), but I don't think its competitiveness has been demonstrated (which is why I think the paper did not meet 6 or 10). I see two problems.
First, it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.
Second, there's no baseline that the SAE edits are compared against. There are lots of techniques from the model-editing, finetuning, steering, rep-E, data-curation, etc. literatures that people use to make specific changes to models' behaviors, and ideally we'd want SAEs to be competitive with them. Admittedly, good comparisons would be hard, because using SAEs to edit models is a fairly unique method with a lot of compute required upfront, which makes it non-trivial to compare the difficulty of making different changes across methods. But that does not obviate the need for baselines.
Do you not consider the steering examples in the recent paper to be a practical task, or do you think competitiveness hasn't been demonstrated (because people were already doing activation steering without SAEs)? My understanding of the case for activation steering with features learned without supervision is that it could circumvent some failure modes of RLHF.
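(For concreteness, here is a minimal sketch of the kind of SAE-based activation steering I have in mind: adding a scaled SAE decoder direction to the residual stream via a forward hook. This is my own illustration under assumptions, not the paper's code; the model attribute path, the decoder matrix `W_dec`, and the specific feature index and scale are all hypothetical placeholders.)

```python
import torch

def steer_with_sae_feature(block, w_dec, feature_idx, scale):
    """Register a forward hook that adds `scale` times one (normalized) SAE decoder
    direction to the residual-stream output of a transformer block."""
    direction = w_dec[feature_idx]               # shape: (d_model,)
    direction = direction / direction.norm()     # normalize so `scale` is comparable across features

    def hook(module, inputs, output):
        # Many HF blocks return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, d_model)
        steered = hidden + scale * direction.to(device=hidden.device, dtype=hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return block.register_forward_hook(hook)

# Hypothetical usage with a GPT-2-style HuggingFace model and a trained SAE whose
# decoder weight `sae.W_dec` has shape (n_features, d_model):
# handle = steer_with_sae_feature(model.transformer.h[8], sae.W_dec, feature_idx=1234, scale=5.0)
# ... generate as usual; the hook steers every forward pass ...
# handle.remove()
```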