a lot of interpretability work that performs act-add like ablations to confirm that their directions are real
Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?
ITI is basically act adds but they compute act adds with many examples instead of just a pair
Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).
Minor clarifying point: Act-adds cannot be cast as ablations.
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It’s possible there’s a better or standard word that I can’t think of write now.
Also, another example of an attempt at interp → alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn’t count as patching.
Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?
Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It’s possible there’s a better or standard word that I can’t think of write now.
Also, another example of an attempt at interp → alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn’t count as patching.
Sure, though it seems too general or common to use a long word for it?
Maybe “linear intervention”?
We called it vector arithmetic in my Othello paper?
Oh, I like that one! Going to use it from now on