Minor clarifying point: Act-adds cannot be cast as ablations.
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It’s possible there’s a better or standard word that I can’t think of write now.
Also, another example of an attempt at interp → alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn’t count as patching.
Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It’s possible there’s a better or standard word that I can’t think of write now.
Also, another example of an attempt at interp → alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn’t count as patching.
Sure, though it seems too general or common to use a long word for it?
Maybe “linear intervention”?
We called it vector arithmetic in my Othello paper?
Oh, I like that one! Going to use it from now on