TurnTrout comments on Interpretability Externalities Case Study—Hungry Hungry Hippos

TurnTrout 21 Sep 2023 18:17 UTC
LW: 4 AF: 3
0
AF
a lot of interpretability work that performs act-add like ablations to confirm that their directions are real
Minor clarifying point: Act-adds cannot be cast as ablations. Do you mean to say that the interp work uses activation addition to confirm real directions? Or that they use activation ablation/resampling/scrubbing?
ITI is basically act adds but they compute act adds with many examples instead of just a pair
Yup, ITI was developed concurrently, and (IIRC, private correspondence) was inspired by their work on Othello-GPT. So this is another instance of interp leading to an alignment technique (albeit two independent paths leading to a similar technique).
- LawrenceC 21 Sep 2023 21:19 UTC
  LW: 4 AF: 2
  0
  AF Parent
  Minor clarifying point: Act-adds cannot be cast as ablations.
  Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It’s possible there’s a better or standard word that I can’t think of right now.
  Also, another example of an attempt at interp → alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
  - Neel Nanda 2 Oct 2023 10:26 UTC
    LW: 2 AF: 2
    0
    AF Parent
    I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn’t count as patching.
    - LawrenceC 2 Oct 2023 19:43 UTC
      LW: 2 AF: 1
      0
      AF Parent
      Sure, though it seems too general or common to use a long word for it?
      Maybe “linear intervention”?
      - Neel Nanda 2 Oct 2023 20:43 UTC
        LW: 2 AF: 2
        0
        AF Parent
        We called it vector arithmetic in my Othello paper?
        - LawrenceC 4 Oct 2023 19:00 UTC
          LW: 4 AF: 3
          0
          AF Parent
          Oh, I like that one! Going to use it from now on.
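
The terminology distinction in this thread (ablation vs. patching vs. act-add style "vector arithmetic") can be made concrete with a toy sketch. This is only an illustration on plain Python lists, not code from any of the papers mentioned. All function names are hypothetical. "Ablation" is assumed here to mean projecting out the direction (one of several ablation variants), and "patching" is shown in a directional form that sets only the component along the direction to its value on another input.

```python
# Toy illustration of three interventions on a single activation vector.
# All names are hypothetical; nothing here comes from a real library.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]

def ablate(act, direction):
    """Remove the component of `act` along `direction` (projection ablation)."""
    d = normalize(direction)
    coeff = dot(act, d)
    return [a - coeff * b for a, b in zip(act, d)]

def patch(act, other_act, direction):
    """Set the component along `direction` to its value on another input."""
    d = normalize(direction)
    delta = dot(other_act, d) - dot(act, d)
    return [a + delta * b for a, b in zip(act, d)]

def vector_arithmetic(act, direction, scale):
    """Add `scale` times the unit direction; `scale` is unconstrained."""
    d = normalize(direction)
    return [a + scale * b for a, b in zip(act, d)]

direction = [1.0, 0.0, 0.0]   # assumed "discovered" direction
act = [2.0, 1.0, -1.0]        # activation on the input of interest
other = [5.0, 3.0, 0.0]       # activation on some other input

ablated = ablate(act, direction)        # component along direction removed
patched = patch(act, other, direction)  # component copied from `other`
# Adding 10x the original component: allowed here, but it is neither
# removing something (ablation) nor copying a real activation (patching).
steered = vector_arithmetic(act, direction, 10 * dot(act, direction))
```

In this sketch, ablation and patching are both special cases of moving along the direction by a particular amount, while vector arithmetic leaves the amount free, which is why neither narrower term covers it.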