LawrenceC comments on Interpretability Externalities Case Study—Hungry Hungry Hippos

LawrenceC 21 Sep 2023 21:19 UTC
LW: 4 AF: 2
0
AF
Minor clarifying point: Act-adds cannot be cast as ablations.
Sorry, ablation might be the wrong word here (but people use it anyway): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It’s possible there’s a better or standard word that I can’t think of right now.
Also, another example of an attempt at interp → alignment would arguably be the model editing stuff following causal tracing in the ROME paper?
- Neel Nanda 2 Oct 2023 10:26 UTC
  LW: 2 AF: 2
  0
  AF Parent
  I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought patch, but patch should be about setting it to be equal to the activation on another input, and being able to put in 10x the original value doesn’t count as patching.
  - LawrenceC 2 Oct 2023 19:43 UTC
    LW: 2 AF: 1
    0
    AF Parent
    Sure, though it seems too general or common to use a long word for it?
    Maybe “linear intervention”?
    - Neel Nanda 2 Oct 2023 20:43 UTC
      LW: 2 AF: 2
      0
      AF Parent
      We called it vector arithmetic in my Othello paper?
      - LawrenceC 4 Oct 2023 19:00 UTC
        LW: 4 AF: 3
        0
        AF Parent
        Oh, I like that one! Going to use it from now on.
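
For readers following the terminology in this thread, here is a minimal NumPy sketch of the three interventions being distinguished: ablation (removing the component along a direction), patching (setting the activation equal to the one from another input), and vector arithmetic (moving along the direction by an arbitrary amount). The function names and toy vectors are hypothetical illustrations, not code from the ROME or Othello papers, and a real experiment would apply these to a model's internal activations rather than hand-written arrays.

```python
import numpy as np


def ablate(act, direction):
    """Ablation: remove the component of `act` along `direction`."""
    d = direction / np.linalg.norm(direction)
    # Projection coefficient of act onto the unit direction.
    coeff = np.sum(act * d, axis=-1, keepdims=True)
    return act - coeff * d


def patch(act, other_act):
    """Patching: replace the activation with the one from another input."""
    # `act` is unused on purpose: the result depends only on the other input.
    return np.array(other_act, copy=True)


def add_vector(act, direction, coeff):
    """Vector arithmetic: move along `direction` by an arbitrary coefficient."""
    return act + coeff * direction


# Toy example (assumed values, for illustration only).
act = np.array([1.0, 2.0, 3.0])
other_act = np.array([5.0, 5.0, 7.0])
direction = np.array([0.0, 0.0, 2.0])

ablated = ablate(act, direction)
patched = patch(act, other_act)
steered = add_vector(act, direction, coeff=10.0)
```

Only the last function can push the activation to something like 10x its original value along the direction, which is why it fits neither "ablation" nor "patching" as those words are used above.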