Neel Nanda comments on Interpretability Externalities Case Study—Hungry Hungry Hippos

Neel Nanda 2 Oct 2023 13:35 UTC
LW: 6 AF: 4
Note that the previous post on this topic from Lawrence and Marius includes public comments about interp externalities from several people, including me.

The rough TLDR of my opinion is that interp seems reasonably far from being dangerous to capabilities, and there’s a ton of work to be done. I think the value of building a thriving academic field around it is super high, and significant barriers to publishing and sharing work are toxic to this; IMO this consideration dominates over some capabilities externalities. I also think that getting capabilities people to pivot from whatever they’re doing now to mech interp seems like a win: it’s not clear this should be more capabilities-y than the counterfactual, but it’s clearly better for alignment in expectation. I also think junior people get way too stressed about this and should chill out, since your first piece of work is highly unlikely to have any effects, positive or negative.
What links here?
- LawrenceC's comment on LawrenceC’s Shortform by LawrenceC (20 Mar 2024 2:42 UTC; 52 points)