I agree it’s a good post, and it does take guts to tell people when you think that a research direction that you’ve been championing hard actually isn’t the Holy Grail. This is a bit of a nitpick but not insubstantial:
Neel is talking about interpretability in general, not just mech interp. He claims his predictions account for other, non-mech-interp approaches to interpretability that some researchers find promising, such as representation engineering (RepE), which Dan Hendrycks, among others, has been advocating for recently.
This is true, though to be clear, I’m specifically making the point that interpretability will not be a highly reliable method on its own for establishing the lack of deception. In my opinion this is true of all current approaches, but I think only mech interp people have ever seriously claimed that their approach might succeed this hard.
Whoops, yes, thanks, edited.