Another consideration: mechanistic interpretability might resolve some longstanding disagreements about the nature and difficulty of alignment without making the actual problem any easier. Settling those disagreements in favor of "alignment is hard" could build the kind of consensus that coordination currently lacks. So, in the world where alignment turns out to be hard, interpretability research might make AI governance more effective and more possible.
I think this is a consideration mostly in favor of doing and publishing interpretability research, at least for now. But if interpretability ever advances to the point where interpretability results on their own are enough to update previously-skeptical LWers towards higher p(doom), it's probably worth re-evaluating that tradeoff.