Thank you for providing this valuable synthesis! In reference to:
4. It is a form of movement building.
Based on my personal interactions with more experienced academics, many seem to view the objectives of mechanistic interpretability as overly ambitious (see Footnote 3 in https://distill.pub/2020/circuits/zoom-in/ as an example). This perception may deter them from engaging in interpretability research. In general, advances in capabilities appear easier to achieve than improvements in alignment. Combined with the ML field's emphasis on researcher productivity, measured by metrics such as the h-index, this incentivizes academics to select research areas that look more tractable.
By publishing mechanistic interpretability work, I think the perception of the field's difficulty can be shifted for the better, thereby increasing the safety/capabilities ratio of the ML community's output. As the original post acknowledges, this approach could also have negative consequences for various reasons.