Explainable AI and interpretable ML research and methods, apart from the work of researchers affiliated with the rationalist scene, are for some reason excluded from the narrative. Is it really your view that ‘mechanistic interpretability’ is so different that it constitutes an entirely separate field? Doesn’t it seem a bit questionable that the term ‘mechanistic interpretability’ was coined to distance Olah’s research from other explanation approaches that had been found to have fundamental weaknesses, especially when mechanistic interpretability methods repeatedly fall prey to exactly the same points of failure? The failure of SDL latents was unsurprising; the fact that it took so long for someone to call attention to it should have provoked far more discussion of how science is done in this community.
I agree about the similarities to neuroscience, and there is definitely much to learn from that field, but it would be an even easier step to simply read more widely in interpretable/explainable machine learning and causal discovery, where there is a broad literature discussing the very issues you mention and more. Why is research done outside the self-labelled ‘mechanistic interpretability’ community mostly ignored? If you prefer to stay within neuroscience, though, Jonas & Kording (2017) is relevant: Could a Neuroscientist Understand a Microprocessor? | PLOS Computational Biology https://share.google/WYGmCXAnX8FNbaRqi