Chris Olah was absolutely not the first person to discuss the idea of using interpretability to better understand the underlying data-generating process. Most statistical modelling has been driven by that aim, and that way of thinking wasn’t simply abandoned as ML research progressed; Breiman (of random forest fame) discusses it in his 2001 paper on ‘the two cultures’, for example. While a lot of explainability/interpretability research has focussed on understanding the model itself, plenty has been written about using those methods for scientific inference too, and plenty of applied work in various fields tries to do exactly that.
The problem is that trusting results obtained this way relies on two assumptions: that the interpretations are an accurate reflection of the underlying model, and that the model is an accurate reflection of the data-generating process for the phenomenon of interest. I would say the first assumption is almost certainly invalid and the second most probably invalid, given how poorly models behave out-of-distribution (where you would expect a model that had genuinely captured causal structure to remain performant).
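To make that last point concrete, here is a minimal sketch on purely synthetic data (the variable names, the random forest and the exact numbers are all illustrative choices, not a claim about any particular setup): the model learns to exploit a non-causal correlate of the target, predicts well in-distribution, and then degrades sharply once that correlation is broken.

```python
# Sketch: the model can predict y well by exploiting x2, a non-causal
# correlate of y, rather than the true cause x1. When the x2-y relationship
# shifts at test time, performance drops sharply, revealing that the model
# never captured the data-generating process.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def sample(n, shifted):
    x1 = rng.normal(size=n)
    y = 2.0 * x1 + rng.normal(scale=0.5, size=n)      # true cause: x1 only
    # In the training regime x2 closely tracks y; in the shifted regime it is
    # independent noise, so any reliance on it stops paying off.
    x2 = rng.normal(size=n) if shifted else y + rng.normal(scale=0.2, size=n)
    return np.column_stack([x1, x2]), y

X_train, y_train = sample(5000, shifted=False)
X_in, y_in = sample(5000, shifted=False)   # same distribution as training
X_out, y_out = sample(5000, shifted=True)  # x2 no longer tracks y

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"R^2, x2 still tracks y: {r2_score(y_in, model.predict(X_in)):.2f}")
print(f"R^2, x2 decoupled:      {r2_score(y_out, model.predict(X_out)):.2f}")
```

A model that had actually captured the causal dependence on x1 would keep most of its performance under that shift; the gap between the two scores is the symptom I mean.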
Perhaps you’re already aware of all this, and apologies if so, but the fact that you write Olah was possibly the first to mention this suggests to me that you might not be aware of the existing literature on these topics. If you are interested in the issues involved in this avenue of research, look into causal discovery for a more rigorous discussion of why learning causal relationships from real-world observational data is usually impossible, and explore the literature on the lack of robustness, fragility and other challenges of explainable/interpretable ML methods. The literature on the problems with Shapley/SHAP values and feature importances is particularly helpful, since it focusses explicitly on the connection to causal learning.
Finally, if you do research in this way, be very careful to design the methodology so that it includes ways to falsify your results. Check that the trained model responds in realistic ways when you ablate inputs to extreme values, to ensure the relationships captured are plausibly meaningful at all. Include artificial predictors that are correlated with other variables and/or your target variable but are generated after data collection, so they cannot have a causal effect; if those predictors are identified as important, that is a red flag that your results aren’t capturing causal behaviour. Repeat your methodology on different subsets of the data, with different models and parameter settings, and check that the interpretations are robust. Use multiple interpretation methods and compare them. Ideally, start by applying your methodology to synthetic data with similar characteristics and a known data-generating process, and verify that it recovers the right relationships in that simulated setting.
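To make the negative-control suggestion concrete, here is a minimal sketch on synthetic data with a known data-generating process (scikit-learn, a random forest and permutation importance are just illustrative choices; any model and importance measure could be substituted):

```python
# Minimal sketch of the negative-control check, on synthetic data with a
# known data-generating process (y depends on x1 only). All names are
# illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

X = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
y = 2.0 * X["x1"] + rng.normal(scale=0.5, size=n)

# Negative controls, generated after the fact so they cannot be causes of y:
# one tracks a real predictor, one tracks the target itself.
X["control_like_x1"] = X["x1"] + rng.normal(scale=0.3, size=n)
X["control_like_y"] = y + rng.normal(scale=0.3, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# If the controls rank near the top, the importance scores are tracking
# correlation rather than anything causal.
imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, score in sorted(zip(X.columns, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>16s}: {score:.3f}")
```

In this simulated setting the control that tracks the target will dominate the ranking, which is precisely the red flag the check is designed to raise; on real data, any such control scoring highly should make you distrust a causal reading of the importances.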