Research thrives on answering important questions. However, the trouble with interpretability for AI safety is that no important questions are getting answered. Typically the real goal is to understand neural networks well enough to know whether they are scheming, but that is a bad idea for three reasons:
1. You cannot make incremental progress on it: either you know whether they are scheming or you don't.
2. Scheming is not the main AI danger or x-risk.
3. Interpretability is not a significant bottleneck in detecting scheming (we don't even have good, accessible examples of deployment contexts where scheming would be a serious risk).
To work around this, people substitute various proxy goals, e.g. predictive accuracy, on the assumption that incremental gains in predictive accuracy are helpful. But we already have a perfectly adequate way of predicting the behavior of neural networks: it's called running the neural networks.
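To make that last point concrete, here is a minimal sketch in PyTorch. The toy model, sizes, and function names are hypothetical stand-ins chosen for illustration, not anything from a real interpretability pipeline: the trivial predictor of a network's behavior on a given input is the network itself, run forward, and that is the baseline any interpretability-derived predictor would have to beat.

```python
# Minimal sketch: the baseline "predictor" of a network's behavior is the
# network itself, run forward. Everything here is a toy stand-in.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for "the neural network" whose behavior we want to predict.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

def predict_behavior(inputs: torch.Tensor) -> torch.Tensor:
    """'Predict' the model's outputs on these inputs by... running the model."""
    with torch.no_grad():
        return model(inputs)

x = torch.randn(8, 16)                 # a batch of hypothetical inputs
predicted = predict_behavior(x)        # the trivial prediction
with torch.no_grad():
    actual = model(x)                  # the actual behavior

# The trivial predictor is exact, so raw predictive accuracy is a low bar.
print(torch.allclose(predicted, actual))  # True
```

The point is not that this is useful; it is that "predictive accuracy" on its own is a bar the network clears simply by existing, so it cannot be the thing that makes interpretability valuable.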