Understanding deep learning isn’t a leaderboard sport—handle with care.
Saliency maps, neuron dissection, sparse autoencoders—each surged on hype, then stalled[1] when follow‑up work showed the insight was mostly noise, easily spoofed, or valid only in cherry‑picked settings. That risks being negative progress: we spend cycles debunking ghosts instead of building cumulative understanding.
The root mismatch is methodological. Mainstream ML capabilities research enjoys a scientific luxury almost no other field gets: public, quantitative benchmarks that tie effort to ground truth. ImageNet accuracy, MMLU, SWE‑bench—one number silently kills bad ideas. With that safety net, you can iterate fast on weak statistics and still converge on something useful. Mechanistic interpretability has no scoreboard for “the network’s internals now make sense.” Implicitly inheriting benchmark‑reliant habits from mainstream ML therefore swaps a ruthless filter for a fog of self‑deception.
How easy is it to fool ourselves? Recall the “Could a neuroscientist understand a microprocessor?” study: standard neuroscience toolkits—ablation tests, tuning curves, dimensionality reduction—were applied to a 6502 chip whose ground truth is fully known. The analyses produced plausible‑looking stories that entirely missed how the processor works. Interpretability faces the same trap: shapely clusters or sharp heat‑maps can look profound until a stronger test dissolves them.
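To make the trap concrete, here is a small illustration of my own (not from the study): cluster pure noise with a standard toolkit, then "validate" the clusters with a test that is circular by construction. The data, cluster count, and chosen feature below are all arbitrary assumptions for the demo.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # isotropic noise: there are no real clusters

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# "Validation" 1: pick the feature the clusters differ on most and run an ANOVA.
# The p-value is tiny, because k-means chose the labels to separate the data.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(4)])
j = centroids.var(axis=0).argmax()
groups = [X[labels == k, j] for k in range(4)]
print("ANOVA p-value across clusters:", f_oneway(*groups).pvalue)

# "Validation" 2: a measure of how well-separated the clusters actually are
# tells a much duller story.
print("silhouette score (1.0 = clean clusters):", silhouette_score(X, labels))
```

The spectacular p-value is pure selection effect; the stronger test (here, simply asking how separated the clusters really are) dissolves the finding.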
What methodological standard should replace the leaderboard? Reasonable researchers will disagree[2]. Borrowing from mature natural sciences like physics or neuroscience seems like a sensible default, but a proper discussion is beyond this note. The narrow claim is simpler:
Because no external benchmark will catch your mistakes, you must design your own guardrails. Methodology for understanding deep learning is an open problem, not a hand‑me‑down from capabilities work.
So, before shipping the next clever probe, pause and ask: Where could I be fooling myself, and what concrete test would reveal it? If you don’t have a clear answer, you may be sprinting without the safety net this methodology assumes—and that’s precisely when caution matters most.
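One concrete shape such a test can take (a minimal sketch of mine, not a prescription): rerun the same analysis on a control model that never learned the task, e.g. one with random weights, and check whether the "finding" survives. The toy task, architecture, and probe below are assumptions chosen only to keep the example self-contained.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

# Toy binary task: the label is the sign of the sum of the first four features.
X = torch.randn(4000, 16)
y = (X[:, :4].sum(dim=1) > 0).long()

def make_model():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(X), y).backward()
        opt.step()
    return model

def probe_accuracy(model):
    # Fit a linear probe on the hidden-layer activations, score on held-out points.
    with torch.no_grad():
        H = model[1](model[0](X)).numpy()
    tr, te = slice(0, 3000), slice(3000, 4000)
    clf = LogisticRegression(max_iter=2000).fit(H[tr], y.numpy()[tr])
    return clf.score(H[te], y.numpy()[te])

trained = train(make_model())
control = make_model()  # identical architecture, never trained

print("probe accuracy, trained model:        ", probe_accuracy(trained))
print("probe accuracy, random-weight control:", probe_accuracy(control))
# If the control scores nearly as well, high probe accuracy on the trained model
# says more about the probe than about what the network learned.
```

In setups like this the control often does surprisingly well, which is exactly the kind of result that should temper claims of the form "the representation encodes X".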
This is perhaps a bit harsh—I think SAEs, for instance, might still hold some promise, and neuron-based analysis still has its place, but I think it’s fair to say the hype got quite ahead of itself.
Exactly how much methodological caution is warranted here will obviously be a point of contention. Everyone thinks the people going faster than them are reckless and the people going slower are needlessly worried. My point here is just to think actively about the question—don’t just blindly inherit standards from ML.
Why not require model organisms with known ground truth and see whether the methods accurately recover that ground truth, as in the paper? From the abstract of that paper:
Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.
Yeah, I completely agree this is a good research direction! My only caveat is that I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this; I’m just explaining my position here). The inevitable problem with interpretability benchmarks (which, to be clear, your paper appears to make a serious effort to address) is that you either:
Train the model in a realistic way—but then you don’t know if the model really learned the algorithm you expected it to
Train the model to force it to learn a particular algorithm—but then you have to worry about how realistic that training method, or the algorithm you forced it to learn, really is
This doesn’t seem like an unsolvable problem to me, but it does mean that (unlike capabilities benchmarks) you can’t have the level of trust in your benchmarks that allows you to bypass traditional scientific hygiene. In other words, there are enough subtle strings attached here that you can’t just naively try to “make number go up” on InterpBench in the same way you can with MMLU.
I probably should emphasize I think trying to bring interpretability into contact with various “ground truth” settings seems like a really high value research direction, whether that be via modified training methods, toy models where possible algorithms are fairly simple (e.g. modular addition), etc. I just don’t think it changes my point about methodological standards.
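For concreteness, the simplest version of such a ground-truth setting looks something like the sketch below (my own toy example, not InterpBench): train a small network on (a + b) mod p, a task whose target algorithm is fully specified in advance. The architecture and hyperparameters are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
p = 53

# Every pair (a, b) with label (a + b) mod p: the task's ground truth is known exactly.
a = torch.arange(p).repeat_interleave(p)
b = torch.arange(p).repeat(p)
y = (a + b) % p
X = torch.cat([F.one_hot(a, p), F.one_hot(b, p)], dim=1).float()

# Hold out 30% of pairs to check generalization rather than memorization.
perm = torch.randperm(p * p)
cut = int(0.7 * p * p)
train_idx, test_idx = perm[:cut], perm[cut:]

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for step in range(5000):
    opt.zero_grad()
    F.cross_entropy(model(X[train_idx]), y[train_idx]).backward()
    opt.step()

with torch.no_grad():
    acc = (model(X[test_idx]).argmax(dim=-1) == y[test_idx]).float().mean().item()
print(f"held-out accuracy: {acc:.3f}")
# Knowing the task is not the same as knowing the learned algorithm: the caveat
# above (which of several possible algorithms did training actually find?) still applies.
```

The leverage comes from what you can do next: any claim an interpretability method makes about this model can be checked against exhaustive evaluation over all p² input pairs and against the known task structure, which is exactly what the realistic setting lacks.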
Research thrives on answering important questions. However, the trouble with interpretability for AI safety is that no important questions are getting answered. Typically the real goal is to understand the neural networks well enough to know whether they are scheming, but that is a bad idea for three reasons:
You cannot make incremental progress on it; either you know whether they are scheming, or you don’t
Scheming is not the main AI danger/x-risk
Interpretability is not a significant bottleneck in detecting scheming (we don’t even have good, accessible examples of contexts where AI is applied and scheming would be a huge risk)
To solve this, people substitute various proxy goals, e.g. predictive accuracy, under the assumption that incremental gains in predictive accuracy are helpful. But we already have a perfectly adequate way of predicting the behavior of the neural networks: it’s called running the neural networks.