Hi! Interesting post!
I have a question: to what extent are “mechanistic explanations” (whatever they are) expected to be relative to, or valid only for, specific datasets/regions of “data space”?
Could it happen that (given enough computing power) you can cover the set of all relevant data with small subsets, such that there is a valid mechanistic explanation for each small subset, but no global one?
What would this mean for AI safety? Are there people thinking about this potential issue?
For context: I’m a mathematician, with more experience in algebra than in analysis, currently thinking about this question using sheaves and (hopefully) cohomology.
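Concretely, here is a minimal sketch of the picture I have in mind (the presheaf $\mathcal{E}$, the group $G$, and the cocycle condition below are my own hypothetical placeholders, not anything established in the interpretability literature):

```latex
% A minimal sketch of the local-vs-global picture, as a hypothetical
% formalization. The presheaf \mathcal{E} and the group G are placeholders,
% not established objects from the interpretability literature.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}

Let $X$ be ``data space'' and let $\{U_i\}_{i \in I}$ be a cover of $X$ by
small subsets. Tentatively define a presheaf
\[
  \mathcal{E}(U) \;=\; \{\text{mechanistic explanations valid on } U\},
\]
with restriction maps $\mathcal{E}(U) \to \mathcal{E}(V)$ for $V \subseteq U$.
The scenario in the question is a family $e_i \in \mathcal{E}(U_i)$ of local
explanations with no single $e \in \mathcal{E}(X)$ restricting to all of them.

One classical mechanism for this: suppose on each overlap the local
explanations agree only up to a ``change of description''
$g_{ij} \in G(U_i \cap U_j)$ for some sheaf of groups $G$, i.e.
\[
  e_i\big|_{U_i \cap U_j} \;=\; g_{ij} \cdot e_j\big|_{U_i \cap U_j},
  \qquad
  g_{ij}\, g_{jk} = g_{ik} \ \text{on } U_i \cap U_j \cap U_k .
\]
Then $(g_{ij})$ is a \v{C}ech $1$-cocycle, and (assuming $\mathcal{E}$ glues
sections that agree exactly on overlaps) a global explanation exists precisely
when the class of $(g_{ij})$ in $\check{H}^1(\{U_i\}, G)$ is trivial:
cohomology measures the obstruction to gluing local explanations into a
global one.

\end{document}
```

The point of the sketch is just that the “local explanations but no global one” scenario has a standard shape: the obstruction to gluing would be a cohomology class, and that is what I am hoping to make precise.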