Reasons for my pessimism about mechanistic interpretability.
Epistemic status: I’ve noticed most AI safety folks seem more optimistic about mechanistic interpretability than I am. This is just a quick list of reasons for my pessimism. Note that I don’t have much direct experience with mech interp, and this is more of a rough brain dump than a well-thought-out take.
Interpretability just seems harder than people expect
For example, GDM recently decided to deprioritize SAEs. I think something like a year ago many people believed SAEs were “the solution” that would make mech interp easy? Pivots are normal and should be expected, but in short timelines we can’t really afford many of them.
So far mech interp results have an “exists” quantifier. We might need an “all” quantifier for safety.
What we have: “This is a statement about horses, and you can see that this horse feature here is active”
What we need: “Here is how we know whether a statement is about horses or not”
Note that these two things might be pretty far apart. Consider e.g. “we know this substance causes cancer” versus “we can tell for an arbitrary substance whether it causes cancer or not”.
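One rough way to write the gap down, as a sketch only: the notation is made up for illustration. Let $H(x)$ mean “statement $x$ is about horses”, let $f(x)$ be the activation of the candidate horse feature, and let $\tau$ be some firing threshold.

```latex
% What we have (existential): some horse statement on which the feature fires.
\exists x :\; H(x) \,\wedge\, f(x) > \tau

% What we need (universal): the feature fires exactly on the horse statements.
\forall x :\; H(x) \iff f(x) > \tau
```

The first claim is settled by a single example. The second quantifies over every possible input, so no finite set of examples can establish it.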
No specific plans for how mech interp will help
Or maybe there are some and I don’t know them?
Anyway, I feel people often say something like “If we find the deception feature, we’ll know whether models are lying to us, therefore solving deceptive alignment”. This makes sense, but how will we know whether we’ve really found the deception feature?
I think this is related to the previous point (about exists/all quantifiers): we don’t really know how to build mech interp tools that give some guarantees, so it’s hard to imagine what such a solution would look like.
Future architectures might make interpretability harder
I think ASI probably won’t be a simple non-recurrent transformer. No strong justification here—just the very rough “It’s unlikely we found something close to the optimal architecture so early”. This leads to two problems:
Our established methods might no longer work, and we might have too little time to develop new ones
The new architecture will likely be more complex, and thus mech interp might get harder
Is there a good reason to believe interpretability of future systems will be possible?
The fact that things like steering vectors or linear probes or SAEs-with-some-reasonable-width somewhat work is a nice feature of the current systems. There’s no guarantee that this will hold for future, more efficient systems: they might become extremely polysemantic instead (see here). This is not a given, though: maybe an optimal network learns to operate on something like natural abstractions, or maybe we’ll favor interpretable architectures over slightly-more-efficient-but-uninterpretable ones.
----
That’s it. Critical comments are very welcome.
I lay out my thoughts on this in section 6.6 of the DeepMind AGI safety approach. I broadly agree that guarantees seem pretty doomed. But I also think guarantees just seem obviously doomed for all methods, because neural networks are far too complex, and we’ll need to have plans that do not depend on them. I think there are much more realistic use cases for interpretability than aiming for guarantees.
I’ve talked to a lot of people about mech interp so I can enumerate some counterarguments. Generally I’ve been surprised by how well people in AI safety can defend their own research agendas. Of course, deciding whether the counterarguments outweigh your arguments is a lot harder than just listing them, so that’ll be an exercise for readers.
I think researchers already believe this. Recently I read https://www.darioamodei.com/post/the-urgency-of-interpretability, and in it, Dario expects mech interp to take 5-10 years before it’s as good as an MRI.
Forall quantifiers are nice, but a lot of empirical sciences like medicine or economics have been pretty successful without them. We don’t really know how most drugs work, and the only way to soundly disprove a claim like “this drug will cause people to mysteriously drop dead 20 years later” is to run a 20-year study. We approve new drugs in less than 20 years, and we haven’t mysteriously dropped dead yet.
Similarly we can do a lot in mech interp to build confidence without any forall quantifiers, like building deliberately misaligned models and seeing if mech interp techniques can find the misalignment.
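A minimal sketch of what such a check could look like, assuming we have a handful of models where we know the ground truth (because we planted the misalignment ourselves) and some interp-based detector that outputs a score per model. Everything here is hypothetical: the `Trial` record, the `evaluate_detector` helper, the threshold, and the scores, which are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One test model: whether we planted misalignment, and the detector's score."""
    planted: bool  # ground truth, known because we built the model ourselves
    score: float   # hypothetical output of an interp-based detector


def evaluate_detector(trials, threshold):
    """Return (detection rate on planted models, false-alarm rate on clean models)."""
    planted = [t for t in trials if t.planted]
    clean = [t for t in trials if not t.planted]
    if not planted or not clean:
        raise ValueError("need at least one planted and one clean model")
    hits = sum(t.score > threshold for t in planted)
    false_alarms = sum(t.score > threshold for t in clean)
    return hits / len(planted), false_alarms / len(clean)


# Made-up scores, purely illustrative.
trials = [
    Trial(planted=True, score=0.9),
    Trial(planted=True, score=0.7),
    Trial(planted=True, score=0.2),   # a miss: the detector did not flag this one
    Trial(planted=False, score=0.1),
    Trial(planted=False, score=0.6),  # a false alarm
]

detection_rate, false_alarm_rate = evaluate_detector(trials, threshold=0.5)
print(f"detection rate: {detection_rate:.2f}, false alarm rate: {false_alarm_rate:.2f}")
```

This gives statistical confidence rather than a guarantee, and it only transfers to real models to the extent that the planted misalignment resembles the kind that would arise naturally.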
No specific plans
The people I’ve talked to believe interp will be generally helpful for all types of plans, and I haven’t heard anything specific either. Here’s a specific plan I made up. Hopefully it doesn’t suck.
Basically just combine prosaic alignment and mech interp. This might sound stupid on paper, most forbidden technique and whatnot, but using mech interp we can continuously make misalignment harder to get away with, and keep that difficulty above the capability levels of frontier AIs. This might not work long-term, but all long-term alignment plans seem like moonshots right now, and we’ll have much better ideas later on when we know more about AIs (e.g. after we solve mech interp!).
Future architectures might be different
Transformers haven’t changed much in the past 7 years, and big companies have already invested a ton of money into transformer-specific performance optimizations. I just talked to some guys at a startup who spent hundreds of millions building a chip that can only run transformer inference. I think lots of people believe transformers will be around for a while. Also, it’s somewhat of a self-fulfilling prophecy because new architectures now have to compete against hyperoptimized transformers, not just regular transformers.
I’ve been saying this for two years: https://www.lesswrong.com/posts/RTmFpgEvDdZMLsFev/mechanistic-interpretability-is-being-pursued-for-the-wrong
Though I have updated somewhat since then: I am slightly more enthusiastic about weak forms of the linear representation hypothesis, but LESS confident that we’ll learn any useful algorithms through mech interp (because I’ve seen how tricky it is to find algorithms we already know in simple transformers).