LLMs (and probably most NNs) have lots of meaningful, interpretable linear feature directions. These can be found through various unsupervised methods (e.g. SAEs) and supervised methods (e.g. linear probes).
However, most human-interpretable features are not what I would call the model's true features.
If you find the true features, the network should look sparse and modular, up to noise factors. If you find the true network decomposition, then removing what's left over should improve performance, not make it worse.
Because the network has a limited number of orthogonal directions, there will be interference terms that the network would like to remove, but can't. A real network decomposition will be everything except this noise.
This is what I think mech-interp should be looking for.
It’s possible that I’m wrong and there is no such thing as “the network’s true features”. But we’ve (humans collectively) only just started this research agenda. The fact that we haven’t found it yet is not much evidence either way.
This is basically what the circuit sparsity agenda is aiming for.
This paper is on my to-read list :)
Reality has a surprising amount of detail[1]. If the training objective is improved by better modeling the world, and the model does not have enough parameters to capture all of the things about the world which would help reduce loss, the model will learn lots of the incidental complexities of the world. As a concrete example, I can ask something like
What is the name of the stadium in Rome at the confluence of two rivers, next to the River Walk Marriott? Answer from memory.
and the current frontier models know enough about the world that they can, without tools or even any substantial chain of thought, correctly answer that trick question[2]. To be able to answer questions like this from memory, models have to know lots of geographical details about the world.
Unless your technique for extracting a sparse modular world model produces a resulting world model which is larger than the model it came from, I think removing the things which are noise according to your sparse modular model will almost certainly hurt performance on factual recall tasks like this one.
[1] See the essay by that name for some concrete examples.
[2] The trick is that there is a second city named Rome in the United States, in the state of Georgia. Both Romes contain a confluence of two rivers, both contain river walks, both contain Marriotts, both contain stadiums, but only the Rome in the US contains a stadium at the confluence of two rivers next to a Marriott named for its proximity to the river.
I do expect some amount of superposition, i.e. the model is using almost-orthogonal directions to encode more concepts than it has neurons. Depending on what you mean by “larger”, this will result in a world model that is larger than the network. However, such an encoding will also result in noise. Superposition will necessarily lead to unwanted small-amplitude connections between uncorrelated concepts. Removing these should improve performance, and if it doesn't, it means that you did the decomposition wrong.
Why do you think finding the true features should make the network look sparse and modular?
You might want to look at “Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations” — they’re trying to optimize for what you describe, and apparently had some success. Crosscoders are another attempt.