So basically they ran into the problem of misrepresentation: it seems possible for A to be about B even when B doesn’t exist, for example. If B doesn’t exist, then there is no mutual information or causal stream between it and A! And yet we are able to think about Santa Claus, or to think false things like that the world is flat, etc. This capacity to get things wrong seems essential to “real” mental content, and correlations with some physical process don’t seem able to capture it.
If you were to say “Santa Claus is a green furry creature which hates fun”, I think you would be wrong. The reason I think that is that the concept of Santa Claus refers to the cultural context Santa Claus comes from: if you believe that Santa Claus is green and furry, you will make bad predictions about what other people will say about Santa Claus. There is mutual information between your beliefs about A (Santa Claus) and the thing those beliefs are about (the cultural idea of Santa Claus).
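As an aside, that last sentence can be made quantitative. Here is a minimal sketch with entirely made-up toy numbers (not anything from the original discussion): if we model “your belief about Santa’s color” and “what a random person will say about Santa” (a stand-in for the cultural idea here) as correlated random variables, the mutual information between them comes out positive, which is the sense in which your belief carries information about the cultural idea.

```python
import math

# Toy joint distribution (made-up numbers, purely illustrative) over:
#   your belief B about Santa's color  x  what a random person P will say.
# The only point: B and P are statistically dependent, so I(B; P) > 0.
joint = {
    ("believe_red",   "says_red"):   0.40,
    ("believe_red",   "says_green"): 0.10,
    ("believe_green", "says_red"):   0.10,
    ("believe_green", "says_green"): 0.40,
}

def marginal(joint, idx):
    """Sum the joint distribution over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[idx]] = out.get(key[idx], 0.0) + p
    return out

p_b = marginal(joint, 0)   # marginal over your belief
p_p = marginal(joint, 1)   # marginal over what the person says

# I(B; P) = sum over (b, s) of p(b, s) * log2( p(b, s) / (p(b) * p(s)) )
mi = sum(p * math.log2(p / (p_b[b] * p_p[s])) for (b, s), p in joint.items())
print(f"I(B; P) = {mi:.3f} bits")  # positive: your belief predicts what people say
```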
Likewise, when you have a belief like “the square of the length of the hypotenuse of a right triangle is equal to the sum of the squares of the lengths of the legs”, that belief is associated with a bunch of beliefs about what your future observations would be if you wrote math stuff down on paper or did clever things in the world with paper folding.
In the case of looking at SAE (sparse autoencoder) features within a language model, interpretability researchers might say things like “Feature F#34M/31164353 is about the Golden Gate Bridge”, but I think that’s shorthand for a bunch of interlinked predictions about what their future observations would be if they e.g. ablated the feature and then asked a question for which the answer was the Golden Gate Bridge, or amplified the feature and asked the model about random topics.
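To make “shorthand for interlinked predictions” concrete, here is a minimal sketch of that kind of intervention. The weights and the feature index are toy stand-ins, not any particular lab’s API or a real SAE: encode activations into feature space, zero out or scale up one feature, reconstruct, and the predictions are about how the model’s answers would change if the edited activations were patched back in.

```python
import torch

# Toy stand-ins: in real interpretability work these would be a trained
# sparse autoencoder (SAE) and a transformer's residual-stream activations.
# Everything here is hypothetical and only illustrates the intervention.
d_model, d_features = 16, 64
torch.manual_seed(0)
W_enc = torch.randn(d_model, d_features)   # SAE encoder weights (toy)
W_dec = torch.randn(d_features, d_model)   # SAE decoder weights (toy)

def sae_features(resid):
    """Encode a residual-stream vector into sparse feature activations."""
    return torch.relu(resid @ W_enc)

def sae_reconstruct(feats):
    """Decode feature activations back into a residual-stream vector."""
    return feats @ W_dec

def intervene(resid, feature_idx, scale):
    """Ablate (scale=0) or amplify (scale>1) one feature, then reconstruct."""
    feats = sae_features(resid)
    feats[..., feature_idx] = feats[..., feature_idx] * scale
    return sae_reconstruct(feats)

resid = torch.randn(d_model)        # pretend residual-stream activation
GOLDEN_GATE_FEATURE = 31            # hypothetical feature index

ablated   = intervene(resid, GOLDEN_GATE_FEATURE, scale=0.0)
amplified = intervene(resid, GOLDEN_GATE_FEATURE, scale=10.0)

# The "aboutness" claim cashes out as predictions like: with `ablated`
# patched into the forward pass, the model stops answering "Golden Gate
# Bridge"; with `amplified` patched in, it brings the bridge up unprompted.
print(ablated.shape, amplified.shape)
```

In a real experiment the edited vector would be patched into the transformer’s forward pass (e.g. via a forward hook), and the researcher would then check whether the model stops, or starts, talking about the Golden Gate Bridge.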
Historically the cases I’ve found most confusing are those where I’m trying to figure out how a belief is “about” the outside world in a way that is unmediated by my expected future observations. Tentatively, I hypothesize that I have no beliefs that are directly “about” the real world, only beliefs about how my past and current observations will impact my future observations.
(This is much closer to my own take, and is something Steve and I argue about occasionally.)