Off the top of my head, not very well-structured:
It seems the core thing we want our models to handle here is the concept of approximation errors, no? The “horse” symbol has mutual information with the approximation of a horse; the Santa Claus feature existing corresponds to something approximately like Santa Claus existing. The approach of “features are chiseled into the model/mind by an imperfect optimization process to fulfil specific functions” is then one way to start tackling this approximation problem. But it kind of just punts all the difficult parts onto “what the optimization landscape looks like”.
Namely: the needed notion of approximation is pretty tricky to define. What are the labels of the dimensions of the space in which errors are made? What is the “topological picture” of these errors?
We’d usually formalize it as something like “this feature activates on all images within ϵ MSE distance of horse-containing images”. And indeed, that seems to work well for the “horse vs cow-at-night” confusion.
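A minimal sketch of that formalization (the reference images, the pixel-space MSE metric, and the threshold `eps` are all hypothetical stand-ins, not anyone's actual feature):

```python
import numpy as np

def horse_feature_activates(image, horse_images, eps):
    """Toy 'horse' feature: fires iff `image` lies within eps MSE
    (pixel space) of at least one known horse-containing image."""
    mses = [np.mean((image - ref) ** 2) for ref in horse_images]
    return min(mses) <= eps

# A cow photographed at night can land inside that eps-ball and trigger
# the feature too: the "horse vs cow-at-night" confusion.
```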
But consider Santa Claus. That feature “denotes” a physical entity. Yet, what it {responds to}/{is formed because of} are not actual physical entities that are approximately similar to Santa Claus, or look like Santa Claus. Rather, it’s a sociocultural phenomenon, which produces sociocultural messaging patterns that are pretty similar to sociocultural messaging patterns which would’ve been generated if Santa Claus existed[1].
If we consider a child fooled into believing in Santa Claus, what actually happened there is something like:
“What adults tell you about the physical world” is usually correlated with what actually exists in the physical world.
The child learns a model that maps adults’ signaling patterns onto world-model states.
The world-states “Santa Claus exists” and “there’s a grand adult conspiracy to fool you into believing that Santa Claus exists” correspond to fairly similar adult signaling patterns.
At this step, we’re potentially again working with something neat like “vectors encoding adults’ signaling patterns”, and the notion of similarity is e. g. cosine similarity between those vectors.
If the child’s approximation error is large enough that they fail to distinguish between those signaling patterns, they pick the feature to learn based on e. g. the simplicity prior, and get fooled (see the sketch after this list).
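A toy numeric version of those last two steps (the vectors, the threshold, and the description-length “simplicity prior” are all made-up placeholders):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical encodings of the adult signaling patterns predicted by two world-states.
santa_exists = np.array([0.9, 0.8, 0.7, 0.1])      # "Santa Claus exists"
adult_conspiracy = np.array([0.9, 0.8, 0.6, 0.2])  # "grand adult conspiracy"

eps = 0.01  # the child's approximation error in this learned encoding
if cosine_similarity(santa_exists, adult_conspiracy) > 1 - eps:
    # Indistinguishable at this resolution: fall back on a simplicity prior,
    # crudely proxied here by the description length of each hypothesis.
    hypotheses = {
        "Santa Claus exists": santa_exists,
        "grand adult conspiracy to fake Santa Claus": adult_conspiracy,
    }
    chosen = min(hypotheses, key=len)  # the shorter story wins
    print("Child learns:", chosen)     # -> Santa Claus exists
```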
Going further, consider ghosts. Imagine ghost hunters equipped with a bunch of paranormal-investigation tools. They do some investigating and conclude that their readings are consistent with “there’s a ghost”. The issue isn’t merely that the distance between the “there’s a ghost” and “there’s no ghost” tool-output-vectors is so small that the former fall within the approximation error of the latter. The issue is that the ghost hunters learned a completely incorrect model, one which maps some tool-outputs that don’t, in reality, correspond to ghosts existing onto ghosts existing.
Which, in turn, presumably happened because they’d previously confused the sociocultural messaging pattern of “tons of people are fooled into thinking these tools work” with “these tools work”.
Which sheds some further light on the Santa Claus example too. Our sociocultural messaging about Santa Claus is not actually similar to the messaging in the counterfactual where Santa Claus really existed[2]. It’s only similar in the deeply incomplete children’s models of how those messaging patterns work...
Summing up, I think a merely correlational definition can still be made to work, as long as you:
Assume that feature vectors activate in response to approximations of their referents.
Assume that the approximation errors can lie in learned abstract encodings (“ghost-hunting tools’ outputs”, “sociocultural messaging patterns”), not only in default encodings (“token embeddings of words”), with likewise-learned custom similarity metrics.
Assume that learned abstract encodings can themselves be incorrect, due to approximation errors in previously learned encodings, in a deeply compounding way...
… such that some features don’t end up “approximately corresponding” to any actually existing phenomenon. (Like, the ground-truth prediction error between similar-sounding models is unbounded in the general case: approximately similar-sounding models don’t produce approximately similarly correct predictions, or even predictions that live in the same sample space. See the toy sketch below.)
… Or something like that.
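And, for the parenthetical above: one toy way to see the unboundedness is two “models” whose single parameter differs in the ninth decimal place, iterated forward (the logistic map is chosen purely for convenience; nothing here is specific to the setup above):

```python
import numpy as np

def trajectory(r, x0=0.2, steps=60):
    """Iterate x -> r*x*(1-x); a stand-in for 'running a model forward'."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return np.array(xs)

# Two models that "sound" almost identical as descriptions...
a = trajectory(r=3.9)
b = trajectory(r=3.9 + 1e-9)

# ...yet produce order-1 different predictions within a few dozen steps.
print(np.max(np.abs(a - b)))
```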
[1] Well, not quite, but bear with me.
[2] E. g., there’d be more fear about the eldritch magical entity manipulating our children.