Fascinatingly, philosopher-of-mind David Chalmers (known for eg the hard problem of consciousness, the idea of p-zombies) has just published a paper on the philosophy of mechanistic interpretability. I’m still reading it, and it’ll probably take me a while to digest; may post more about it at that point. In the meantime this is just a minimal mini-linkpost.
Propositional Interpretability in Artificial Intelligence
(Abstract) I argue for the importance of propositional interpretability, which involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes ... (Page 5) Propositional attitudes can be divided into dispositional and occurrent. Roughly speaking, occurrent attitudes are those that are active at a given time. (In a neural network, these would be encoded in neural activations.) Dispositional attitudes are typically inactive but can be activated. (In a neural network, these would be encoded in the weights.) For example, I believe Paris is the capital of France even when I am asleep and the belief is not active. That is a dispositional belief. On the other hand, I may actively judge France has won more medals than Australia. That is an occurrent mental state, sometimes described as an “occurrent belief”, or perhaps better, as a “judgment” (so judgments are active where beliefs are dispositional). One can make a similar distinction for desires and other attitudes.
I don’t like it. It does not feel like a clean natural concept in the territory to me.
Case in point:
(Page 9) Now, it is likely that a given AI system may have an infinite number of propositional attitudes, in which case a full log will be impossible. For example, if a system believes a proposition p, it arguably dispositionally believes p-or-q for all q. One could perhaps narrow down to a finite list by restricting the log to occurrent propositional attitudes, such as active judgments. Alternatively, we could require the system to log the most significant propositional attitudes on some scale, or to use a search/query process to log all propositional attitudes that meet a certain criterion.
I think what this is showing is that Chalmers’ definition of “dispositional attitudes” has a problem: It lacks any notion of the amount and kind of computational labour required to turn ‘dispositional’ attitudes into ‘occurrent’ ones. That’s why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.
One could try to fix up Chalmers’ definition by making up some notion of computational cost, or circuit complexity, or something of the sort, that’s required to convert a dispositional attitude into an occurrent attitude, and then only list dispositional attitudes up to some cost cutoff c we are free to pick as applications demand.
But I don’t feel very excited about that. At that point, what is this notion of “dispositional attitudes” really still providing us that wouldn’t be less cumbersome to describe in the language of circuits? There, you don’t have this problem. An AI can have a query-key lookup for proposition p and just not have a query-key lookup for the proposition p-or-q. Instead, if someone asks whether p-or-q is true, it first performs the lookup for p, then uses some general circuits for evaluating simple propositional logic to calculate that p-or-q is true. This is an importantly different computational and mental process from having a query-key lookup for p-or-q in the weights and just directly performing that lookup, so we ought to describe a network that does the former differently from a network that does the latter. It does not seem like Chalmers’ proposed log of ‘propositional attitudes’ would do this. It’d describe both of these networks the same way, as having a propositional attitude of believing p-or-q, discarding a distinction between them that is important for understanding the models’ mental states in a way that will let us do things such as successfully predicting the models’ behavior in a different situation.
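To make the distinction concrete, here is a purely illustrative toy sketch (not anything from the paper, and nothing like real transformer internals; all names are made up): network A stores “p or q” directly and answers with a single lookup, while network B stores only “p” and has to run one extra derivation step, so a crude step count stands in for the cost cutoff c above.

    # Purely illustrative: two ways a system can come to "believe" the proposition "p or q".
    beliefs_a = {"p", "p or q"}   # network A: "p or q" is stored directly ("in the weights")
    beliefs_b = {"p"}             # network B: only "p" is stored

    def holds_directly(beliefs, prop):
        """Single lookup, analogous to a direct query-key retrieval."""
        return prop in beliefs, 1                  # (answer, steps)

    def holds_with_derivation(beliefs, prop):
        """Lookup plus one disjunction-introduction step: from p, infer p-or-q."""
        if prop in beliefs:
            return True, 1
        if " or " in prop:
            left, right = prop.split(" or ", 1)
            if left in beliefs or right in beliefs:
                return True, 2                     # one lookup plus one inference step
        return False, 1

    print(holds_directly(beliefs_a, "p or q"))         # (True, 1): direct lookup
    print(holds_with_derivation(beliefs_b, "p or q"))  # (True, 2): derived on demand

A log that only records “believes p-or-q” gives both networks the same entry; the step count is exactly the information it discards.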
I’m all for trying to come up with good definitions for model macro-states which throw away tiny implementation details that don’t matter, but this definition does not seem to me to carve the territory in quite the right way. It throws away details that do matter.
I think I agree that there are significant quibbles you can raise with the picture Chalmers outlines, but in general I think he’s pointing at an important problem for interpretability: it’s not clear what the relationship is between a circuit-level algorithmic understanding and the kinds of statements we would like to rule out (e.g. “this system is scheming against me”).
Agreed that there’s a problem there, but it’s not at all clear to me (as yet) that Chalmers’ view is a fruitful way to address that problem.
I do agree with that, although ‘step 1 is identify the problem’.
I think what this is showing is that Chalmers’ definition of “dispositional attitudes” has a problem: It lacks any notion of the amount and kind of computational labour required to turn ‘dispositional’ attitudes into ‘occurrent’ ones. That’s why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.
This pretty much matches my sense so far, although I haven’t had time to finish reading the whole thing. I wonder whether this is because he’s used to thinking about human brains, where we’re (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.
Clearly ANNs are able to represent propositional content, but I haven’t seen anything that makes me think that’s the natural unit of analysis.
I could imagine his lens potentially being useful for some sorts of analysis built on top of work from mech interp, but not as a core part of mech interp itself (unless it turns out that propositions and propositional attitudes really are the natural decomposition for ANNs, I suppose; but even then it would seem like a happy coincidence rather than something Chalmers has identified in advance).
Clearly ANNs are able to represent propositional content, but I haven’t seen anything that makes me think that’s the natural unit of analysis.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions. We even routinely communicate by uttering “statements”—which express propositions. So propositions are natural to us, which is why they are important for ANN interpretability.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions.
I’m not too confident of this. It seems to me that a lot of human cognition isn’t particularly propositional, even if nearly all of it could in principle be translated into that language. For example, I think a lot of cognition is sensory awareness, or imagery, or internal dialogue. We could contort most of that into propositions and propositional attitudes (eg ‘I am experiencing a sensation of pain in my big toe’, ‘I am imagining a picnic table’), but that doesn’t particularly seem like the natural lens to view those through.
That said, I do agree that propositions and propositional attitudes would be a more useful language to interpret LLMs through than eg activation vectors of float values.
I wonder whether this is because he’s used to thinking about human brains, where we’re (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.
I don’t think this description is philosophically convenient. Believing p and believing things that imply p are genuinely different states of affairs in a sensible theory of mind. Thinking through concrete mech interp examples of the former vs. the latter makes the sense in which they differ less abstract, but I think I would have objected to Chalmers’ definition even back before we knew anything about mech interp. It would just have been harder for me to articulate what exactly is wrong with it.
Something that Chalmers finds convenient, anyhow. I’m not sure how else we could view ‘dispositional beliefs’ if not as a philosophical construct; surely Chalmers doesn’t imagine that ANNs or human brains actively represent ‘p-or-q’ for all possible q.
To be fair here, from an omniscient perspective, believing P and believing things that imply P are genuinely the same thing in terms of results, but from a non-omniscient perspective, the difference matters.
One part that seems confused to me, on chain-of-thought as an interpretability method:
Another limitation [of chain-of-thought] arises from restricted generality. Chains of thought will typically only serve as a propositional interpretability [method] for chain-of-thought systems: systems that use chains of thought for reasoning. For systems that do not themselves use chains of thought, any chains of thought that we generate will play no role in the system. Once chains of thought are unmoored from the original system in this way, it is even more unclear why they should reflect it. Of course we could try to find some way to train a non-chain-of-thought system to make accurate reports of its internal states along the way – but that is just the thought-logging problem all over again, and chains of thought will play no special role.
I have no idea what he’s trying to say here. Is he somehow totally unaware that you can use CoT with regular LLMs? That seems unlikely since he’s clearly read some of the relevant literature (eg he cites Turpin et al’s paper on CoT unfaithfulness). I don’t know how else to interpret it, though—maybe I’m just missing something?
Regular LLMs can use chain-of-thought reasoning. He is speaking about generating chains of thought for systems that don’t use them. E.g. AlphaGo, or diffusion models, or even an LLM in cases where it didn’t use CoT but produced the answer immediately.
As an example, you ask an LLM a question, and it answers it without using CoT. Then you ask it to explain how it arrived at its answer. It will generate something for you that looks like a chain of thought. But since it wasn’t literally using it while producing its original answer, this is just an after-the-fact rationalization. It is questionable whether such a post-hoc “chain of thought” reflects anything the model was actually doing internally when it originally came up with the answer. It could be pure confabulation.
Your first paragraph makes sense as an interpretation, which I discounted because the idea of something like AlphaGo doing CoT (or applying a CoT to it) seems so nonsensical, since it’s not at all a linguistic model.
I’m having more trouble seeing how to read what Chalmers says in the way your second paragraph suggests—eg ‘unmoored from the original system’ doesn’t seem like it’s talking about the same system generating an ad hoc explanation. It’s more like he’s talking about somehow taking a CoT generated by one model and applying it to another, although that also seems nonsensical.
If you want to understand why a model, any model, did something, you presumably want a verbal explanation of its reasoning, a chain of thought. E.g. why AlphaGo made its famous unexpected move 37. That’s not just true for language models.
Sure, I agree that would be useful.