(Abstract) I argue for the importance of propositional interpretability, which involves interpreting a system’s mechanisms and behavior in terms of propositional attitudes ... (Page 5) Propositional attitudes can be divided into dispositional and occurrent. Roughly speaking, occurrent attitudes are those that are active at a given time. (In a neural network, these would be encoded in neural activations.) Dispositional attitudes are typically inactive but can be activated. (In a neural network, these would be encoded in the weights.) For example, I believe Paris is the capital of France even when I am asleep and the belief is not active. That is a dispositional belief. On the other hand, I may actively judge France has won more medals than Australia. That is an occurrent mental state, sometimes described as an “occurrent belief”, or perhaps better, as a “judgment” (so judgments are active where beliefs are dispositional). One can make a similar distinction for desires and other attitudes.
(Page 9) Now, it is likely that a given AI system may have an infinite number of propositional attitudes, in which case a full log will be impossible. For example, if a system believes a proposition p, it arguably dispositionally believes p-or-q for all q. One could perhaps narrow down to a finite list by restricting the log to occurrent propositional attitudes, such as active judgments. Alternatively, we could require the system to log the most significant propositional attitudes on some scale, or to use a search/query process to log all propositional attitudes that meet a certain criterion.
I think what this is showing is that Chalmers’ definition of “dispositional attitudes” has a problem: it lacks any notion of the amount and kind of computational labour required to turn ‘dispositional’ attitudes into ‘occurrent’ ones. That’s why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.
One could try to fix up Chalmers’ definition by introducing some notion of computational cost, or circuit complexity, or something of the sort, required to convert a dispositional attitude into an occurrent one, and then only list dispositional attitudes up to some cost cutoff c that we are free to pick as applications demand.
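To make that repair concrete, here is a minimal sketch, assuming we somehow had such a cost measure in hand (every name below is a made-up placeholder, and actually defining the cost function is the hard part):

```python
# Hypothetical sketch of the "cost cutoff" repair: count a proposition as a
# dispositional attitude only if deriving it from what the model stores directly
# costs at most c. The candidate enumeration and the cost function are placeholders.
def dispositional_attitudes(candidates, derivation_cost, c):
    """Return the candidate propositions whose derivation cost is within the cutoff c."""
    return [p for p in candidates if derivation_cost(p) <= c]

# Toy costs: "p" is stored directly (cost 0), "p or q" takes one inference step,
# a longer derived disjunction takes more.
toy_costs = {"p": 0.0, "p or q": 1.0, "((p or q) or r) or s": 3.0}
print(dispositional_attitudes(toy_costs, toy_costs.get, c=1.0))  # ['p', 'p or q']
```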
But I don’t feel very excited about that. At that point, what is this notion of “dispositional attitudes” really still providing us that wouldn’t be less cumbersome to describe in the language of circuits? There, you don’t have this problem. An AI can have a query-key lookup for the proposition p and simply not have a query-key lookup for the proposition p-or-q. Instead, if someone asks whether p-or-q is true, it first performs the lookup for p, then uses some general circuits for evaluating simple propositional logic to calculate that p-or-q is true. This is an importantly different computational and mental process from having a query-key lookup for p-or-q in the weights and directly performing that lookup, so we ought to describe a network that does the former differently from a network that does the latter. It does not seem like Chalmers’ proposed log of ‘propositional attitudes’ would do this. It would describe both of these networks the same way, as having the propositional attitude of believing p-or-q, discarding a distinction between them that matters for understanding the model’s mental state well enough to do things such as successfully predict its behavior in a different situation.
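For concreteness, here is a deliberately toy contrast between the two mechanisms (plain Python dictionaries standing in for weight-level lookups; nothing here is real transformer machinery, and all names are made up):

```python
# Both "networks" answer that p-or-q is true, but via importantly different mechanisms.

class DirectLookupNet:
    """Stores "p or q" directly and answers by a single lookup."""
    def __init__(self):
        self.stored_facts = {"p or q": True}

    def query(self, proposition: str) -> bool:
        return self.stored_facts.get(proposition, False)


class DeriveFromPNet:
    """Stores only "p" and derives "p or q" with a general disjunction rule."""
    def __init__(self):
        self.stored_facts = {"p": True}

    def query(self, proposition: str) -> bool:
        if proposition in self.stored_facts:      # direct lookup succeeds for "p"
            return self.stored_facts[proposition]
        if " or " in proposition:                 # general propositional-logic step
            left, right = proposition.split(" or ", 1)
            return self.query(left) or self.query(right)
        return False


# A log that only records "believes p-or-q" gives both networks the same entry;
# a circuit-level description distinguishes them.
assert DirectLookupNet().query("p or q") and DeriveFromPNet().query("p or q")
```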
I’m all for trying to come up with good definitions for model macro-states which throw away tiny implementation details that don’t matter, but this definition does not seem to me to carve the territory in quite the right way. It throws away details that do matter.
I think I agree that there are significant quibbles you can raise with the picture Chalmers outlines, but in general I think he’s pointing at an important problem for interpretability: it’s not clear what the relationship is between a circuit-level algorithmic understanding and the kind of statements we would like to rule out (e.g. ‘this system is scheming against me’).
I think what this is showing is that Chalmers’ definition of “dispositional attitudes” has a problem: it lacks any notion of the amount and kind of computational labour required to turn ‘dispositional’ attitudes into ‘occurrent’ ones. That’s why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.
This pretty much matches my sense so far, although I haven’t had time to finish reading the whole thing. I wonder whether this is due to the fact that he’s used to thinking about human brains, where we’re (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.
Clearly ANNs are able to represent propositional content, but I haven’t seen anything that makes me think that’s the natural unit of analysis.
I could imagine his lens potentially being useful for some sorts of analysis built on top of work from mech interp, but not as a core part of mech interp itself (unless it turns out that propositions and propositional attitudes really are the natural decomposition for ANNs, I suppose, but even then it would seem like a happy coincidence rather than something Chalmers has identified in advance).
Clearly ANNs are able to represent propositional content, but I haven’t seen anything that makes me think that’s the natural unit of analysis.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions. We even routinely communicate by uttering “statements”—which express propositions. So propositions are natural to us, which is why they are important for ANN interpretability.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions.
I’m not too confident of this. It seems to me that a lot of human cognition isn’t particularly propositional, even if nearly all of it could in principle be translated into that language. For example, I think a lot of cognition is sensory awareness, or imagery, or internal dialogue. We could contort most of that into propositions and propositional attitudes (e.g. ‘I am experiencing a sensation of pain in my big toe’, ‘I am imagining a picnic table’), but that doesn’t particularly seem like the natural lens to view those through.
That said, I do agree that propositions and propositional attitudes would be a more useful language to interpret LLMs through than, e.g., activation vectors of float values.
I wonder whether this is due to the fact that he’s used to thinking about human brains, where we’re (AFAIK) nowhere near being able to identify the representation of specific concepts, and so we might as well use the most philosophically convenient description.
I don’t think this description is philosophically convenient. Believing p and believing things that imply p are genuinely different states of affairs in a sensible theory of mind. Thinking through concrete mech interp examples of the former vs. the latter makes it less abstract in what sense they are different, but I think I would have objected to Chalmers’ definition even back before we knew anything about mech interp. It would just have been harder for me to articulate what exactly is wrong with it.
Something that Chalmers finds convenient, anyhow. I’m not sure how else we could view ‘dispositional beliefs’ if not as a philosophical construct; surely Chalmers doesn’t imagine that ANNs or human brains actively represent ‘p-or-q’ for all possible q.
To be fair here, from an omniscient perspective, believing P and believing things that imply P are genuinely the same thing in terms of results, but from a non-omniscient perspective, the difference matters.
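A toy way to see the point (a hedged sketch in a two-atom propositional language, nothing specific to ANNs or brains): identify a belief state with its set of consequences and the two stores below collapse into one; look at what is explicitly stored and they come apart.

```python
# "Omniscient" view: a belief state is its set of consequences, so {p} and
# {p, p-or-q} are the same state. "Non-omniscient" view: their explicit contents differ.
from itertools import product

ATOMS = ("p", "q")
WORLDS = [dict(zip(ATOMS, vals)) for vals in product([True, False], repeat=len(ATOMS))]

def entails(beliefs, formula):
    """True iff `formula` holds in every world where all of `beliefs` hold."""
    return all(formula(w) for w in WORLDS if all(b(w) for b in beliefs))

def p(w): return w["p"]
def p_or_q(w): return w["p"] or w["q"]

store_a = [p]            # explicitly stores only p
store_b = [p, p_or_q]    # explicitly stores p-or-q as well

print(entails(store_a, p_or_q), entails(store_b, p_or_q))  # True True: same consequences
print(store_a == store_b)                                  # False: different explicit contents
```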
I don’t like it. It does not feel like a clean natural concept in the territory to me.
Case in point:
I think what this is showing is that Chalmers’ definition of “dispositional attitudes” has a problem: it lacks any notion of the amount and kind of computational labour required to turn ‘dispositional’ attitudes into ‘occurrent’ ones. That’s why he ends up with AI systems having an uncountably infinite number of dispositional attitudes.
One could try to fix up Chalmers’ definition by introducing some notion of computational cost, or circuit complexity, or something of the sort, required to convert a dispositional attitude into an occurrent one, and then only list dispositional attitudes up to some cost cutoff c that we are free to pick as applications demand.
But I don’t feel very excited about that. At that point, what is this notion of “dispositional attitudes” really still providing us that wouldn’t be less cumbersome to describe in the language of circuits? There, you don’t have this problem. An AI can have a query-key lookup for the proposition p and simply not have a query-key lookup for the proposition p-or-q. Instead, if someone asks whether p-or-q is true, it first performs the lookup for p, then uses some general circuits for evaluating simple propositional logic to calculate that p-or-q is true. This is an importantly different computational and mental process from having a query-key lookup for p-or-q in the weights and directly performing that lookup, so we ought to describe a network that does the former differently from a network that does the latter. It does not seem like Chalmers’ proposed log of ‘propositional attitudes’ would do this. It would describe both of these networks the same way, as having the propositional attitude of believing p-or-q, discarding a distinction between them that matters for understanding the model’s mental state well enough to do things such as successfully predict its behavior in a different situation.
I’m all for trying to come up with good definitions for model macro-states which throw away tiny implementation details that don’t matter, but this definition does not seem to me to carve the territory in quite the right way. It throws away details that do matter.
I think I agree that there are significant quibbles you can raise with the picture Chalmers outlines, but in general I think he’s pointing at an important problem for interpretability: it’s not clear what the relationship is between a circuit-level algorithmic understanding and the kind of statements we would like to rule out (e.g. ‘this system is scheming against me’).
Agreed that there’s a problem there, but it’s not at all clear to me (as yet) that Chalmers’ view is a fruitful way to address that problem.
I do agree with that, although ‘step 1 is to identify the problem’.