Making the “stance” explicit

Often, when people explain the behavior of an AI system, one of Daniel Dennett’s stances is implicit in their explanation. When the stance stays implicit, I think it can often lead to misleading claims and sloppy reasoning. For this reason, I feel it can be really valuable to be deliberately explicit about which “stance” one is using. Here are Dennett’s three stances (in my own words):

Physical/structural stance: Explaining a system’s behavior mechanically, as the interaction of its parts with each other and with the outside environment. For example, we could explain the output of a neural net by looking at which combinations of neurons fired to bring that output about (a short code sketch after these three stances makes this concrete).

Design/​functional stance: Explaining a system by looking at what its function is, or what it was made to do. For example, we could explain the behavior of a thermometer by the fact that its function is to track the temperature.

Intentional stance: Explaining a system’s behavior as a product of beliefs, desires, or mental reasoning of the system. For example, my friend got up to get a cup of coffee because they believed there was coffee in the other room, wanted to have the coffee, and reasoned that to get the coffee they had to get up.
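To make the physical/structural stance a bit more concrete, here is a minimal, hypothetical sketch of the neural-net example: a tiny hand-wired network whose output we “explain” purely by reporting which units fired and how much each contributed. Every number, name, and size here is made up for illustration; it is not meant as a realistic interpretability method.

```python
import numpy as np

# A tiny hand-wired "network": 2 inputs -> 3 hidden ReLU units -> 1 output.
# All weights and inputs are made up purely for illustration.
W1 = np.array([[ 1.0, -1.0],
               [ 0.5,  0.5],
               [-1.0,  1.0]])
b1 = np.array([0.0, -0.4, 0.0])
w2 = np.array([1.0, 2.0, -1.0])

def explain_output(x):
    """A physical-stance 'explanation': report which hidden units fired
    and how much each one contributed to the final output."""
    hidden = np.maximum(W1 @ x + b1, 0.0)   # ReLU activations
    contributions = w2 * hidden             # per-unit contribution to the output
    for i, (h, c) in enumerate(zip(hidden, contributions)):
        state = "fired" if h > 0 else "stayed silent"
        print(f"hidden unit {i} {state} (activation {h:.2f}, contribution {c:+.2f})")
    print(f"output = {contributions.sum():.2f}")

explain_output(np.array([0.8, 0.2]))
```

Nothing in this kind of explanation mentions functions, beliefs, or goals; it is all parts and their interactions.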

The idea is that all three stances can in principle be applied to any system, but that for different systems each stance offers a different amount of explanatory power. For example, if I tried to understand my friend in terms of their design (perhaps the evolutionary pressures placed upon humans, or the in-lifetime learning applied to my friend’s brain), this would get me a lot less bang for my buck than my much simpler model of them as wanting coffee. Likewise, thinking of a thermometer as “wanting” to track the temperature and “believing” that the temperature outside does not match what it is displaying is unnecessarily complicated.

Statements can sometimes make implicit assertions about which stance is the most useful. For example, consider: “GPT is just a text predictor.” This statement correctly identifies that, from a design stance, GPT was made to predict the next token on a text corpus[1], but it goes a step further and implicitly claims that we should only understand GPT in this way (and that we shouldn’t be curious about explaining its behavior in other ways). I think it can be important to notice this kind of thing, and to try to make these implicit claims explicit.

Another example of the value of making this explicit is Reward is not the optimization target by Alex Turner. Turner points out that RL researchers quite carelessly use intentional language, speaking of RL agents as having the goal of obtaining reward. That claim could turn out to be true, but the fact that agents are “designed” (or rather trained) to follow high-reward policies does not directly imply that they will have reward as their goal (or have any goal at all). This careless language encourages careless thinking, and can lead us to misunderstand what we should expect from AI systems.

It might also be valuable for us to have more stances, or a more nuanced look at the ones we have. For example, when interacting with base models, I personally find that while I don’t feel like I gain much by modeling them as having goals[2], I gain a lot by modeling them as having beliefs. Dennett’s stances might be too anchored to an anthropocentric view of the world. Privileging the intentional stance makes sense, as we are particularly worried about agential risks from AI. I worry, however, that we aren’t being curious enough about looking for and explaining weird behavior of AI systems that isn’t well explained by any of the three stances. For example, take the flocking behavior of birds:
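For a rough sense of the mechanics involved, here is a minimal sketch of Reynolds-style “boids” rules (alignment, cohesion, separation), a standard toy model of flocking. All of the constants below are arbitrary illustration values, not anything tuned to real birds.

```python
import numpy as np

def boids_step(pos, vel, dt=0.1, neighbor_radius=2.0, sep_radius=0.5,
               w_align=0.05, w_cohere=0.01, w_separate=0.1, max_speed=1.5):
    """One update of Reynolds-style flocking rules.
    All constants are arbitrary illustration values."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        offsets = pos - pos[i]
        dists = np.linalg.norm(offsets, axis=1)
        neighbors = (dists > 0) & (dists < neighbor_radius)
        if neighbors.any():
            # Alignment: steer toward the average heading of nearby birds.
            new_vel[i] += w_align * (vel[neighbors].mean(axis=0) - vel[i])
            # Cohesion: steer toward the neighbors' center of mass.
            new_vel[i] += w_cohere * offsets[neighbors].mean(axis=0)
        too_close = (dists > 0) & (dists < sep_radius)
        if too_close.any():
            # Separation: steer away from birds that are too close.
            new_vel[i] -= w_separate * offsets[too_close].mean(axis=0)
        speed = np.linalg.norm(new_vel[i])
        if speed > max_speed:
            new_vel[i] *= max_speed / speed
    return pos + dt * new_vel, new_vel

def polarization(vel):
    """Crude flock-level order parameter: 1 means everyone heads the same way."""
    return np.linalg.norm(vel.mean(axis=0)) / np.linalg.norm(vel, axis=1).mean()

rng = np.random.default_rng(0)
pos = rng.uniform(-5, 5, size=(200, 2))   # random starting positions
vel = rng.normal(0, 1, size=(200, 2))     # random starting headings

print(f"polarization before: {polarization(vel):.2f}")
for _ in range(300):
    pos, vel = boids_step(pos, vel)
print(f"polarization after:  {polarization(vel):.2f}")
```

Each bird only reacts to its local neighbors, yet large flocks produce swirling, coordinated patterns at a scale that none of the per-bird rules mention directly.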

In this case, none of the three stances does a particularly good job of explaining the complex patterns that large flocks of birds make, and it seems like we need some kind of fourth stance in order to make good predictions about them. My hope is that if we are more explicit about which stance we are using in our explanations, we can also more easily notice what those stances might be leaving out, and what is left to explore and explain.

  1. Even from the design stance, things do get complicated. Chat models are the obvious example, being trained to do more than just prediction, but even sticking to self-supervised next-token prediction directly implies quite a lot that the word “just” might encourage us to ignore.

  2. I could say that a base model “wants” to minimize prediction error, but this feels like it doesn’t add anything over just taking the design stance.