Quasi-beliefs
This shortpost is just a reference post for the following point:
It’s very easy for conversations about LLM beliefs or goals or values to get derailed by questions about whether an LLM can genuinely be said to believe something, to have a goal, or to hold a value. These are valid questions! But there are other important questions about LLMs that touch on these subjects, which don’t turn on whether an LLM belief is a “real” belief. It’s not productive for those discussions to be so frequently derailed.
I’ve taken various approaches to this problem in my writing, but David Chalmers, in his recent paper ‘What We Talk to When We Talk to Language Models’ (pp 3–6), introduces a useful piece of terminology. He proposes that we use terms like ‘quasi-belief’ to set those questions aside, to denote that the point we’re making doesn’t rely on LLM beliefs being ‘real’ beliefs in some deep sense:
The view I call quasi-interpretivism says that a system has a quasi-belief that p if it is behaviorally interpretable as believing that p (according to an appropriate interpretation scheme), and likewise for quasi-desire. This definition of quasi-belief is exactly the same as interpretivism’s definition of belief. The only difference is that where standard interpretivism offers these definitions as a theory of belief, quasi-interpretivism does not. It offers them simply as a stipulative theory of quasi-belief. Quasi-interpretivism does not say anything about whether LLMs have beliefs and desires. But it does make it plausible to say that LLMs have quasi-beliefs and quasi-desires, on the grounds that LLMs are at least interpretable in the right way. Even if quasi-beliefs and quasi-desires fall short of being genuine beliefs and desires, they can still play some of the key roles of beliefs and desires in explaining behavior. For example, if an LLM quasi-believes that giving a certain solution would be the most helpful thing it could do to solve a problem, and it quasi-desires to do the most helpful thing it can, then other things being equal, it will [give that solution].
[emphasis mine]
I expect to largely adopt this terminology going forward, linking back to this shortpost as needed. For convenience, I expect to extend it slightly to include terms like ‘quasi-goal’ and ‘quasi-value’. Also, unless otherwise specified, if I use a term like ‘quasi-belief’, later occurrences of ‘belief’ in the same text should be read as ‘quasi-belief’.
You might like my quick take from a week ago https://www.lesswrong.com/posts/ydfHKHHZ7nNLi2ykY/jan-betley-s-shortform?commentId=fEh8jnfTrfkQFf3mD
Ah, yep, totally! I actually searched to see if anyone else had ~written this, but I think maybe shortposts don’t show up as search results.
There’s also @eleni-angelou’s The Intentional Stance, LLMs Edition from April 2024; like you, she points to the connection to Dennett.
IIRC Eric Schwitzgebel wrote something in a similar vein (not necessarily about LLMs, though he has been interested in this sort of stuff too, recently). I’m unable to dig out the most relevant reference atm but some related ones are:
https://faculty.ucr.edu/~eschwitz/SchwitzAbs/PragBel.htm
https://eschwitz.substack.com/p/the-fundamental-argument-for
https://faculty.ucr.edu/~eschwitz/SchwitzAbs/Snails.htm (relevant not because it talks about beliefs (I don’t recall it does) but because it argues for the possibility of an organism being “kinda-X” where X is a property that we tend to think is binary)
Also: https://en.wikipedia.org/wiki/Alief_(mental_state)
Understandable attempt by Chalmers, but I’d say that bit, at least, is opposite to the direction of clarity.
The idea
> if it is behaviorally interpretable as believing that p
reinforces that even if we can’t rely on “beliefs” of AI systems to mean what they usually mean, we can rely on “behavior” of AI systems to mean what it usually means, typically with humans or some other animal as the reference class. You might try to fix that with the same trick, adding a quasi- prefix to behavior and calling it “quasi-behavior”, but then you have to specify what your new grounding for quasi-behavior is. And so on.
It feels tempting to use—or no, it feels unfair to be denied—some handle that serves the felt sense of “But when I interact with Claude, it is very useful and predictive to see it as ‘planning’ to troubleshoot X and ‘believing’ that some file is in some folder. Isn’t it better for me to flag with quasi- how it’s sort of true and sort of false?”
The problem with “quasi-” is that it is trying to avoid the spikiness/jaggedness/alienness of what we might call AI minds, whereas good frames and vocabulary should remind us to be constantly vigilant about the differences in different contexts. That we can’t get away with “sort of true and sort of false.” That instead, we should be paying attention to the fine-grained differences in each context, and how extrapolation will fail. That’s how you respect the alien-ness.
In the link, Chalmers dismisses such concerns:
> An opponent might deny that LLMs have quasi-beliefs or quasi-desires on the grounds that LLM behavior is unstable, or non-humanlike, or otherwise defective in a way that means that the LLM is not even usefully interpretable in terms of beliefs or desires. [...] A core of consistency is enough for interpretation to get a grip in ascribing numerous quasi-beliefs and quasi-desires, even though there will be domains where they lack these states on grounds of inconsistency. Overall I think that experience with current LLMs suggests that there is enough of a consistent core to support a reasonably extensive core of quasi-beliefs.
A better analogy that I’ve proposed before is rationalization. Calling rationalization “quasi-rationality” makes the absurdity clearer. Rationalization isn’t sort-of rational. Rationalization doesn’t “play some key roles” of rationality. Rationalization and rationality do not share a “core” that is usefully co-extensive.
Don’t underestimate the adversarial institutional reification of anthropomorphism here. Don’t mistake anti-inductive nature for harmless un-inductive nature or worse, inductive nature. That’s like mistaking rationalization for just noise, or worse, essentially rationality. Rationalization is a kind of referential parasitism on the phenomena of rationality, and the reason to consider it adjacent to rationality is only to be watchful of how it is cleverly simulating your familiar notion.
This is the Sharp Left Turn of referential alignment. Don’t fall for the similarity. AI minds and bodies do not refer like human minds and bodies do. Our referential activity may be very similar up to a point (and incrementally duct-taped and patched to fix any seeming discrepancies) and then totally bizarre beyond specific contexts. Reliance on some “core” will create bad shocks.
I’ve critiqued elsewhere the dependence on northstars of “cores”, “invariants”, “convergences” in general, as only being able to deal with the intersection of phenomena. Hopefully this becomes more compelling as alternative methodologies become possible with AI-assistance, as outlined somewhat in the previous link. (Also more compelling as AI systems prove our existing models and metaphors cannot be simply repurposed with minor modifications.) Instead of talking about quasi-beliefs, you might create a label for the X-belief for each different context X, that may have extremely specific connections and disconnections with the various implications we tend to assume for beliefs. This would require tracking “disorders”, where AI systems absurdly do only some of the things that you would normally do with “beliefs” and “selfs”.
The commitment to non-anthropomorphism is now more clearly an ongoing practice beyond words, not something we can do with abstract analysis or redefining terms. It will soon be as hard as or harder than any other systemic issue today, with subtle ideological collusion to keep you convinced that the artificial substitute is basically no different from the real thing and that any toxic seams will be ironed out in v0.4.
(Don’t mistake this as being apathetic to potential machine suffering. On the contrary, this vigilance of our projections should mitigate issues of reverse alignment—where we assume happy text content output is synonymous with machine welfare but they’re suffering inside. Dealmaking proposals are often great examples of this unthinkingness.)
Thanks.
> [Chalmers claims that] we can rely on “behavior” of AI systems to mean what it usually means, typically with humans or some other animal as the reference class. You might try to fix that with the same trick, adding a quasi- prefix to behavior and calling it “quasi-behavior”, but then you have to specify what your new grounding for quasi-behavior is. And so on.
I read ‘behavior’ as pointing to something explicit and observable (eg ‘the LLM produced the following sequence of tokens’), which doesn’t have the sort of ambiguity that would make the ‘quasi-’ prefix necessary.
I think one could make an argument that ‘interpretable as’ is questionable, since any behavior can be interpreted in arbitrarily many ways[1] — but that doesn’t seem like the argument you’re making.
It may be helpful here to clarify that the intention with the ‘quasi-’ terminology isn’t to claim to have resolved what relationship LLM ‘beliefs’ bear to beliefs in the usual sense; there are a range of stances that could be taken on that. The intention, at least for me, is to be able to talk about something other than that relationship, which is often valuable.
While this matters for me more for research purposes, it can even be completely prosaic. When we talk about an LLM writing code, it might be helpful to discuss whether it believes itself to be writing code for Mac or Linux or Windows, since those might involve different library calls. Once that’s mentioned, there are people who will promptly speak up to say ‘Ha ha no, you’re totally confused, LLMs don’t have beliefs’[2]. At that point it’s helpful to be able to say, ‘Fine, but does it quasi-believe itself to be writing code for Linux?’ rather than have the question of which library it’s likely to call derailed by a lengthy digression about the status of beliefs in LLMs.
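As a purely illustrative sketch of the platform point (the `config_dir` helper and the `myapp` name are hypothetical, not from any real codebase), the same request for ‘the standard config location’ resolves to different environment variables and paths depending on which OS the model quasi-believes it’s writing for:

```python
import os
import sys
from pathlib import Path

# The same request ("save the config file in the standard place") resolves to
# different conventions per platform; a model writing this code has to (quasi-)settle
# on one of these branches, whether or not it writes the dispatch out explicitly.
def config_dir(app_name: str) -> Path:
    if sys.platform.startswith("win"):
        return Path(os.environ["APPDATA"]) / app_name  # Windows convention
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Application Support" / app_name  # macOS convention
    xdg = os.environ.get("XDG_CONFIG_HOME", str(Path.home() / ".config"))
    return Path(xdg) / app_name  # Linux/XDG convention

print(config_dir("myapp"))
```

Asking whether the model quasi-believes it’s targeting Linux is just asking which of these branches its output will look like.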
[1] This is a reasonable argument, but often the natural interpretation isn’t under dispute—eg we can generally agree that some of the behavior exhibited by Atari game-playing AI is most naturally interpretable as trying to increase the score.
[2] You can see some examples of this sort of thing in Robert Wright’s recent podcast with Emily Bender and Alex Hanna.
I’d guess this terminology is fairly applicable to humans too?
I’m having trouble seeing why someone would want to apply it to humans, since it’s generally not in question that humans can have real beliefs and real desires. But I guess if there were uncertainty about whether some particular person has real beliefs, we could set that uncertainty aside by talking about their quasi-beliefs[1].
[1] In the interest of having a somewhat forced concrete example, maybe we’ve started to suspect that our friend Dan is a p-zombie and we often debate that, but right now we just want to talk about whether he’s figured out that we’re planning a surprise party for him, so we set aside the p-zombie issue by talking about whether Dan quasi-believes our story that we only bought confetti in case there was a confetti shortage coming up.
I would guess the type signature of human beliefs and goals and desires is at least fairly often closer to the LLM quasi-x than to the crisp mathematical idealizations of those concepts.
Humans are kinda a world model with a self-character; I think distancing LLMs from this by implying that LLMs’ beliefs, goals, and desires are super different brings people’s beliefs further from tracking reality.
I think that in ordinary usage, whatever sort of things humans have, that’s what we mean when we say ‘belief’, ‘goal’, etc. Insofar as anyone thinks those are crisp mathematical abstractions, that seems like a separate and additional claim. I worry that saying ‘humans don’t actually have beliefs’ makes it pretty unclear what ‘belief’ even means[1].
As James points out in another comment, the ‘quasi-’ framing is solely intended to set aside questions about whether LLM beliefs (etc) are ‘real’ beliefs and whether they’re fundamentally the same as human beliefs, not to take a stance that they’re not. Chalmers: ‘Quasi-interpretivism does not say anything about whether LLMs have beliefs and desires’. There are a lot of interesting and safety-relevant discussions to be had about what LLMs believe in a practical sense (eg ‘Does this model believe that Paris is in France or Germany?’), and I see this terminology as basically just a way to prevent such discussions from being counterproductively derailed by questions about whether a model can actually believe anything at all.
[1] Maybe it’s suggesting a highly deflationary stance, in the same way that illusionists think humans aren’t actually conscious? But consciousness is a highly abstract and contested topic, whereas there’s a pretty ordinary and uncontested sense in which humans believe things, have desires, etc.
Seems worthwhile as a way to simplify conversations with people who seem to be confused, but I think this isn’t a reality-mapping exercise and probably makes it harder to see the structure of reality, which is kinda sad even if useful for talking with some people?
I agree that the terminology is useful to bracket metaphysical discussion of LLM mental states but I’d just caution us as a community to use the term ‘quasi-belief’ really carefully. Specifically, I could see it being employed to import heavyweight metaphysical assumptions that aren’t justified or are lightly argued for.
Concretely, there are two potential ways to use it:
1) I don’t know if LLMs have genuine beliefs and it’s not load-bearing for my argument, so let me bracket the conversation by using the term ‘quasi-belief.’
2) LLMs don’t have genuine beliefs; instead they have ‘quasi-beliefs’.
I think 1) is totally fine and is the intended usage. 2) is only fine if it’s backed up with some solid argument.
To be sure, your post and the Chalmers paper use it correctly as 1) but I could see its meaning slipping to 2) as it gets more widely deployed.
I agree entirely that ‘quasi-belief’ is solely a way of setting aside those questions and shouldn’t be taken as a claim about the answers, much less as a load-bearing argument in its own right.
I’ve been using a tilde (e.g. ~belief) for denoting this, which maybe has less baggage than “quasi-” and is a lot easier to type.
It’s funny, one of the main use-cases of this terminology is when I’m talking to LLMs themselves about these things.
One thing that would make me hesitate to use ~ is that it already commonly means ‘approximately equal to’ (as a more-easily-typed substitute for ≈). That certainly feels like a related meaning, but what I appreciate about Chalmers’ coinage is that it’s very precise about what you are and are not claiming.
Assuming people accept the model that LLM behavior is primarily determined by modeling the behavior of some subset of the human writers it was trained on (such that fine-tuning works primarily by shaping which subset of humans the model emulates), it might be simplest to ask whether the model “behaves like a person who believes X”.
This framing carries practical benefits (again, so long as you agree with the assumption above), in that the fine-tuning paradigm can be examined in the context of identifying what causes the model to upweight, say, the “a black-hat hacker is writing this document” neuron, and check this work against human data. This has already shown experimental success in the opposite direction—if you train a model to upweight the “Hitler is writing this document” neuron, the model will write as if it believes itself to be Hitler.
I agree that there are a lot of interesting questions about how to think about the subset of training data authors and characters that shape a particular model response, and I think it would be an interesting project to try to define a useful metric on that. I see that as separate from Chalmers’ coinage here, though, which is more about specifying what questions we’re not trying to answer.
I have also seen conversations get derailed based on such disagreements.
> I expect to largely adopt this terminology going forward
May I ask to which audience(s) you think this terminology will be helpful? And what particular phrasing(s) do you plan on trying out?
The quote above from Chalmers is dense and rather esoteric, so I would hesitate to use its particular terminology for most people (the ones likely to get derailed as discussed above). Instead, I would seek out simpler language. As a first draft, perhaps I would say:
> Let’s put aside whether LLMs think on the inside. Let’s focus on what we observe—are these observations consistent with the word “thinking”?
Good point that the Chalmers quote isn’t going to be helpful to everyone. In practice, I’m mostly imagining giving a quick informal sense of what I mean by eg ‘quasi-thinking’, or even just having a parenthetical aside with a link back to this post if people want to dive deeper, eg I might write something like
> It seems clear that LLMs believe (or quasi-believe) most of the facts presented in synthetic document fine-tuning.
I think you’re right in pointing to observable consequences in your paraphrase. In informal discussion, I’ve found it useful to say things like
> When I say ‘the model has goal X’, I don’t mean to make a claim about whether the model ‘really’ has goals in some deep sense; I just mean that for practical purposes the model consistently behaves as if it has goal X.
I’ve edited the original post slightly to give a plainer meaning before the Chalmers quote.