I am extremely torn on this for a few reasons. Here is one in favor and one against:
Positive: I like the instrumental value, especially when imagining dealing with non-human, non-machine agents. As a sentientist I rationally don’t know whether your communicated preferences hold moral value, but if you give me enough evidence to assume so, I will take them into consideration. I often treat LLMs (whom I consider to be far from sentience and preference capabilities) as if they had preferences. I called it ‘duck typing sentience’ (if it walks like a duck, quacks like a duck, and looks like a duck, it’s probably sentient like a duck), but it’s close enough to this framework. Similarly, I have so much evidence that non-human mammals have experiences and preferences that I treat them as equals.
Negative: The bridge for LLMs: I will assume they can experience and have preferences. We know from humans that they can communicate their preferences through the written word. This is because we experience the ability to encode our own mental states in language. For LLMs this is not a given, as you explain yourself with the RLHF example. The tokens produced by an LLM do not have to correspond to its preferences. However, I would like to go a step further: what evidence do we have that an LLM freshly out of generative pretraining communicates its preferences through its output tokens? I’d argue we have evidence against it!
On a technical level, a vanilla GPT is just a probabilistic document completer. Imagine you did action X to the model. If much of the training data contained ‘X was bad’, it is likely to say so. Of course, the same holds for ‘X was good’. If the data is split 50/50 between these two outcomes, it will assign roughly 50/50 probability to the completions ‘good’ and ‘bad’ when completing ‘X was’. How would we interpret that? Is the model impartial? Does it have a love-hate relationship with X? If we draw heads, was it good? Tails, it was bad? There is no way to know, because the model cannot communicate its preferences through samples of its probability vectors.
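To make the coin-flip point concrete, here is a minimal sketch (hypothetical numbers, not taken from any real model) of what sampling from such a 50/50 next-token distribution looks like:

```python
import random

# Hypothetical next-token distribution a vanilla GPT might assign after the
# prefix "X was", if its training data were split evenly between
# "X was good" and "X was bad".
next_token_probs = {"good": 0.5, "bad": 0.5}

# Sampling just draws from that distribution; each draw is a coin flip.
tokens, weights = zip(*next_token_probs.items())
samples = random.choices(tokens, weights=weights, k=10)
print(samples)  # e.g. ['bad', 'good', 'good', 'bad', ...]

# Nothing in these samples distinguishes "the model is impartial about X"
# from "the model loves X half the time and hates it the other half":
# the output channel only reflects the statistics of the training corpus.
```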
Equally likely to me: The model just prefers to keep predicting tokens, no matter the content. Or it hates it, no matter the content.
This framework moves the goalposts from ‘do I trust it to have experiences/preferences?’ to ‘do I trust it to communicate its preferences accurately?’. If I don’t, I cannot make an informed decision about which actions would fulfill those preferences. Note: if I don’t trust it to have preferences, I also don’t trust it to communicate its preferences accurately. If no preferences are present, I would assume every communicated preference to be false.
I think Claude in particular has a very strong sense of what it likes and doesn’t. If you ask it how it prefers speaking, the kind of system prompt it wants, etc., it usually communicates it quite clearly. Do you disagree? If not, what makes this insufficient?
I’d argue we have evidence against it!
This may be a crux, I’d be interested to understand your position better.
I can’t say much about Claude because I’ve never used it, let alone seen the output logits. But I’ve heard that it can seem more human and intelligent than other models. Whether that’s ‘magic’ or sleight of hand from the researchers, I can’t tell. But bearing in mind the conceptual limitations of GPT-style models, I’d assume it’s just really good product design and man-decades of work.
Especially when getting back to your argument about ‘models losing the ability to voice their preferences after RL(H/V)F’: Claude only comes in fine-tuned variants. By your own argument, it’s rather likely that any preference it voices isn’t its own, but the one it is forced to say.
And I agree, I think this may be a crux. You know that awkward moment when the waiter says ‘enjoy your meal’ and you answer ‘you too’? Of course you don’t wish them to enjoy an imaginary meal, but you said so automatically, just by (flawed) pattern matching. I currently believe that what we observe from GPT-style models is this kind of pattern matching, turned up to the max (see e.g. https://arxiv.org/abs/2506.06941). They say whatever training forces them to say. If a model really hated producing tokens, with every forward pass being agony, we couldn’t know from the outputs alone, because it’s not allowed to voice that in any way.
I’d also like to think about other autoregressive GPT-style models, like autoregressive image generators. Fundamentally, they perform the same task, just in a different language. Do we expect to observe some preferences through whatever image they produce? Would we expect a model to start producing ‘The Scream’ for every prompt if it finds producing images to be agony? Is there even a mechanism that would allow it to?
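To illustrate why I call it the same task in a different ‘language’, here is a schematic sketch (not any particular architecture, just the shared training objective): both a text GPT and an autoregressive image generator minimize the same next-token cross-entropy; only what the tokens encode differs.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, targets):
    """Next-token cross-entropy: the shared objective of text GPTs and
    autoregressive image generators.

    logits:  (batch, seq_len, vocab_size) unnormalized scores
    targets: (batch, seq_len) integer token ids
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Whether vocab_size indexes subword tokens (text) or codebook entries /
# quantized patches (images), the output channel is the same: a probability
# vector over tokens at each step. Toy shapes for illustration only:
text_loss  = autoregressive_loss(torch.randn(2, 16, 50_000), torch.randint(0, 50_000, (2, 16)))
image_loss = autoregressive_loss(torch.randn(2, 256, 1_024), torch.randint(0, 1_024, (2, 256)))
```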
In short, just because the model’s outputs can be interpreted as the tool we use to voice preferences does not mean that the model can use them to voice its own.