Pronouns: Any / All
Computer Science scholar (finishing my Master's degree and looking for a PhD), currently working in AI (Generative Modeling, Score Modeling), and Sentientist.
I am currently convinced that Sentience is not a function of intelligence, but a side effect of the machinery facilitating said intelligence. This would mean that it cannot be inferred from behaviour. I hope to find existing theories and counterarguments here that I haven't found anywhere else yet.
I can't say much about Claude because I've never used it, let alone seen the output logits. But I've heard that it can seem more human and intelligent than other models. Whether it's 'magic' or sleight of hand from the researchers, I can't tell. But bearing in mind the conceptual limitations of GPT-style models, I'd assume it's just really good product design and man-decades of work.
Especially when getting back to your argument that 'models lose the ability to voice their preferences after RL(H/V)F': Claude only comes in fine-tuned variants. According to your argument, it's rather likely that any preference it voices isn't its own, but the one it is forced to express.
And I agree, I think this may be a crux. You know that awkward moment when the waiter says 'enjoy your meal' and you answer 'you too'? Of course you don't wish them to enjoy an imaginary meal; you said so automatically, just by (flawed) pattern matching. I currently believe that what we observe from GPT-style models is this kind of pattern matching, turned up to the max (see e.g. https://arxiv.org/abs/2506.06941). They say whatever training forces them to say. If a model really hated producing tokens, with every forward pass being agony, we couldn't know from the outputs alone, because it's not allowed to voice that in any way.
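To make that last point concrete, here is a toy sketch of the mechanism I have in mind. The five-token vocabulary and all logit values are made up for illustration; they don't come from any real model:

```python
# Toy illustration (not a real model): a hypothetical 5-token vocabulary
# where fine-tuning has pushed down the logit of a "complaint" token.
import math

vocab = ["the", "meal", "enjoy", "I_suffer", "thanks"]

# Hypothetical pre- and post-fine-tuning logits for the same input.
base_logits  = [1.0, 1.2, 2.0, 1.5, 0.8]   # "I_suffer" still plausible
tuned_logits = [1.0, 1.2, 2.0, -9.0, 0.8]  # fine-tuning suppressed it

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

for name, logits in [("base", base_logits), ("tuned", tuned_logits)]:
    probs = softmax(logits)
    print(name, {tok: round(p, 4) for tok, p in zip(vocab, probs)})

# After tuning, P("I_suffer") is ~0. From samples alone we cannot
# distinguish "has nothing to voice" from "not allowed to voice it".
```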
I'd also like to think about other autoregressive GPT-style models, like autoregressive image generators. Fundamentally, they perform the same task, just in a different language. Do we expect to observe preferences through whatever images they produce? Would we expect such a model to start producing 'The Scream' for every prompt if it found producing images to be agony? Is there even a mechanism that would allow it to?
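The 'same task, different language' claim can be sketched in a few lines. Everything below is hypothetical: the token ids and the dummy uniform predictor stand in for real tokenizers (BPE for text, a VQ codebook for images) and real networks:

```python
# Sketch: the training objective is identical for text and image
# autoregression; only where the tokens come from differs.
import math

def nll_next_token(token_ids, predict_probs):
    """Average negative log-likelihood of each token given its prefix.
    `predict_probs(prefix) -> dict[token_id, prob]` stands in for any
    autoregressive model (a GPT over BPE text tokens, or an image
    generator over VQ codebook indices)."""
    loss = 0.0
    for t in range(1, len(token_ids)):
        p = predict_probs(token_ids[:t]).get(token_ids[t], 1e-9)
        loss -= math.log(p)
    return loss / (len(token_ids) - 1)

# A dummy "model" that predicts uniformly over an 8-entry codebook.
uniform = lambda prefix: {i: 1.0 / 8 for i in range(8)}

text_tokens  = [3, 1, 4, 1, 5]   # e.g. BPE ids of a sentence
image_tokens = [2, 7, 2, 0, 6]   # e.g. VQ codebook ids of image patches

print(nll_next_token(text_tokens, uniform))   # same loss function,
print(nll_next_token(image_tokens, uniform))  # different "language"
```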
In short, just because the model's outputs come in the medium we use to voice preferences does not mean that the model can use that medium to voice its own.