I don’t have very solid takes on this, but I think the “predictor” frame breaks down after enough post-training. A modern chatbot like GPT5 or Claude Opus 4.1 is best described as some kind of reasoning entity, not just a predictor. I mean, it’s still a useful frame sometimes, but it’s definitely not clear-cut.
I think in novel situations, like when you try to get GPT5 to introspect, the response you get will be a mix of “real reasoning / introspection” and a garbled soup of pretraining stuff. Which contains much more than zero information about the inner life of Opus 4.1, mind you, but also very much cannot be taken at face value.
Side note: for an interesting example of a human pretending to be an AI pretending to be a human pretending to be an AI, see Scott’s Turing test if you haven’t already.
I think reasoning is different enough from what I’m talking about that I don’t want to argue for it here (it’s hard enough to even define what “real reasoning” means). I disagree that reasoning models are any more likely to have output representing their inner experiences than non-reasoning models do, though. Reasoning models are first trained to repeat what humans say, then they’re trained to get correct output on math problems[1]. Neither of these objectives would make them more likely to talk about their inner experiences / how they actually work / how they made decisions that aren’t represented in the chain of thought[2].
This is a simplification but not by much.
I’m probably more skeptical than most people about chain of thought monitoring, but I think if a model does long division in its chain of thought, it’s usually[3] going to be true that the chain of thought actually was instrumental to how it got the right answer.
But sometimes it’s not. Sometimes the model already knows the right answer but outputs some chain of thought because that’s what humans do. Or maybe it just needs more tokens to think with and the content doesn’t really matter.
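To make “instrumental” a bit more concrete, here is a rough sketch of the kind of perturbation check you could run: scramble the chain of thought, re-ask, and see whether the final answer survives. Everything here is made up for illustration; `ask` is just a stand-in for whatever model API you actually use.

```python
# Rough sketch of a CoT-perturbation check. Nothing here is a real API:
# `ask` is whatever function you have that maps a prompt to a final answer.
import random
from typing import Callable

def scramble(cot: str) -> str:
    """Destroy the content of the chain of thought while keeping its length."""
    words = cot.split()
    random.shuffle(words)
    return " ".join(words)

def cot_looks_instrumental(
    ask: Callable[[str], str],
    question: str,
    cot: str,
    original_answer: str,
) -> bool:
    """If scrambling the CoT changes the answer, the CoT was probably doing real work.
    If the answer survives, the model may have already 'known' it, or just needed tokens."""
    perturbed = f"{question}\n{scramble(cot)}\nAnswer:"
    return ask(perturbed).strip() != original_answer.strip()
```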
I think it is a simplification, and by a lot. Also, to be clear, I wasn’t talking specifically about reasoning models.
So, base models just complete text as seen on the internet. It seems clear to me, just by interacting with them, that calling them “predictors” is an apt description. But when you do multiple rounds of SFT + RLHF + reasoning-tuning, you get a model that generates something very different from normal text on the internet. It’s something that has been trained to give useful, nice, and correct answers.
How do you give correct answers? You need to reason. At least that’s how I choose to define the term! Whatever method allows you to go from question → correct answer, that’s what I mean by reasoning here. Whether the CoT is “faithful”, in the sense that the reasoning the model actually uses is the same reasoning a human would summarize the CoT as going through when reading it, is not part of my analysis; that’s a different question. GPT3.5 also has some of the “reasoning” I’m talking about. The point is just that “tries to give a useful/correct answer” is a better high-level description than “tries to predict”.
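As a toy illustration of what “trained to give correct answers” looks like mechanically (a made-up sketch, not any lab’s actual reward code): reasoning-tuning typically scores only the final answer against a verifiable ground truth, so completions with very different “reasoning” text can earn the same reward.

```python
# Toy sketch of a verifiable reward, the kind of signal reasoning-tuning
# optimizes. The "Answer:" convention and function names are assumptions.

def extract_final_answer(completion: str) -> str:
    """Take whatever follows the last 'Answer:' marker."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the final answer matches the ground truth, else 0.0."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# Careful long division and filler tokens are rewarded identically,
# as long as the final answer checks out:
print(verifiable_reward("1702 / 23: 23*70 = 1610, remainder 92, 92/23 = 4... Answer: 74", "74"))  # 1.0
print(verifiable_reward("hmm hmm thinking thinking Answer: 74", "74"))                            # 1.0
```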
Now, my claim is that this framing holds (to a lesser but still substantial degree) even when the model is reasoning about introspective stuff. Some of the “reasoning” modules inside the model that allow it to produce correct answers to questions in general are also turned on when you ask it about its inner experience. And this means you can’t just dismiss its outputs as predictions of a human response to the same question.
Now, I think the farther out of distribution you go, the more models will rely on the base-model prior. And I would expect the “reasoning” modules to activate most strongly and work best on verifiable domains, because that’s where they’ve gotten the most exercise. That’s why I said “and a mix of garbled pretraining stuff”.