I think it's too much of a simplification. Also, to be clear, I wasn’t talking specifically about reasoning models.
So, base models just complete text as seen on the internet. It seems clear to me, just from interacting with them, that calling them “predictors” is an apt description. But when you do multiple rounds of SFT + RLHF + reasoning-tuning, you get a model that generates something very different from normal text on the internet. It’s something that has been trained to give useful, nice, and correct answers.
How do you give correct answers? You need to reason. At least that’s how I choose to define the term! Whatever method allows you to go from question → correct answer, that’s what I mean by reasoning here. Whether CoT is “faithful”, in the sense that the reasoning the model actually uses is the same reasoning a human reading the CoT would summarize it as going through, is not part of my analysis; it’s a different question. GPT-3.5 also has some of the “reasoning” I’m talking about. The point is just that “tries to give a useful/correct answer” is a better high-level description than “tries to predict”.
Now, my claim is that this framing holds (to a lesser but still substantial degree) even when the model is reasoning about introspective stuff. Some of the “reasoning” modules inside the model that allow it to produce correct answers to questions in general are also engaged when you ask it about its inner experience. And this means you can’t just dismiss its outputs as predictions of a human response to the same question.
Now, I think the farther out of distribution you go, the more models will rely on the base-model prior. And I would expect the “reasoning” modules to activate most strongly and work best on verifiable domains, because that’s where they’ve gotten the most exercise. That’s why I said “and a mix of garbled pretraining stuff”.