Ways training incentivizes and disincentivizes introspection in LLMs.
Recent work has shown some LLMs have some ability to introspect. Many people were surprised to learn LLMs had this capability at all. But I found the results somewhat surprising for another reason: models are trained to mimic text, both in pre-training and fine-tuning. Almost every time a model is prompted in training to generate text related to introspection, the answer it’s trained to give is whatever answer the LLMs in the training corpus would say, not what the model being trained actually observes from its own introspection. So I worry that even if models could introspect, they might learn to never introspect in response to prompting.
We do see models act consistently with this hypothesis sometimes: if you ask a model how many tokens it sees in a sentence, or instruct it to write a sentence that contains a specific number of tokens, it won’t answer correctly.[1] But the model probably “knows” how many tokens there are; it’s an extremely salient property of the input, and the space of possible tokens is a very useful thing for a model to know since it determines what it can output. At the very least, models can be trained to count tokens semi-accurately and conform their outputs to short token limits.
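As a concrete illustration of the kind of check this claim is based on, here is a minimal sketch that compares a model’s self-reported token count against a real tokenizer’s count. The tokenizer choice (tiktoken’s cl100k_base) and the ask_model helper are assumptions standing in for whichever model and chat API you actually test.

```python
# Minimal sketch: compare a model's self-reported token count to a tokenizer's count.
# Assumptions: tiktoken's cl100k_base encoding as a stand-in tokenizer, and a
# hypothetical ask_model(prompt) -> str wrapper around whatever chat API you use.
import re

import tiktoken


def true_token_count(text: str) -> int:
    """Count tokens the way the assumed tokenizer would."""
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


def reported_token_count(text: str, ask_model) -> int | None:
    """Ask the model to introspect on its own tokenization of `text`."""
    prompt = (
        "How many tokens (under your own tokenizer) does the following sentence "
        f"contain? Reply with a single integer.\n\nSentence: {text}"
    )
    match = re.search(r"\d+", ask_model(prompt))
    return int(match.group()) if match else None


if __name__ == "__main__":
    sentence = "The quick brown fox jumps over the lazy dog."
    print("tokenizer says:", true_token_count(sentence))
    # print("model says:", reported_token_count(sentence, ask_model))  # needs a real API wrapper
```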
I presume the main reason models answer questions about themselves correctly at all is that AI developers very deliberately train them to do so. I bet that training doesn’t directly involve introspection, in the sense of strongly noting the relationship between the model’s internal activations and the wider world.
So what could be going on? Maybe the way models learn to answer those questions about themselves generalizes to questions they weren’t trained on? Or maybe introspection is specifically useful for answering those questions, and instead of memorizing some facts about themselves, models learn to introspect (this could especially explain why they can articulate what they’ve been trained to do via self-awareness alone).
But I think the most likely dynamic is that in RL settings[2] introspection that affects the model’s output is sometimes useful. Thus it is reinforced. For example, if you ask a reasoning model a question that’s too hard for it to know the answer to, it could introspect to realize it doesn’t know the answer (which might be more efficient than simply memorizing every question it does or doesn’t know the answer to). Then it could articulate in the CoT that it doesn’t know the answer, which would help it avoid hallucinating and ultimately produce the best output it could given the constraints.
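To make that incentive concrete, here is a toy expected-reward calculation; the reward values are my own illustrative assumptions, not anything from an actual training setup. If confidently wrong answers are penalized at all, a policy that can notice “I don’t know the answer” beats blind guessing whenever its accuracy on those questions is low enough, so any internal signal that distinguishes the two cases, introspective or not, gets reinforced.

```python
# Toy model of why RL could reinforce admitting "I don't know" over guessing.
# Assumed reward scheme (illustrative only): +1 for a correct answer,
# -1 for a confidently wrong one, 0 for admitting uncertainty.
R_CORRECT, R_WRONG, R_ABSTAIN = 1.0, -1.0, 0.0


def expected_reward_guess(p_correct: float) -> float:
    """Policy that always answers, and is right with probability p_correct."""
    return p_correct * R_CORRECT + (1 - p_correct) * R_WRONG


def expected_reward_abstain() -> float:
    """Policy that notices it doesn't know and says so."""
    return R_ABSTAIN


for p in (0.1, 0.3, 0.5, 0.7):
    better = "abstain" if expected_reward_abstain() > expected_reward_guess(p) else "guess"
    print(f"p(correct)={p:.1f}: guess={expected_reward_guess(p):+.1f}, "
          f"abstain={expected_reward_abstain():+.1f} -> better to {better}")
```

Under these assumed numbers, abstaining wins whenever the model’s chance of guessing right is below 50%, which is exactly the regime where an internal “I don’t actually know this” signal is worth surfacing in the CoT.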
One other possibility is that the models are just that smart/self-aware and aligned toward being honest and helpful. They might have an extremely nuanced world-model, and since they’re trained to honestly answer questions,[3] they could just put the pieces together and introspect (possibly in a hacky or shallow way).
Overall, these dynamics make introspection a very thorny thing to study. I worry it could go undetected in some models, or that a model could seem to introspect in a meaningful way when it only has shallow abilities reinforced directly by processes like the above (for example, knowing when it doesn’t know something [because that might have been learned during training], but not knowing in general how to query its internal knowledge in other related ways).
[1] At least, not on any model I tried. They occasionally get it right by chance; they give plausible answers, just not precisely correct ones.
[2] Technically this could apply to fine-tuning settings too, for example if the model uses a CoT to improve its final answers enough to justify the CoT not being maximally likely tokens.
[3] In theory at least. In reality I think this training does occur, but I don’t know how well it can pinpoint honesty vs. several things that are correlated with it (and for things like self-awareness, those subtle correlates with truth in training data seem particularly pernicious).
But the model probably “knows” how many tokens there are; it’s an extremely salient property of the input
This doesn’t seem that clear to me; what part of training would incentivize the model to develop circuits for exact token-counting? Training a model to adhere to a particular token budget would do some of this, but it seems like it would put relatively light pressure on getting exact estimates right vs. guessing things to the nearest few hundred tokens.
We know from humans that it’s very possible for general intelligences to be blind to pretty major low-level features of their experience; you don’t have introspective access to the fact that there’s a big hole in your visual field or the mottled patterns of blood vessels in front of your eye at all times or the ways your brain distorts your perception of time and retroactively adjusts your memories of the past half-second.
One way to test this would be to see if there are SAE features centrally about token counts; my guess would be that these show up in some early layers but are mostly absent in places where the model is doing more sophisticated semantic reasoning about things like introspection prompts. Of course this might fail to capture the relevant sense of “knowing,” etc., but I’d still take it as fairly strong evidence either way.
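A cheaper first pass on this, before training SAEs, might be linear probes on the residual stream: fit a probe at each layer to predict the prompt’s token count and see where that information is linearly available. This is a sketch under several assumptions (gpt2-small via TransformerLens, toy repeated-word prompts, ridge regression as the probe), and a probe recovering the count is of course weaker evidence than an SAE feature that is centrally about token counts.

```python
# Sketch: probe each layer's residual stream for the prompt's token count.
# Assumptions: gpt2-small via TransformerLens; toy prompts of varying length;
# a ridge-regression probe as a cheap stand-in for the proposed SAE check.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Prompts of varying token count (real experiments would want far more varied text).
prompts = [" hello" * n for n in range(5, 60, 2)]

X_by_layer = {layer: [] for layer in range(model.cfg.n_layers)}
token_counts = []
for prompt in prompts:
    tokens = model.to_tokens(prompt)  # shape [1, seq_len], includes BOS
    _, cache = model.run_with_cache(tokens)
    token_counts.append(tokens.shape[1])
    for layer in range(model.cfg.n_layers):
        resid = cache[f"blocks.{layer}.hook_resid_post"]  # [1, seq_len, d_model]
        X_by_layer[layer].append(resid[0, -1].detach().cpu().numpy())

y = np.array(token_counts)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for layer in range(model.cfg.n_layers):
    X = np.stack(X_by_layer[layer])
    r2 = cross_val_score(Ridge(alpha=10.0), X, y, cv=cv).mean()
    print(f"layer {layer:2d}: token-count probe R^2 = {r2:.2f}")
```

If the probe works well in early layers but degrades by the layers doing the semantic heavy lifting, that would fit the guess above; if it stays strong everywhere, the behavioral failures probably aren’t explained by the information being absent.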