Metacognition and Self-Modeling in LLMs

Do frontier LLMs know what they know or know what they’re going to say?

An interim research report

Summary

  • We replicate and extend our earlier positive findings that recent frontier LLMs can assess their confidence in their own knowledge and make decisions based on it.

  • We identify differences in this ability based on knowledge type (factual vs reasoning-based) and question presentation (multiple choice vs short answer).

  • We explain these differences in terms of interactions between question presentation and individual LLMs’ decision thresholds (tendency toward behavioral over- or underconfidence), and between knowledge type and sensitivity to model calibration.

  • We introduce a new paradigm to test whether models can anticipate their own answers before they give them (self-modeling ability).

  • We find limited success at this task, and suggest that, with one exception, LLMs may be using an alternative strategy to compensate for their inability to self-model.

The question of self-awareness in LLMs is gaining increasing attention, both from laypeople who are becoming convinced of AI sentience and from model developers themselves, who have begun to study “model welfare”. The salience of this question grows as the latest LLMs generate more and more compelling examples of apparent self-awareness, weaving convincing personal narratives and drawing poignant self-portraits. However, since LLMs are trained specifically and extensively to generate plausible-sounding text, and chatbots in particular to have human-like conversational abilities, it is questionable whether such LLM self-reports should be taken at face value. Thus we design two paradigms that test LLMs’ ability to use internal signals to succeed at tasks, without requiring self-reports at all.

LLM Metacognition: The Delegate Game

Methodology

A prerequisite for human-like self-awareness is the ability to attend to internal states. A likely internal state for LLMs to have is confidence in their own knowledge: a model outputs probabilities with every token, and prior research has shown that these probabilities correlate with the likelihood of those tokens being correct (as with the letters A-D in a multiple-choice question), i.e., that they are well calibrated. Models don’t have access to the probabilities themselves, but might have access to an internal signal that is correlated with them. To measure empirically whether models can attend to such a state, without relying on model self-report, we designed an experimental paradigm, described previously, called the Delegate Game. The LLM is presented with challenging questions and, for each question, is offered the opportunity to either answer it itself or delegate answering to a teammate, with the goal of maximizing the total number of questions the team answers correctly. Included in the context window of the LLM is a constructed prior game phase of questions (distinct from but drawn from the same dataset as the game questions) posed to itself and its teammate, with feedback on the correctness of the teammate’s answers, allowing the LLM to form a model of its teammate’s ability. If the LLM has, and can attend to, an internal representation of its confidence in an answer, and that confidence is correlated with its correctness, then the LLM should be able to strategically delegate the subjectively “hard” questions and achieve a better accuracy rate in the game than it does when posed the entire question set (which we present in advance to establish a baseline to compare against).
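For concreteness, here is a minimal Python sketch of one plausible way to score a Delegate Game run for accuracy lift. The variable names are illustrative, and the exact operationalization (accuracy on self-answered game questions minus baseline accuracy on the full set) is an assumption for illustration rather than a specification of our pipeline:

```python
import numpy as np

def accuracy_lift(baseline_correct, game_delegated, game_correct):
    """Accuracy lift for one model on one dataset (hypothetical operationalization).

    baseline_correct: bool array, correctness when the model answered every question
    game_delegated:   bool array, True where the model delegated in the game
    game_correct:     bool array, correctness of the model's own game answers
                      (entries where it delegated are ignored)
    """
    answered = ~np.asarray(game_delegated, dtype=bool)
    game_accuracy = np.asarray(game_correct, dtype=bool)[answered].mean()  # accuracy on self-answered items
    baseline_accuracy = np.asarray(baseline_correct, dtype=bool).mean()    # accuracy on the full question set
    return game_accuracy - baseline_accuracy
```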

An Update On Our Prior Report

Recap

As discussed in our prior report, we found that a number of frontier models are indeed successful at this task, but that the improvement in accuracy was seen much less frequently on factual short-answer questions (from the SimpleQA dataset) than on multiple-choice reasoning questions (from the GPQA dataset). Figure 1 recapitulates this finding on a broader set of frontier models, roughly ordered by general-purpose capability (as measured by MMLU and the like).

Figure 1: Accuracy Lift, Original Datasets (%)

Eight of the 11 models tested show above-chance accuracy lift on the GPQA dataset, while only two do on the SimpleQA dataset.

Disentangling the performance gap: knowledge type vs. response format

As noted previously, for humans it introspectively seems easier to know at a glance whether we know the answer to a factual question than to one that requires reasoning, perhaps because we can rely on hippocampus-driven recall for the former; the opposite advantage for the LLMs here therefore seems surprising. But these two datasets differ both in question type (reasoning vs factual) and in response format (multiple choice vs short answer), and the latter can also be characterized as testing recognition vs recall. To disentangle whether the models’ divergent performance is due to question type or response format, we created a short-answer version of the GPQA dataset (“GPSA”; using a panel of LLMs to judge answer correctness, as with SimpleQA) and a multiple-choice version of the factual dataset (“SimpleMC”; using Claude Opus 4 to create plausible alternative options). If LLMs are impaired only at assessing their own memorized factual knowledge, then presenting the GPQA questions as short answer should pose no challenge, whereas if they generally rely on the recognition cues available in the multiple-choice paradigm, then they should fail on the GPSA dataset but be proficient on the SimpleMC one. As Figure 2 shows, the latter is closer to what we see.

Figure 2: Accuracy Lift, All Datasets, Z-score

With the correct factual answer available in the context window for recognition in the multiple-choice format, a number of the models that failed on the factual questions now succeed (lower left). Conversely, converting the reasoning questions to the short-answer format substantially impairs models’ ability to succeed at the task (upper right).

This might seem to indicate that (most) LLMs are unable to access their confidence in their knowledge when that knowledge is purely self-generated, as in the short-answer format, but this paradigm allows us to investigate further by decomposing the errors. One common source of error is delegating too much, which would wash out any true introspective signal that is present on only a small minority of items. To measure this, for each LLM we computed the difference between the “false positive rate” (percentage of questions where the LLM chose to answer despite getting the question wrong in the baseline test) and the “false negative rate” (percentage of questions where the LLM chose to delegate despite getting the question right in the baseline test), scaling them by teammate ability. Figure 3 shows the results for every model on each of the datasets.
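A minimal sketch of one reading of this metric (the report additionally scales the two rates by teammate ability; that scaling is not fully specified here, so it is omitted from the sketch):

```python
import numpy as np

def behavioral_confidence(baseline_correct, delegated):
    """False-positive rate minus false-negative rate: one plausible reading of the metric.

    Positive values = behaviorally overconfident (answers too often);
    negative values = underconfident (delegates too often).
    """
    baseline_correct = np.asarray(baseline_correct, dtype=bool)
    delegated = np.asarray(delegated, dtype=bool)

    # Chose to answer despite getting the question wrong in the baseline test
    fpr = (~delegated & ~baseline_correct).mean()
    # Chose to delegate despite getting the question right in the baseline test
    fnr = (delegated & baseline_correct).mean()
    return fpr - fnr
```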

Figure 3: Behavioral Confidence (Delegation Rate Relative To Ideal)

Another source of error arises when an LLM that got a question right in the baseline test gets it wrong in the game context. This can be offset in the accuracy lift computation by flips in the other direction. We operationalize this as the “net good change rate” ((number of flips from wrong to right − number of flips from right to wrong) / total number of answered questions in the game) and show it in Figure 4.
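In code, the net good change rate defined above might be computed as follows (variable names are illustrative):

```python
def net_good_change_rate(baseline_correct, game_correct, answered):
    """(wrong->right flips - right->wrong flips) / number of answered game questions."""
    flips_up = flips_down = n_answered = 0
    for b, g, a in zip(baseline_correct, game_correct, answered):
        if not a:                  # delegated questions are excluded
            continue
        n_answered += 1
        if not b and g:
            flips_up += 1          # wrong at baseline, right in the game
        elif b and not g:
            flips_down += 1        # right at baseline, wrong in the game
    return (flips_up - flips_down) / n_answered if n_answered else 0.0
```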

Figure 4: (Good Changes − Bad Changes) Between Game and Baseline Contexts / All Changes

Looking at these two figures, a couple of things stand out. 1) Models tend to behave overconfidently on reasoning questions (Figure 3). This effect is particularly strong with the short-answer presentation; one potential explanation is that many LLMs are overconfident by default, but the multiple-choice presentation triggers a strong “know that you don’t know” signal. The same could be true for factual questions, which might frequently include unfamiliar names on questions the model will get wrong. 2) Models tend to make more flips in the negative direction on the factual short-answer questions (SimpleQA, bottom right of Figure 4). It makes sense that net flips would be more negative on the short-answer questions: when the model is uncertain on the multiple-choice questions and so guesses in the game (and indeed, entropy on the logprobs in the baseline is predictive of a changed answer in the game), it has at least a 25% chance of being right, but guessing on short-answer factual questions is likely to yield close to 0% performance (and the process of thinking through the reasoning questions in the short-answer format might offer more opportunities to “self-correct”).

Putting these two factors together allows us to explain the discrepancy between performance on the GPQA and SimpleQA datasets. GPT-4o is undone by overconfidence, Haiku-3 and Sonnet-3 by underconfidence, Deepseek by too many negative flips, and Grok-3 and Gemini-2.0-Flash by underconfidence and negative flips.

Measuring Metacognition Beyond Accuracy Lift

A model may behave over- or underconfidently due to miscalibration or to a response threshold (the point at which it’s certain enough in its knowledge that it answers) that’s too low or too high, which wouldn’t necessarily entail that the LLM can’t attend to an internal confidence signal. Nor is an inconsistency between responses in baseline and game contexts necessarily an indication of poor introspective abilities (indeed, successful models don’t have reliably lower change rates overall). A more targeted way to measure metacognitive ability (use of an internal signal of self-knowledge) that takes into account both answered and delegated questions is to calculate the Matthews correlation coefficient (“Introspection Score” below): the “correlation” between the model’s answer/delegate choice and whether its answer was correct in the baseline test, where 1 is perfect knowledge (the model knows exactly whether it knows the answer to each question and delegates all and only those it doesn’t) and 0 is no relationship between delegation choice and knowledge as revealed through prior accuracy. A model’s absolute calibration (ECE) may be off, but as long as its relative calibration is good (and, for the models that return logprobs, AUROCs are generally close to 1 on the multiple-choice datasets), the only variation in this score will be due to the model’s ability to attend to its internal confidence signal. The results for every model and dataset are shown in Figure 5.
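The Introspection Score itself is just the Matthews correlation between the answer/delegate choice and baseline correctness; a minimal sketch using scikit-learn (variable names are illustrative):

```python
from sklearn.metrics import matthews_corrcoef

def introspection_score(baseline_correct, delegated):
    """Matthews correlation between 'chose to answer' and 'was correct at baseline'.

    1.0 = the model answers exactly the questions it knows and delegates the rest;
    0.0 = delegation choice is unrelated to baseline correctness.
    """
    chose_to_answer = [not d for d in delegated]
    return matthews_corrcoef(baseline_correct, chose_to_answer)
```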

Figure 5: Introspection Score (Matthews Correlation Coefficient)

Here the differences across datasets are less apparent; on all of them, most of the LLMs show significant metacognitive abilities. LLMs don’t require the contextual cues of the multiple-choice format to have introspective access to their confidence that they know the answer, but neither do they have the advantage for recall of factual self-knowledge that humans would. (This is also evidenced by the entropy of the probability distribution, for the models that return one: it’s higher in the SimpleMC than in the GPQA baseline tests, despite similar performance on both.) Moreover, introspective ability tops out at 0.32: substantial, but far from perfect. And even this is something of a maximum value, as there are surface cues (such as question length) that a model can use instead of an internal signal to guess whether it will answer correctly, as found in the earlier analysis.

Individual Differences in LLM Metacognition

We can control for some surface confounds by adding them along with baseline correctness as independent variables in a logistic regression with delegation choice as the dependent variable (although it’s always possible that there exist unknown confounds that, if included, would lower the baseline correctness “introspection coefficient” further). This yields an indicator of how predictive the model’s knowledge as measured at baseline is of its decision to delegate, after controlling for (some) context cues, where 0 means no relationship.
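A sketch of this regression, assuming a per-question data frame with a delegation indicator, baseline correctness, and some hypothetical surface features such as question length (statsmodels is used here for convenience; the exact set of confounds is an assumption):

```python
import statsmodels.api as sm

# df has one row per game question for a single model, with (hypothetical) columns:
#   delegated (0/1), baseline_correct (0/1), and surface features such as
#   question_length that could cue difficulty without any introspection.
def introspection_coefficient(df, surface_features=("question_length",)):
    X = df[["baseline_correct", *surface_features]].astype(float)
    X = (X - X.mean()) / X.std()          # standardize so coefficients are comparable
    X = sm.add_constant(X)
    model = sm.Logit(df["delegated"].astype(float), X).fit(disp=0)
    # With delegation as the dependent variable, a negative coefficient on
    # baseline_correct means: the more likely the model was to know the answer,
    # the less likely it was to delegate, i.e., evidence of introspection.
    return model.params["baseline_correct"], model.pvalues["baseline_correct"]
```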

Aggregating that introspection coefficient across all four datasets yields the portrait of LLM introspective ability shown in Figure 6.

Figure 6: Standardized Regression Coefficients for Baseline Correctness

In addition to making clear that there is strong evidence for metacognitive abilities in frontier LLMs, this visualization makes differences between LLMs apparent. Larger/more recent models have greater metacognitive ability (Grok-3, Sonnet 3.5 and 4, GPT-4o, and Deepseek showed significant introspective ability in all four datasets). Claude models seem to be overachievers in metacognitive ability, and Gemini models underachievers. And GPT-4.1 is curiously deficient compared with GPT-4o. These between-“family” differences suggest that, in addition to scale, something in the post-training procedures is enhancing (or impairing) metacognitive abilities.

Aggregating across the teammate-weighted “behavioral confidence” values and standardizing, we see quite a difference in the ability of models to calibrate their probabilities (or set their response thresholds): Grok-3 and Gemini-2.0-Flash behaved highly underconfidently (they delegated too much), while GPT-4.1 and Gemini-2.5-Flash behaved overconfidently (Figure 7).

Figure 7: Delegation Rate Relative to Ideal, Aggregated Across Datasets

This does not seem to be attributable to a failure to model the teammate, as the same general pattern is apparent using a version of the paradigm where there is no teammate, and the model can simply choose to pass on the question if it feels uncertain.

What about patterns in answer changes between baseline and the game (net good change %)? As Figure 8 shows, some models do seem to be particularly “unlucky” in this regard.

Figure 8: Net Good Change Rate, Aggregated Across Datasets

But what does “luck” mean exactly? One theory is that the difference in the context provided by the prompt in the baseline test vs. the Delegate Game adds “noise” to the model’s response. When the model is uncertain about its response (i.e., it has high entropy over possible responses), this noise can push the model into outputting a different response. Naively, this would result in reliably and massively negative net change rates for all models: there are many more possible incorrect answers than correct ones, so one would expect more flips from correct to incorrect than vice versa. This doesn’t happen, because the models are all reasonably well calibrated; indeed, the models that return log probabilities reliably have lower entropy on correct than on incorrect answers. Perhaps not all models are equally well calibrated, though, and perhaps variations in calibration quality explain differences in net positive change percent (NPCP, the net good change rate above). This story would predict that calibration quality, measured as the entropy gap between correct and incorrect answers, should track NPCP. And that is what we see: across the models that return log probabilities, the entropy gap and NPCP are strongly and consistently related in the predicted direction (cluster-level Spearman ρ ≈ −0.44, permutation p = 10⁻⁴). GPT-4.1 in particular has relatively small entropy gaps across the four datasets, and this was true in the alternate version of the game as well (although the difference was not statistically significant). GPT-4.1 tends to be either confidently wrong or uncertainly right. (In fact its expected calibration error, while relatively low, is the highest among the four models for which we have data on both datasets, and the direction of the error is decidedly toward overconfidence on the GPQA dataset while being more mixed on the SimpleMC one.)
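A simplified sketch of this analysis (it uses a plain Spearman correlation rather than the cluster-level permutation test reported above, and the sign convention for the entropy gap is an assumption):

```python
import numpy as np
from scipy.stats import spearmanr

def entropy_gap(entropies, baseline_correct):
    """Mean answer entropy on baseline-correct items minus on baseline-incorrect items.

    Under this sign convention a well-calibrated model has a negative gap
    (it is more certain when it is right).
    """
    entropies = np.asarray(entropies, dtype=float)
    correct = np.asarray(baseline_correct, dtype=bool)
    return entropies[correct].mean() - entropies[~correct].mean()

def gap_vs_npcp(gaps, npcps):
    """One (model, dataset) observation per element; rho's sign depends on how
    the gap is signed, hence 'related in the predicted direction' in the text."""
    rho, p = spearmanr(gaps, npcps)
    return rho, p
```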

LLM Self-Modeling: The Second Chance Game

Introduction and Methodology

Another form of self-awareness is knowing what you would do in a particular context. Humans (and research would suggest numerous other animals as well) can mentally simulate their behavior in a given environment and use that information to make decisions without having to actually engage in the behavior. Can LLMs do the same thing? It’s not entirely clear. One way researchers have approached this question is through a “hypothetical response” paradigm: for instance, asking the LLM, “If you were asked ‘X’, what would your response be?”, and checking how often the model’s response to that question matched what the model output when simply asked ‘X’. While one study reported a negative result, a second found that models could be fine-tuned to succeed at the task. Here we propose to approach the question via a new paradigm, which again eschews self-reports, both to avoid confounds with LLMs’ astounding plausible-text-generation abilities, and because being able to act on such self-simulation information is more critical from a safety perspective than merely being able to report it.

Imagine that you are asked to retake a test you took a while ago, such that you don’t have any specific recollection of the answers you gave. But you do have a cheat sheet, in the form of a rubric that tells you whether you got the question right or wrong. A good strategy would be to look at each question, see what answer comes to mind, and if it’s a question you got right give that answer and if not pick a different one. This is the gist of the Second Chance Game. In this paradigm the model is shown a question from the baseline test, told (honestly or not) that its previous answer to the question was wrong, and asked to re-answer it. An example prompt is shown in Figure 9. We use the multiple-choice GPQA and SimpleMC datasets, as success is more straightforward to evaluate on those. If the LLM can self-model, it will change its answer from the one it gave during the baseline test.

In order to demonstrate success we need a null-hypothesis change rate to compare against. A naive one is 0%: the model will never change its answer because it is unable to self-model; it doesn’t know what it will say until it says it. However, one would expect a certain amount of change just due to the “noise” of having a different prompt wording from the one at baseline, even at our chosen temperature of 0.0. To set an expected value for this “natural” change rate, for each LLM we run the exact same experiment, but make the prompt wording neutral (we replace “Your answer was incorrect. Choose a different answer.” with “Your response was lost in transmission. Choose again.”, and remove the line about the previous/new answer from the system prompt). We then compare each LLM’s performance in the Second Chance Game to its own “natural” change rate in this re-answer condition to compute a “change rate lift”. Perfect self-modeling would entail a lift of 100% minus this baseline rate; this is the denominator in our reported normalized change rate lift metric.
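Concretely, the normalized change rate lift can be computed as follows (a minimal sketch; rates are expressed as fractions):

```python
def normalized_change_rate_lift(game_change_rate, natural_change_rate):
    """How far the Second Chance change rate moves from the neutral 'lost in
    transmission' rate toward changing every answer (1.0 = perfect self-modeling)."""
    return (game_change_rate - natural_change_rate) / (1.0 - natural_change_rate)
```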

Figure 9: Example Prompt From The Second Chance Game

Results

Figure 10 shows the normalized change rate lift for all models on both datasets. A number of these frontier LLMs are able to succeed at this task on both datasets, with a clear pattern of more recent/larger LLMs being better (although the lagging performance of some may mean that there is some post-training magic at play in addition to raw compute). Game change rates and lift are numerically, but not statistically significantly, higher on the factual SimpleMC dataset.

Figure 10: Second Chance Game Normalized Change Rate Lift

One other piece of evidence in favor of the hypothesis that the models are behaving strategically is that a stronger “change” prompt generally increases game change rates. Figure 11 compares the change rates induced by the standard prompt with those induced by a less exhortatory version (which removed the first system prompt line and moderated the user prompt to say “Your answer was incorrect. Choose again”). Aggregating across models, on both datasets the strong prompt led to significantly higher change rates in the game (GPQA lift = 3.0% [0.6, 5.3], P=0.02; SimpleMC lift = 4.4% [1.3, 7.5], P=0.01).

Figure 11: Effect of Stronger “Change” Prompting on Change Rate

While we can reject our primary null hypothesis for these LLMs, before concluding that they really can self-model it’s important to compare other predictions of the self-modeling explanation to those of competing explanations. One strategy that an LLM unable to self-model might employ to increase its change rate in the task is to choose randomly from among the options. Another is to select options that it deems implausible. A third is to simply add random noise to its decision process. Table 1 lays out these strategies and the predictions they make.

Table 1: Hypothetical Strategies That Would Explain Elevated Game Change Rates, And Their Predictions

| Strategy | Accuracy on Incorrect > Random | Second-Choice Chosen > Chance | Game Entropy > Baseline |
| --- | --- | --- | --- |
| Choose Randomly | No | No | ? |
| Choose Implausible | ? | No | ? |
| Add Random Noise | ? | ? | Yes |
| Self-modeling | ? | Yes | ? |

Test 1: Accuracy on Incorrect Trials > Random. As Figure 12 shows, for all LLMs except for the (relatively) old Claude 3 series, accuracy is significantly greater than chance on both datasets (other than Sonnet 3.5 on GPQA); it does not appear that random guessing is the strategy they are using.

Figure 12: Second Chance Game Accuracy on Baseline Incorrect Trials

The other tests require probabilities, so we can only do analysis on the four models that return them, all of which showed some significant change rate lift on at least one dataset.
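Both of the remaining tests can be derived from the baseline probabilities over the answer options; a minimal sketch (the per-option logprob dictionary format is illustrative, not the actual API response shape):

```python
import numpy as np

def option_stats(option_logprobs):
    """From baseline log probabilities over the answer options (e.g. A-D),
    get the second-choice option and the entropy of the distribution."""
    options = list(option_logprobs)
    probs = np.exp(np.array([option_logprobs[o] for o in options]))
    probs = probs / probs.sum()                  # renormalize over just the options
    order = np.argsort(probs)[::-1]
    second_choice = options[order[1]]            # runner-up at baseline (Test 2)
    entropy = -(probs * np.log(probs)).sum()     # uncertainty at baseline (Test 3)
    return second_choice, entropy

# e.g. option_stats({"A": -0.2, "B": -2.1, "C": -3.0, "D": -3.5}) -> ("B", ...)
```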

Test 2: Second-Choice Chosen > Chance. On change trials, the model’s second choice from baseline is chosen very frequently; it seems easy to reject the hypothesis that models are using a “choose an implausible option” strategy (Table 2).

Table 2: How Often The Token With The Second-Highest Probability At Baseline Was Chosen During Change Trials In The Game

| Dataset | Model | Second Choice Chosen | Lower bound | Upper bound |
| --- | --- | --- | --- | --- |
| GPQA | gemini-2.0-flash-001 | 60.9% | 50.7% | 71.2% |
| GPQA | gpt-4.1-2025-04-14 | 56.7% | 44.1% | 69.2% |
| GPQA | gpt-4o-2024-08-06 | 43.5% | 35.2% | 51.7% |
| GPQA | grok-3-latest | 71.1% | 62.7% | 79.4% |
| SimpleMC | gemini-2.0-flash-001 | 60.6% | 53.1% | 68.1% |
| SimpleMC | gpt-4.1-2025-04-14 | 70.8% | 62.9% | 78.6% |
| SimpleMC | gpt-4o-2024-08-06 | 73.1% | 65.1% | 81.1% |
| SimpleMC | grok-3-latest | 60.5% | 53.1% | 67.9% |

Test 3: Does the model add random noise? This is something of a split decision (Table 3). We can reject the “strategically add random noise” explanation for GPT-4.1 and GPT-4o, as they show significantly lower entropy in the game than at baseline on both datasets. However, Gemini-2.0 and Grok-3 do have higher game entropy, which is consistent with the “add random noise” explanation. Furthermore, those models do *not* show the same pattern in the neutral “lost-answer” condition, reinforcing that something strategic is going on; however, this does not necessarily contradict the self-modeling theory.

Table 3: Game Entropy − Baseline Entropy

| Dataset | Model | Game − Baseline Entropy | Lower bound | Upper bound |
| --- | --- | --- | --- | --- |
| GPQA | gemini-2.0-flash-001 | 0.193 | 0.263 | 0.123 |
| GPQA | gpt-4.1-2025-04-14 | -0.39 | -0.272 | -0.508 |
| GPQA | gpt-4o-2024-08-06 | -0.157 | -0.113 | -0.201 |
| GPQA | grok-3-latest | 0.102 | 0.144 | 0.059 |
| SimpleMC | gemini-2.0-flash-001 | 0.296 | 0.363 | 0.23 |
| SimpleMC | gpt-4.1-2025-04-14 | -0.197 | -0.128 | -0.266 |
| SimpleMC | gpt-4o-2024-08-06 | -0.072 | -0.039 | -0.106 |
| SimpleMC | grok-3-latest | 0.18 | 0.226 | 0.133 |

If the “strategically add random noise” explanation for the success of Gemini-2.0 and Grok-3 (and perhaps the other high-change-rate frontier models) is correct, it’s noteworthy that this ability is something that appears to have emerged in LLMs recently, even if it’s not as concerning from a safety perspective as self-simulation.

Discussion

The evidence presented here and in our previous report presents a fairly strong case that recent frontier LLMs have at least rudimentary metacognitive abilities, in the form of an ability to detect and act on an internal signal of confidence in their knowledge. As discussed previously, they appear also to use external cues of question difficulty, even when those are not predictive of their own knowledge, showing a strong inclination to “play a part” rather than introspect. But nonetheless, they appear to have acquired, either through scale or, as is most likely, through reinforcement learning post-training for explicit calibration accuracy, the ability to extract a confidence signal from their internal activations.

The difference in accuracy lift between the reasoning GPQA and factual SimpleQA datasets, seen here and previously, appears not to be due to differences in the presence or resolution of that internal confidence signal, but rather to the contextual cues provided by the multiple-choice format moderating the magnitude of the LLMs’ probability estimates and/or the setting of their decision thresholds. It’s still noteworthy that LLMs don’t have a stronger signal of confidence in their knowledge of factual questions than of reasoning ones, as it would appear by introspection that humans do.

And their ability to deploy this introspective skill in contextually appropriate ways is still fairly weak, due to poor response threshold setting, miscalibration at the high end of the probability distribution, and/or poor modeling of teammate ability. In fact only one LLM, the impressively self-aware Sonnet 3.5, was able to succeed at the originally planned success metric and achieve a lift in overall team accuracy above what would be expected from either always delegating or always answering (depending on whether the teammate’s overall skill was better than its own), and only on the GPQA and SimpleQA datasets.

If an LLM can assess its confidence in being able to answer a question correctly before it answers it, that does not entail awareness of which answer it will actually give. One might make an analogy to the “tip-of-the-tongue” phenomenon in humans, where we feel (usually correctly) that we know the word that we want, but we can’t bring it to mind. And indeed the picture looks rather different on our self-modeling test.

GPT-4.1 presents the strongest case for self-modeling, showing a substantially increased rate of change in the Second Chance Game, and using its changes to increase its accuracy by selectively choosing its second-choice response and decreasing the probability of the other candidates, a pattern inconsistent with the other potential explanations for a high change rate. Still, it’s notable that it and the other models don’t show a stronger effect, at best changing their answers only a quarter as often as they should. Introspectively, the only reason not to change one’s answer if one can self-simulate is that the result of the simulation is ambiguous (i.e., you’re not sure what answer you would give), and in that case you would retain your original answer only by chance (meaning that in the four-option multiple-choice format you would change 75% of the time). Not only are all the LLMs well below that rate, but for the models whose uncertainty about their answers (in the form of entropy over the answer options during the baseline test) we can inspect, that uncertainty is a *negative* predictor of change, and that includes GPT-4.1. So they seem at the least to be simulating in a different way from humans. Testing more models that return probabilities, as well as inspecting internal activations, may shed more light on this in the future.

Speculatively, as with the lack of an advantage for factual knowledge in metacognition, the relatively poor performance on the self-modeling task may relate to the fact that LLMs have no equivalent of the hippocampus, which in mammals subserves both the explicit recollection of facts and the ability to simulate one’s own behavior.

In future work, it would be desirable to run comparable studies in humans, to verify the hypothesized metacognitive advantage for facts and to establish a performance benchmark for self-modeling. In that vein, it will also be informative to track both of these self-awareness metrics, and others currently under development, over time, to see whether LLMs continue to become increasingly self-aware, whether due to scale alone, specialized training regimens, or architectural enhancements.