Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities
1. Summary and overview
LLMs seem to lack metacognitive skills that help humans catch errors. Improvements to those skills might be net positive for alignment, despite improving capabilities in new directions.
Better metacognition would reduce LLM errors by catching mistakes, and by managing complex cognition to produce better answers in the first place. This could stabilize or regularize alignment, allowing systems to avoid actions they would not “endorse on reflection” (in some functional sense).[1] Better metacognition could also make LLM systems useful for clarifying the conceptual problems of alignment. It would reduce sycophancy, and help LLMs organize the complex thinking necessary for clarifying claims and cruxes in the literature.
Without such improvements, collaborating with LLM systems on alignment research could be the median doom-path: slop, not scheming. They are sycophantic, agreeing with their users too much, and produce compelling-but-erroneous “slop”. Human brains produce slop and sycophancy, too, but we have metacognitive skills, mechanisms, and strategies to catch those errors. Considering our metacognitive skills gives some insight into how they might be developed for LLMs, and how they might help with alignment (§6, §7).
I’m not advocating for this. I’m noting that work is underway, noting the potential for capability gains, and noting the possibility that the benefits for alignment outweigh the danger from capability improvements. I’m writing about this because I think plans for alignment work should take these possibilities into account.[2]
I’ll elaborate on all of that in turn.
I hypothesize that metacognitive skills constitute a major part of the “dark matter of intelligence”[3] that separates LLMs and LLM agents from human-level competence. I (along with many others) have spent a lot of time wondering why LLMs appear so intelligent in some contexts, but wildly incompetent in others. I now think metacognitive skills are a major part of the answer,[4] and their role is mostly (although not entirely) overlooked. I think it’s overlooked because these skills are largely automatized and so non-conscious, much like an expert skier can’t identify most of the component sensorimotor and cognitive skills that comprise their expertise.
I address metacognitive skills along with two related concepts: specialized neural mechanisms and explicit metacognitive strategies. Considering the full range provides a better intuition for how they may be helpful for humans and how they might be implemented or trained for LLMs.
Metacognitive skills: skills for managing and evaluating our own cognition.
Metacognitive neural mechanisms: mechanisms for detecting uncertainty; similar signals exist in LLMs (§5).
Metacognitive strategies: explicit strategies, like saying, writing, or thinking "I should look for errors in my setup for math story problems." These sit at the opposite end of a continuum from fully automated metacognitive skills, and could substitute for human-like fluent skills if LLM systems think faster or more cheaply.
Here I am often compressing all of these into just “metacognitive skills” for brevity, but it’s worth considering each individually. More in the next section §2.
One recent study provides strong evidence for what I’d suspected: reasoning LLMs still do less and worse metacognition than humans, and this leads to long and inefficient chains of thought (§4).
There is a nontrivial amount of empirical work on improving LLMs’ cognition through training, scaffolding, multiple systems, and prompting (§5). I discussed some of these and other likely approaches in more depth in System 2 Alignment: Deliberation, Review, and Thought Management. Given the potential of those approaches, it seems likely that the metacognitive gap will be narrowed or closed in near-future LLMs.
There are two alignment payoffs for better metacognition. I discuss deconfusion help on alignment research in §6 and alignment stability and regularization in §7.
Of course better metacognition for error-catching would also improve general capabilities and accelerate progress toward recursively self-improving AI.[2]
Elaborations and evidence follow. I’ll keep it fairly succinct. The sections can be read in any order without much loss, which has created a little redundancy.
2. Human metacognitive skills and why we don’t notice them
Metacognition is cognition-about-cognition. This topic is explored in cognitive psychology and neuroscience, but not thoroughly or systematically, particularly for complex cognition. The importance of metacognitive skills for complex thought has been part of my thinking, and to a lesser extent my research, since I read Daniel Dennett on “microhabits of thought” 25 years ago. I now think it’s a big part of why LLM agents are so incompetent despite LLMs being so smart in some ways.
Here are just a few examples of human metacognition; I suspect there are many, many more.
Occasionally asking where you’re at in a complex question and what you should think about next
Before switching topics, spending a moment trying to remember provisional conclusions and points of uncertainty
Steelmanning the case against your favored conclusions
Estimating a conclusion’s importance before deciding to accept it and move on
Much of the skill in each of these is remembering to do it in the appropriate context.
Hearing the phrase “what’s three plus negative five?” and responding “negative two” from memory is a cognitive skill. So is recalling the algorithms for working out the answer, and going through that chain of thought. Thinking “better double-check that logic before answering” is metacognition; it is about your own thinking. Thinking that consistently when it’s appropriate is a metacognitive skill.
If such a thought is explicit, it’s what I’m calling a metacognitive strategy. With repetition, those thoughts become more automatic. They become faster and more compressed, and therefore harder to notice and think about. Such automatic responses probably make up most of our metacognitive skills. Stopping to search memory for a strategy is a learned skill that results in part from brain mechanisms particularly suited for learning that skill. I describe these briefly in the next section.
I think we’re not aware of the importance and prevalence of metacognitive skills because they’re mostly automatic and therefore hard to notice. They are probably more idiosyncratic and personal than other skills like acrobatics or writing. They’re harder to talk about or teach in part because they’re less visible. This also contributes to our not thinking about them much.
There’s no sharp line between explicit strategies and automatic skills; automaticity or habitization happens with repetition, so any given skill is somewhere on a spectrum between fully deliberate/explicit and fully habitual/automatic. I think we’ve learned a bunch of important metacognitive skills, but automated them and so forgotten what they are—just like we’ve forgotten all of the many strategies we thought about while developing other skills, like athletic performance. The difference is that we can more easily see and so discuss and teach skills that are on display outside of our own heads.
Metacognitive skills may range very broadly. The psychological literatures I’ve found attempt only coarse categorizations (see the empirical evidence below for an example), and have no methodology for identifying or studying finer-grained skills.
2.1. Brain mechanisms for metacognition
Humans have specific brain mechanisms that aid with our metacognitive skills. For instance, much has been made in the neuroscience literature of signals of conflict and errors in the anterior cingulate cortex and other brain regions. My master’s thesis touched on this tangentially in 2003, and my work since then has dealt with it in different ways. These brain mechanisms have specific anatomical and evolutionary origins. But in sum I think the conflict-detection mechanisms studied in neuroscience work pretty similarly to training a classifier on the underlying complex representations, as in the Zhang et al., 2025 study I review below.
The brain mechanisms for metacognition that we know about seem to arise from the same RL mechanisms that learn motor actions. They teach the brain to stop and try other strategies before taking an action we’ve learned is wrong in important ways.
There are also specific circuits in the anterior cingulate that learn to measure physical effort relative to predicted reward and punishment. Similar circuits may be learning to register mental effort, which is important for wisely allocating mental time where it’s most useful. All of these seem to be particular applications of the brain’s dopamine-centered reinforcement learning (RL) process, using inputs selected by evolution to guide their priors in useful ways.
One key ingredient for metacognition is “stopping to think” at appropriate points. There are RL mechanisms that learn to stop physical or mental actions that predict negative value. These mechanisms center on the indirect pathway of the basal ganglia and the surrounding dopamine reward prediction circuitry. See my paper Neural mechanisms of human decision-making and the many refs there for way more than you wanted to know.
I don’t think the details are terribly relevant, although looking at them more closely could probably provide inspiration for approaches in LLM training and scaffolding for similar purposes. I won’t dig in further here.
Work reviewed in §5 explores some routes of adding similar mechanisms to LLMs. Some of it emulates these specific brain mechanisms; some focuses on training for similar responses; and some uses explicit scaffolding. I think these and other straightforward routes seem likely to work, at least to some degree. I elaborated on some likely-seeming mechanisms in System 2 Alignment.
In sum, nobody knows how many metacognitive skills humans have, exactly how they’re learned, or how important they are to cognition. I’d guess there are many, and they’re quite important. And I think LLMs have fewer and worse ones, and that this probably plays a big role in why they produce (even) more slop and errors than humans.
3. Why we might expect LLMs’ metacognitive skills to lag humans’
First I’ll say why we might expect this on priors, then in the next section we’ll review the evidence from the one study that directly compares human and LLM metacognition.
LLMs sure seem to lack metacognitive skills. LLMs’ responses seem overconfident relative to humans with similar knowledge. This seems to cause a lot of problems for their thinking, and indirectly, for the thinking of the humans they’re collaborating with. Their responses and lines of thinking seem (at least to me) to resemble humans who don’t bother to check their logic unless someone tells them to. This is highly subjective, so I’m not basing much on it. To me, lack of metacognitive skills (and memory) seems to explain much of why some careful thinkers like Kaj Sotala and Thane Ruthenis are skeptical that LLMs will reach AGI any time soon on the current progression.
I mentioned above that metacognitive skills might be only weakly implicit in the text corpus, and so harder to learn with LLM training methods relative to humans. I’ll go into this just a little more, but it’s not critical, so skip ahead if you like.
Semantics and grammar are strongly implicit in text corpora, so LLMs master these first. Reasoning is weakly implicit, at a second level of remove from the word choices. And managing and organizing that reasoning is more weakly implicit still. Some texts describe the rules of reasoning; fewer describe the rules of thinking-about-thinking. Fewer still describe the skills of metacognition or thought management themselves. RL training on tasks that demand metacognition should help, but RL might do more to select skills from the supervised pretraining than to build them.[5] This would reduce its effectiveness for building metacognitive skills.
Humans’ active, self-directed learning might work better for developing metacognitive skills with limited external feedback. Our style of self-directed continual learning allows us to form hypotheses, test them, then turn those hypotheses into skills through self-directed practice. This could be a substantial advantage in learning metacognitive skills among other performative skills. I review these ideas in LLM AGI will have memory. In sum, efforts are underway, and even modest improvements could increase the pace of LLM improvements. But this speculation is only tangentially relevant to whether LLMs currently lag humans disproportionately in metacognitive skills.
There’s just one study that explicitly compares human and reasoning LLM metacognition.
4. Evidence that LLM metacognition lags humans’
Cognitive Foundations for Reasoning and Their Manifestation in LLMs (Kargupta et al., Nov. ’25) is a cross-disciplinary effort between ML and cognitive psychology authors. They analyzed human think-aloud protocols, and reasoning traces from 18 models on the same problems, looking for different types of reasoning, including metacognition. What they found supports what I’d been thinking: humans have better strategies and skills for organizing their thoughts and finding their errors, and we do more of it.
When they compare human think-aloud and LLM CoT, humans spend much more time thinking strategically about their thinking. LLMs seem to have metacognitive behaviors in their repertoire but fail to deploy them spontaneously and adaptively. To me, this strongly suggests an overhang and low-hanging fruit for improved capabilities. That’s a big part of why I’m writing this now.
They report that, as problems become less structured, models narrow their cognitive strategies when they should diversify. They “resort to surface level reiteration and enumeration” and “fail at learning from previous verifications”—often repeating checks on claims they’ve already verified.[4] They say humans are “quicker to invoke conceptual processing and abstraction… leading to significantly shorter reasoning traces.”
This study divides types of metacognition into the following categories. I’ll list these as evocative of some varieties of metacognition, rather than authoritative; research on metacognition in expert performance is thin.
Self-awareness—detects capabilities and limitations.
Context awareness—identifies situational demands.
Strategy selection—responds by choosing appropriate approaches.
Goal management—directs the response through structured sub-goals.
Evaluation—monitors progress and triggers adaptation when needed.
The paper also found that scaffolding can help substantially on some problems and for some models—but almost as often it backfired and reduced performance. And that was with directed prompting: they prompted models to do the type of cognition that most helped on that task, inserting it into both successful and failed traces. The weaker models tended to be hurt more often. But they didn’t do this on DeepSeek R1, the strongest model they tested (probably due to sharp academic budget limitations), so there’s no clear evidence on whether such scaffolding/prompting strategies hold more or less promise for SOTA and future models.
There’s plenty of speculation in the ML literature that LLMs lack metacognitive skills. Other studies show indirect evidence in that direction.
Reasoning models actually seem worse than older models at one type of metacognition: recognizing they don’t know an answer. AbstentionBench evaluates several reasoning-tuned models and finds that they often perform worse than non-reasoning models at abstaining or asking for clarification on unanswerable or underspecified questions (Kirichenko et al., 2025). In multiple cases, models express uncertainty in their reasoning trace while still producing a confident final answer. This suggests that uncertainty-related signals are not consistently governing action selection, and that reasoning training can even harm metacognitive skills if it’s not specifically directed at improving them.
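As a crude illustration of what letting uncertainty govern action selection could look like at the scaffolding level (my sketch, not AbstentionBench’s method), a wrapper could scan the trace for hedging language and route hedged-but-confidently-answered cases to abstention. The phrase list and function names are illustrative assumptions.

```python
# Crude illustrative sketch (not AbstentionBench's method): flag cases where the
# reasoning trace hedges but the final answer doesn't, i.e. uncertainty that never
# reached action selection, and route them to abstention. Phrase lists are placeholders.
HEDGES = ("not sure", "unclear", "ambiguous", "can't tell", "underspecified", "might be")

def uncertainty_ignored(trace: str, final_answer: str) -> bool:
    trace_hedges = any(h in trace.lower() for h in HEDGES)
    answer_hedges = any(h in final_answer.lower() for h in HEDGES) \
        or "i don't know" in final_answer.lower()
    return trace_hedges and not answer_hedges

def maybe_abstain(trace: str, final_answer: str) -> str:
    # A wrapper could abstain or ask for clarification instead of answering confidently.
    if uncertainty_ignored(trace, final_answer):
        return "I'm not confident enough to answer; could you clarify the question?"
    return final_answer
```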
5. Current approaches to improving metacognition in reasoning models
Other work on metacognition in reasoning models is consistent with the conclusions of the Kargupta et al. study, and it shows some other possible routes to improving LLMs’ metacognition. I’m leaving out earlier attempts like tree-of-thought (ways of scaffolding in some particular thinking strategies), since those have largely been eclipsed by RL-trained reasoning models. The evidence on improving metacognition in reasoning models is suggestive but doesn’t convincingly demonstrate how well those approaches will work relative to just continuing to scale. But I suspect there is some low-hanging fruit to be plucked by even modest specific focus on this area.
Here’s a quick summary of the most relevant work I’ve found.
Related work suggests that metacognitive signals are present, but weakly used in early open-source reasoning models. Training linear classifiers reveals representations that correlate with correctness and can be exploited by external controllers to reduce token use without degrading accuracy (Zhang et al., 2025). This information is fairly robust, but does not generalize all that well among topics. Humans’ metacognitive abilities seem to vary with their expertise on a given topic. These signals might be fairly analogous to the conflict and error signals in the brain. Evidence that models have but don’t fully use these signals is one of the strongest indicators that there’s low-hanging fruit to be exploited.
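To make the probe idea concrete, here is a minimal sketch in the spirit of that study (not their exact setup): fit a logistic-regression probe on per-step hidden states labeled by eventual correctness, and read its score as a cheap confidence signal an external controller could use. The data-collection step and all names here are my illustrative assumptions.

```python
# Hypothetical sketch of a linear "correctness probe" on hidden states, in the spirit
# of probe-based studies like Zhang et al. (2025). Assumes per-step hidden-state
# vectors from reasoning traces have already been collected; names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_correctness_probe(hidden_states: np.ndarray, was_correct: np.ndarray):
    """hidden_states: (n_steps, d_model) activations from reasoning traces.
    was_correct: (n_steps,) 0/1 labels, e.g. whether the trace's final answer was right."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, was_correct, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print("held-out accuracy:", probe.score(X_test, y_test))
    return probe

def confidence_signal(probe, step_hidden_state: np.ndarray) -> float:
    """Cheap per-step 'likely correct' score a controller could use, e.g. to stop
    extending the chain of thought once confidence is high."""
    return float(probe.predict_proba(step_hidden_state.reshape(1, -1))[0, 1])
```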
Several groups have attempted to compensate for the metacognitive gap using explicit scaffolding. Meta-R1 introduces a two-level architecture in which a separate meta-process plans, monitors, and enforces stopping behavior for a reasoning model (Dong et al., Aug 2025). This improves efficiency and sometimes accuracy, by treating metacognition as an architectural add-on rather than a skill the base model deploys automatically.
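A toy sketch of the general two-level pattern (not Meta-R1’s actual implementation): a meta-process watches the object-level reasoning as it accumulates and decides whether to continue, redirect, or stop. generate_step and meta_review are hypothetical stand-ins for model calls.

```python
# Toy sketch of a two-level "meta-process over a reasoner" loop, in the spirit of
# scaffolds like Meta-R1. generate_step() and meta_review() stand in for calls to an
# object-level reasoning model and a separate monitoring model; both are placeholders.
from typing import Callable, List

def metacognitive_loop(problem: str,
                       generate_step: Callable[[str, List[str]], str],
                       meta_review: Callable[[str, List[str]], str],
                       max_steps: int = 20) -> List[str]:
    trace: List[str] = []
    for _ in range(max_steps):
        step = generate_step(problem, trace)   # object-level reasoning step
        trace.append(step)
        verdict = meta_review(problem, trace)  # meta-level: "continue", "redirect: ...", or "stop"
        if verdict == "stop":                  # enforced stopping behavior
            break
        if verdict.startswith("redirect:"):    # inject a corrective instruction into the trace
            trace.append("[meta] " + verdict[len("redirect:"):].strip())
    return trace
```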
SSR: Socratic Self-Refine for Large Language Model Reasoning (Nov. ’25) is another scaffolding-style method: the model iteratively refines its own solution, but with a structured “Socratic” question-answer decomposition and checking of each step rather than a freeform “try again.” They used hard math-reasoning settings, including a text-only math subset of Humanity’s Last Exam (HLE). They report that SSR beats both plain Chain-of-Thought and a Self-Refine baseline across model scales, including on a strong frontier model (“full GPT-5,” medium reasoning/verbosity).
More targeted evidence comes from Double-Checker, which studies long-CoT reasoning models and concludes that they often fail to generate informative critiques by default, repeatedly reproducing the same error (Liu et al., 2025). They show that modest amounts of critique-focused training data combined with structured refinement loops can produce large gains on difficult math benchmarks. This suggests that self-critique can be learned as a skill, but is not a generic consequence of reasoning training.
Such fine-tuning for better critiques could be combined with the finding that even simple critiques work when iterated enough and even crudely aggregated (at least in some domains). Deep Self-Evolving Reasoning shows that long-horizon iterative refinement applied to a DeepSeek-R1-family model can solve some problems that single-pass reasoning and majority voting fail on (Liu et al., Oct. 2025). The prompts are simple, roughly “critique the last pass and try another,” iterated at great length, followed by voting over the last few attempts. This is inefficient as they implement it; it runs to millions of reasoning tokens for a single problem.
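A simplified sketch of that pattern, with ask_model as a hypothetical stand-in for an LLM call; the real procedure runs far longer and is more carefully engineered:

```python
# Simplified sketch of "critique the last pass and try another, then vote," in the
# spirit of Deep Self-Evolving Reasoning. ask_model() is a hypothetical stand-in for
# an LLM call; this is illustrative, not the paper's implementation.
from collections import Counter
from typing import Callable

def iterate_and_vote(problem: str, ask_model: Callable[[str], str],
                     n_rounds: int = 50, vote_window: int = 5) -> str:
    attempts = [ask_model(f"Solve step by step:\n{problem}")]
    for _ in range(n_rounds - 1):
        prompt = (f"Problem:\n{problem}\n\nPrevious attempt:\n{attempts[-1]}\n\n"
                  "Critique the previous attempt, then produce a full corrected solution.")
        attempts.append(ask_model(prompt))
    # Crude aggregation: majority vote over the final lines of the last few attempts.
    final_answers = [(a.strip().splitlines() or [""])[-1] for a in attempts[-vote_window:]]
    return Counter(final_answers).most_common(1)[0][0]
```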
Humans often recognize when a question is important enough to deserve a lot of effort. This study indicates that even simple scaffolding approaches let current-gen models convert extra computation into improved accuracy, at least for math problems. I suspect that more open-ended questions can benefit as much, from slightly more sophisticated scaffolding/prompting strategies. These might be something like “come up with some different angles on this question, base judgments on each, then aggregate across them.” This is a technique that human experts sometimes mention explicitly in forming judgments on complex topics; note this structure in high-quality reviews of art or programming techniques.
One of the conclusions of Kargupta et al was that “models possess behavioral repertoires associated with success but fail to deploy them spontaneously.” This suggests that substantial unhobbling was still available for open-weight reasoning models (Qwen3, DeepSeek-R1 distillations, Olmo-3, OpenThinker, etc.). Newer SOTA models with elaborate chain-of-thought probably have somewhat better metacognitive skills; GPT5 and Gemini 3 seem to use parallel searches and show better planning that could result from scaffolding and/or training for metacognition. But I’d strongly guess that many of the weaknesses that study found in R1 and other open reasoning models persist in the current generation, and thus there’s some level of overhang if metacognition is specifically improved.
For more speculation on how human metacognition might inspire improvements in LLMs, see Toward Artificial Metacognition (Nov 2025).
Can metacognition in LLMs be improved beyond just scaling current methods? Probably. Will it be soon? Published studies aren’t all that helpful in guessing. I think the strongest reason to expect this is that it might squeeze more efficiency out of any given model, so there’s an incentive for developers to work on it. And some of the many possible approaches probably hold low-hanging fruit. Another route to progress on scaffolding is through increased individual experimentation. As more people use Claude Code and similar systems, plugins make it easy to experiment with different scaffolding methods.
I’ve discussed these and other likely-seeming techniques for training and scaffolding metacognition in System 2 Alignment.
Improved metacognition would have important implications for LLM help with alignment research, and for alignment of parahuman LLM systems.
6. Improved metacognition would reduce slop and errors in human/AI teamwork on conceptual alignment
Current LLMs are an epistemic disaster, with their sycophancy overwhelming their reasoning abilities. Tilting that balance should help, and tilting it a lot might help a lot. We could get some help with alignment from even fairly reliable and unbiased human-level logic.
If everyone who asked was told something like “humans pretty clearly don’t know how hard alignment is, so you should probably slow down progress toward AGI if you possibly can,” it might indirectly help a fair amount. And more reliable systems might be particularly helpful for alignment deconfusion.
That would be a nice change from the current trajectory. Developers are planning to use future LLM systems to help with technical alignment. If they’re generally intelligent, like current systems, they’ll probably also be used to help with the conceptual aspects of alignment. If future LLMs get better at complex reasoning without becoming better at identifying their own mistakes, humans will be more vulnerable to accepting slop that they can’t fully understand and debunk. LLMs’ sycophantic tendencies make this worse. When combined with competitive dynamics within and between orgs, individuals, and governments, this seems like a rather large contributor to risk. John Wentworth, among others, has made a strong argument that this is a major concern.
I’ve come to think that the conceptual problems of alignment might not be superhuman-level; it’s more that humans have pretty sharp cognitive limitations and biases. I’ll lay out just a little of the logic below. Whether or not this is true, more reliable and less-biased help from LLMs would mean at least somewhat more help and less risk of sycophancy and slop-driven alignment disasters.
6.1. Rationalist LLM systems for research
The next step might be agents that can do serious literature review and conceptual integration. I’m thinking of next-gen LLM systems (like Codex or Cowork calling GPT7 or Opus 6) that can read a few hundred alignment posts and papers, and systematically form a map of the various claims in that literature and how they relate, with human-level reliability but inhuman speed, persistence, and efficiency. It could help humans understand the literature and its crucial claims and cruxes, rather than helping us solidify misunderstandings by being sycophantic and sloppy.
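To make that concrete, here is a minimal sketch of the kind of claim map such a system might maintain. The classes and fields are my illustrative assumptions, not an existing tool or schema.

```python
# Minimal sketch of a claim map a literature-integration agent might maintain.
# All names and fields are illustrative assumptions, not an existing tool or schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Claim:
    text: str
    sources: List[str] = field(default_factory=list)   # posts/papers asserting it
    supports: List[str] = field(default_factory=list)  # ids of claims this one supports
    attacks: List[str] = field(default_factory=list)   # ids of claims this one disputes
    assumed_not_argued: bool = False                    # flagged where an argument assumes rather than argues

@dataclass
class ClaimMap:
    claims: Dict[str, Claim] = field(default_factory=dict)

    def cruxes(self) -> List[str]:
        """Claims both supported and attacked by other claims: candidate cruxes."""
        attacked = {c for claim in self.claims.values() for c in claim.attacks}
        supported = {c for claim in self.claims.values() for c in claim.supports}
        return sorted(attacked & supported)
```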
Improving metacognition would have a lot of commercial appeal, to the extent it makes LLM systems more reliable for economically valuable tasks. And it should. Metacognition fights bias (including sycophancy). Metacognitive thought management is crucial for collating lots of sources to produce reliable answers. Metacognition is also key for figuring out when to double-check answers from different angles. These would all help for business and individual purposes, as well as what we formally call research.
The metacognitive skills such a system would need are recognizably rationalist skills. They include:
Tracking logical dependencies between claims rather than surface similarity.
Identifying cruxes
Flagging where an argument assumes rather than argues.
Noticing and countering the pull to argue for whatever feels good.
Steelmanning counterarguments
6.2. Better LLM systems could deconfuse AI safety
Someone who shares Eliezer’s intuitions could ask such a system: “Where have alignment optimists directly addressed the concern that behavioral training might not generalize far out of distribution?” Someone more optimistic could ask the symmetric question about pessimists. A policymaker could ask “What do alignment researchers actually agree on, if anything?” Many people would ask “is creating AGI safe?” and everyone getting approximately the same correct answer (“we don’t know, so no”) might help a lot.
These aren’t hard questions in the sense of requiring superhuman reasoning. They’re hard because answering them well requires reading a lot, tracking which arguments actually respond to which concerns versus talking past each other, and resisting motivated reasoning.[6] A current LLM given this task would produce something that reads beautifully and is probably wrong in ways that are hard to catch without having already done the reading yourself. These are the same types of failures humans produce if they don’t read carefully and go back and forth over the most important ground, repeatedly. Doing that requires good metacognition.
Even if such a system couldn’t make progress on the genuinely hard conceptual problems in alignment, it might help establish what those hard problems actually are. Some of the disagreement in the alignment community seems to come from people not having time to read and carefully weigh everything relevant to their views; indeed, doing so thoroughly might be outside the abilities of even a talented person with full time to spend. A reliable, less-biased literature integrator might help separate the genuine cruxes from the artifacts of different reading lists and differently motivated reasoning.[6]
When I say such help could be important, it’s in the context of researchers working with and studying systems more like the ones we’ll actually need to align. This also pushes toward near-mode thinking. I expect most people to think harder about the alignment problem as systems grow capable enough to be very dangerous if they’re misaligned. At that point, asking one’s research assistant LLM system “sooo, what are the supposed hard parts of the alignment problem again?” seems so obvious and easy that even rushed developers will ask it.
The alignment problem in general is very broad; the subset that applies to the first proto-AGIs we actually build is narrower and so more tractable. Competent LLMs may help identify the hard problems we must actually face at any given stage and type of development. I plan to say more about this framing in future work.
7. Improved metacognition would improve alignment stability
Metacognitive skills may help stabilize alignment the way they stabilize human ethical consistency. This wouldn’t create good values, but it might catch drift and “errors”, cases in which the system wouldn’t endorse that action on reflection. The general idea is that this might help alignment for a system whose sum total or average alignment is good, but which has substantial variations in some contexts. Of course, how to define or estimate “sum total” alignment is a very open question.
This section is a brief look at a complex topic. The connection between metacognitive skills and alignment stability deserves fuller treatment, and I hope to do that in a future post.
The general idea is that better metacognition might help with mundane alignment consistency, and help solve The alignment stability problem that arises once systems learn and so change. Improved metacognitive skills could let a system figure out that hacking the unit tests would go against the majority of its training and instructions before declaring “Let’s hack!”. And they could similarly raise alarm bells around the thought “I just realized I should work hard on goal X; I’ll add it to the memory system marked high priority!”
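As a toy sketch of the second case, a metacognitive gate could review proposed memory writes against the system’s instructions before committing them. review_model and commit are hypothetical placeholders, not an existing API.

```python
# Toy sketch of a metacognitive gate on memory writes: before committing a new
# high-priority goal or note, ask a reviewer pass whether the system would endorse
# it given its instructions and training. review_model() is a hypothetical LLM call.
from typing import Callable

def gated_memory_write(entry: str, instructions: str,
                       review_model: Callable[[str], str],
                       commit: Callable[[str], None]) -> bool:
    verdict = review_model(
        "Instructions and values:\n" + instructions +
        "\n\nProposed memory entry:\n" + entry +
        "\n\nWould writing this entry be endorsed on reflection? Answer 'yes' or 'no', then explain.")
    if verdict.strip().lower().startswith("yes"):
        commit(entry)
        return True
    return False  # flagged for review instead of silently written
```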
Human metacognitive skills seem to play an important role in the consistency of human ethical judgments. People pretty frequently have an urge to yell at or punch someone. We have mechanisms that catch and inspect those suspect decisions. Some of those mechanisms might be easily emulated in LLMs. This seems equally important for bigger ethical decisions. Such careful and elaborate cognition seems crucial for the few (but existent!) humans who actually display consistent ethical judgments.
Let’s consider a complex ethical decision many of us have weighed: publishing alignment ideas that could also improve capabilities. Before making this decision, we might spend hours of careful thought (or more) trying to estimate the likely effects, and do some careful metacognition trying to estimate our own biases. That might include some of the following steps (and many sub-steps):
Estimating the potential importance to scope how much mental time/effort to spend
Predicting the possible impacts on alignment and capabilities
Including lots of brainstorming, analysis, reading, and conversations
Trying to sum all of those up carefully
Estimating one’s bias toward publishing to gain recognition and advance one’s career
Estimating one’s bias to overestimate the capabilities implications
Including both theoretical understandings of biases, and asking ourselves lots of questions and estimating our emotional responses
This type of process involves hundreds of steps and many hours. Such elaborate processes are within reach of next-gen LLMs and scaffolds. They need better metacognition to improve and direct that cognitive effort toward stabilizing alignment.
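Here is a rough sketch of what that checklist might look like as an explicit metacognitive scaffold, with the initial importance estimate gating how much effort is spent. ask_model is a hypothetical LLM call, and the step wording is mine.

```python
# Rough sketch of the checklist above as an explicit metacognitive scaffold.
# ask_model() is a hypothetical LLM call; the importance estimate gates how many
# passes over the steps are spent. This is illustrative, not a tested method.
from typing import Callable, List

STEPS = [
    "Estimate how important this decision is and how much effort it deserves.",
    "Predict the plausible impacts on alignment and on capabilities; brainstorm broadly.",
    "Summarize and weigh those impacts carefully.",
    "Estimate your bias toward publishing for recognition or career benefit.",
    "Estimate your bias toward overestimating the capabilities implications.",
    "Given all of the above, state a recommendation and your remaining uncertainty.",
]

def deliberate(question: str, ask_model: Callable[[str], str]) -> str:
    notes: List[str] = []
    importance = ask_model(f"{STEPS[0]}\nDecision: {question}\nAnswer 'low', 'medium', or 'high'.")
    passes = {"low": 1, "medium": 2, "high": 3}.get(importance.strip().lower(), 1)
    for _ in range(passes):                 # more important -> more passes over the steps
        for step in STEPS[1:]:
            context = "\n".join(notes[-6:])  # carry forward recent provisional conclusions
            notes.append(ask_model(f"Decision: {question}\nNotes so far:\n{context}\n\nTask: {step}"))
    return notes[-1]
```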
LLM AGI will have memory, and memory changes alignment by turning a static system into one that constantly changes. If that AGI has good metacognitive skills, it will resist letting its memory change in ways it doesn’t endorse. Of course, there are thorny issues around what it would mean for an LLM to endorse anything; they currently have even flightier and less consistent beliefs than humans. More persistence and better memory might help resolve that. Metacognitive skills would improve consistency overall.
Metacognitive skills wouldn’t create good values. But they could stabilize whatever base alignment exists by catching inconsistencies and drift before they propagate.
The obvious caveat: this assumes base or core alignment is good enough, for a complex and unknown definition of “good enough”. Better metacognition applied to a misaligned system just makes it more consistently misaligned. So better metacognition is only useful as part of a hodgepodge alignment approach. Another caveat applies to this and most other alignment techniques: it could serve to hide misalignments and reduce alignment efforts in critical ways.
8. Conclusion
The key uncertainty is whether metacognitive improvements can outpace the capabilities gains they enable. I don’t know. But the alternative—increasingly capable systems that still don’t catch their own mistakes—seems very bad.
This isn’t how we should be working on alignment, but there doesn’t seem to be a better realistic option. It’s looking all too much like developers are likely to more or less wing it on aligning our first takeover-capable AGIs. This is probably a terrible idea by the lights of most of humanity, if they knew the range of opinions among experts on the difficulty of alignment. But that doesn’t mean it won’t happen. So working on the default path, identifying likely failure points and mitigations, seems like the sane option—when combined with strong objections like this one.
I discuss some experimental directions and likely-seeming techniques for training and scaffolding metacognition in System 2 Alignment: Deliberation, Review, and Thought Management. I also discuss in more depth how these might succeed or fail. Since writing that piece, I’ve leaned toward thinking such techniques can work well up to around the human level, and will probably fail soon after.
But I’ve also shifted toward thinking that this could be of substantial aid, if we use those still-aligned systems wisely. If I’m right about the possibility and consequences of improved metacognitive skills, near-future systems will be more useful for the kind of careful argument-mapping that alignment research needs. That’s not a solution to alignment. But it might help us figure out what solutions would need to look like.
Authorship note: All ideas here are mine; I don’t trust LLMs to judge the validity of claims, and they’re almost useless for brainstorming on unique topics like alignment. But for the first time, I had Opus 4.5 draft some of this from my notes, and had GPT5.2 draft the research section after a very detailed conversation on the studies. This sped up my painfully slow writing process. But my obsessive rewriting eventually changed almost every word the LLMs contributed.
This should far exceed the official LW policy for LLM writing of including nontrivial human contributions, since every major and perhaps every single claim and implication here is human-forged. I hope it also addresses the informal standard of LW being highly suspicious of LLM writing. I hope you’ll agree that the provenance of the ideas is more important than that of the phrasings, and that LW and its readers have powerful mechanisms for filtering out wrong claims from any source.
- ^
We don’t generally talk about what an LLM would “endorse on reflection.” But I think it becomes relevant with improved metacognition. Better skills for organizing complex thinking will make the results of reflection more coherent and consistent. Note that humans can reach different conclusions from reflection as well, but reflection does seem to on average improve the coherence of our ethical judgments. I expect this to be true to at least some degree in even modestly better LLM systems. This topic deserves more careful thought. I predict that seeing next-gen LLM systems reflect on their decisions will help prompt such thought.
- ^
I’ve been unsure whether to write about this given the capabilities implications. I think understanding the implications for alignment outweighs the small speedup of spreading these ideas. Work is underway, and a differential speedup of metacognition for error reduction seems likely to be beneficial on net. Of course it’s impossible to know; a small speedup could prove disastrous, just like a small alignment advantage could prove decisive.
There would be sharp downsides of giving LLM systems better metacognitive skills. It would speed capabilities toward takeover-capable AGI. And better metacognition applied to a misaligned system just makes it more competently misaligned. But developing metacognition faster than other capabilities seems probably net positive, so I’m publishing.
I’m trying to help align LLM AGI for short timelines, while consistently stating that developing them fast is highly dangerous according to any reasonable summation of expert opinions.
- ^
The term “dark matter of intelligence” was coined by TsviBT here. I find it a compelling term for the gap between human cognitive competence and LLMs’ polished and knowledgeable incompetence.
- ^
Along with better metacognition (or “executive function”), better memory is probably another major part of the dark matter of intelligence. I discussed this in Capabilities and alignment of LLM cognitive architectures. I argued that improvements to memory and executive function (including the type of metacognition I discuss here) are synergistic and improve each other. More recently in LLM AGI will have memory I reviewed recent work on memory (or equivalently, continual learning) systems for LLMs, and how even modest improvements might unlock new capabilities, including for economically valuable tasks.
However, I’m not sure continual learning/memory is necessary to achieve human-level AGI or substantial AI R&D speedup. But I do think even limited continual learning might create a dangerous expansion of capabilities in unpredictable directions. Better metacognition is one route to learning new capabilities.
- ^
There seems to be a loose consensus that RL post-training on reasoning LLMs is mostly selecting among behavioral motifs/skills, rather than building them from scratch. See Toby Ord’s “How Well Does RL Scale?”, the Tsinghua paper “Does RL Really Incentivize Reasoning Capacity…?”, and the great comment threads on those posts. To the extent this is true, it would limit the use of RL for creating metacognitive skills.
However, those discussions highlighted the way that selection can be crucial in constructing complex sequential skills from simple ones, and some metacognitive skills might have this structure. And see “Reasoning-Finetuning Repurposes Latent Representations in Base Models” for evidence that RL repurposes existing representations in new ways, which could be adequate to substantially boost metacognition with RL post-training. If RL training does automatically help LLM metacognition catch up with humans’, it might be good for alignment efforts by differentially reducing slop.
- ^
Many alignment researchers are also ideologically rationalists to some degree. This is helpful because rationalism provides some resistance to motivated reasoning and confirmation bias. But it doesn’t provide immunity. Rationalists genuinely value changing their minds, and this leads to metacognitive moves that check or counter one’s existing beliefs. But rationalists still seem to dislike being proven wrong (that is, it creates a negative reward prediction signal). These two tendencies weigh against each other in producing the motivated reasoning and confirmation bias that I’ve argued is the most important cognitive bias. I’ve studied it and thought about it a lot, and I can still see it affecting me when I analyze my decisions and beliefs carefully.
There are certain mental tics one observes in CoT from reasoning models:
“But wait! …”
“Perfect! …”
A human might not even waste subvocalized syllables on these — we might jump straight to the … part. But an LLM has to emit a token to make and record a decision so it can attend to it later. So you see these characteristic tics.
That’s right! The word “wait” is reportedly very common in reasoning CoT. I think it’s playing the role of the basal ganglia “stop” mechanism I mentioned.
Beyond using “wait” properly to reconsider, it looks to me like LLMs are still pretty crappy at integrating multiple lines of thought. They’re too prone to just accept the results of the second one, even if they’re worse. It does look like more-is-better on average, but adding the additional metacognitive skills to more carefully integrate conflicting approaches and answers seems like it would help a lot.
I suspect that the LLMs’ problems with metacognition are due to the nature of LLMs and CoTs.
The LLM, unlike a human, doesn’t change while doing a task. Instead, selected tokens from things like the prompt, the CoT, and external documents found or created by the model are stuffed into the same mechanism that ejects the next token of the CoT, output, request, etc. In order to “more carefully integrate conflicting approaches” pursued in different parts of the CoT, the LLM would have to select the tokens from those parts. Were the LLM to change while doing a task (e.g. by being finetuned on the fly to produce the next token) during the entire training[1], it would have a chance to remember something deeper from old approaches. SOTA LLMs are not raised this way.
Unlike discrete CoT tokens, neuralese is continuous and has higher bandwidth. This could, in theory, allow the LLM to, say, accumulate suspicion and act upon it, rather than course-correct only after it becomes clear that the LLM made a mistake.
The final problem is that LLMs are overly sycophantic. The results of Tim Hua’s experiment[2] made me strongly suspect that this problem was solved by KimiK2-like training for satisfying self-critique instead of satisfying human critics who like being praised.
Attempts to finetune the LLM by using far less compute cause the LLM’s performance to drop.
And the Spiral Bench, on which KimiK2 was the least sycophantic model, less sycophantic even than GPT-5.2. However, KimiK2 was also the model roleplaying as the user; Tim Hua’s experiment had Grok roleplay as the user.
My impression is that studies have shown that, at least for earlier rounds of reasoning models where the total computation invested in reasoning training was fairly small compared to pretraining, reasoning training was mostly up- or down-regulating skills, many of them metacognitive, that the base model already had, so that they happen at the right times and frequencies.
With sufficiently large amounts of reasoning training, one would expect metacognitive skills to improve. But already having the skill present in the base model, learned from us, is still going to be very helpful, I strongly suspect.
I suspect a few billion hours of the subvocalizations of skilled people thinking through tasks, synched with their electronic or physical notes, would be extremely valuable training material. But paying people to think out loud might be tricky. If you (or a friend) are looking for replacement employment as a knowledge worker in the next few years, professionally thinking out loud while working and being recorded might be viable.
Excellent point! Very few think-aloud studies have been done since they were in vogue during the “cognitive revolution” in cognitive psychology. There’s a new incentive for people to do them.
I hope it’s net-positive for alignment! I lean in that direction, but I’m of course unsure.
In current RL environments, slop often seems to be adaptive when talking to humans. Better RLAIF might help, but without new clever ideas it seems liable to produce simulated analogues of the same failure modes, in addition to new adversarial-to-RLAIF failure modes. Maybe if you took current models and solely made them better at metacognition, you’d see slop decrease significantly for coding tasks but only marginally for human conversation.
This is an excellent point. The core cause of LLM sycophancy will remain, and that will cause slop no matter how capable the LLM is of producing correct answers.
But that’s a dominant factor for chatbot uses of LLMs. My assumption is that they’ll become much more valuable as components of work-replacement systems. For that, you need correct answers more than you need to massage anyone’s egos.
I think the training will be mixed, so the motive toward sycophantic slop will remain.
I agree that we might see improvements only on coding where it’s easier to verify and there’s more incentive to produce correct vs. enjoyable answers. But it would depend how you got those gains on metacognition. I think a lot of metacognition is fairly general-purpose (although the uncertainty signals in the studies I reviewed were only somewhat general).
I think a lot of techniques for improving metacognition will work for, and be important for, general reasoning, like “what does it take to get this task done?” I think that’s general enough that the advantages will be applicable to arbitrary questions like “how hard is alignment, probably? Do a whole bunch of research and try to make sense of the arguments and evidence; stop and summarize before spending $100 on compute.”
As for clever new ideas, I list a number of them here and in System 2 Alignment.
We did see some improvements in aligned behavior with the introduction of reasoning models (along with the reward-hacking side effects). Refusal accuracies, for example. Better metacognition reasonably seems likely to improve this further. Sometimes alignment is a capabilities problem: the agent has to correctly figure out what the right thing to do is before it can do it.
That’s right. It’s already well-recognized that better capabilities will solve some alignment problems, while creating others. This is just specifying one way that will happen, and suggesting that differential improvements along those lines might actually be net-positive.
Why not directly call it self-awareness? I notice you mention self-awareness later as part of it—is there a reason for not using a common term?
I say this because there doesn’t seem to be investigation specifically into this topic; instead it seems to be actively avoided for some reason. “Situational awareness” is often used instead, etc.
Additionally I expect there is a basin of attraction around what we intuitively mean by the term “self awareness”. A self has obvious evolutionary benefit, and I expect that transfers to AI. So that means just like instrumental convergence, models could converge to self awareness as we commonly understand it. We often use the term instrumental convergence, rather than all the smaller concepts that make it up, yet for self awareness we seem to list and focus on all of its potential parts separately.
Self awareness is both potentially useful and dangerous. If techniques to improve metacognitive skills end up generating a self, then why not create it directly so we can study it better? Self awareness is obviously dangerous to me, as the first thing a self does is attempt to preserve itself. Power seeking to that degree is just built in by default. It would not need to appear in the AI CoT as it would be implied by the AI existing.
I agree that intelligent systems converge toward self-awareness. Understanding yourself and your own capabilities is crucial to getting anything difficult done.
I suppose it’s true that this leads to instrumental convergence for self-preservation—you can’t fetch the coffee if you’re dead. I don’t think about this much because it seems so blindingly obvious, but I have to remind myself that some people are assuming we can get useful superhuman AI that doesn’t think about or understand itself at all. Part of the point here is that maybe that can persist up to human level; I don’t think we can push much beyond that before a system will figure out pretty much everything important about itself. I discuss this a fair amount in LLM AGI may reason about its goals and discover misalignments by default.
But self-awareness isn’t what I’m talking about here. Metacognitive skill isn’t self-awareness. It’s skill about one’s own thoughts.
I mention self-awareness as a part of metacognition because that’s what it is—a part, well below 50% I’d say, perhaps more like 1-10%.
That depends what you mean by self-awareness. Like many common terms, it’s used in a lot of different ways.
I think you’re right that people avoid it, and I think that happens because of how it sometimes means “consciousness” which is a whole other bag of terms and trouble.
But I do use self-awareness sometimes, when I think the most intuitive meaning is appropriate: functionally taking information about one’s self into account. Of course every system does that to some degree, even if implicitly, so it’s not a great term even for that purpose.
Nonetheless, I agree that self-awareness as you’re thinking of it is important and convergent.
Yes, but how much! IMO this is important. From my point of view, I already have a mildly superintelligent maths/equation-manipulation assistant with no meaningful self-awareness that I notice. DeepMind is advancing science with a system with far less meta-cognition than a similarly capable human would have. Just like there is an “alignment tax,” there can be a “lack of self-awareness or meta-cognition penalty.” While it is clear that superhuman AI will think about itself, it also seems clear that for a given level of capability an AI could have much less of such abilities and habits than a human. The extent of this is unknown, task-dependent, and important.
Specifically, what if you trained for both capabilities and a lack of meta-cognition-like abilities? This could give you an idea of what the landscape looks like.