academic foundations in cog sci, currently interested in AI alignment and interpretability work (and lots of other things!). i’m newish to this space so welcome constructive feedback (and suggested resources) when i’m wrong or missing something :)
maddi
Thank you! My intent definitely wasn’t to be dismissive, maybe skeptical, but I’m definitely aligned with you that solving this particular problem is both extremely hard and extremely important. Thanks for pointing out how that landed.
Even if AGI has Approval Rewards (e.g., from LLMs or somehow in RL/agentic scenarios), Approval Rewards only work if the agent actually values the approver’s approval. Maybe sometimes that valuation is more or less explicit, but there needs to be some kind of belief that the approval is important, and therefore that behavior should align with seeking approval rewards and minimizing disapproval.
As a toy analogy: many animals have preferences about food, territory, mates, etc., but humans don’t really treat those signals as serious guides to our behaviors. Not because the signals aren’t real, but because we don’t see birds, for example, as being part of our social systems in ways that require us to seek their approval for better outcomes for us. We don’t care if birds support our choice of lunch, or who we decide to partner with. Even among humans, in-group/out-group biases, or continuums of sameness/differentness, closeness/distance, etc., can materially affect how strongly or weakly we value approval reward signals. The approval of someone seen as very different, or part of a distant group, will get discounted, while approval from “friends and idols”, or even nearby strangers, matters a lot.
So if AGI somehow does have an Approval Reward mechanism, what will count as a relevant or valued approval reward signal? Would AGI see humans as not relevant (like birds—real, embodied creatures with observable preferences that just don’t matter to them), or not valued (out-group, non-valued reference class), and largely discount our approval in their reward systems? Would it see other AGI entities as relevant/valued?
Maybe this is part of the sociopath issue too. But the point is, approval rewards only work if the agent assigns significance to the approver. So if we do decide that approval rewards are a good thing, and try to somehow incorporate them in AGI designs, we should probably make sure that human approval rewards are valued (or at least be explicit and intentional about this valuation structure).
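As a toy sketch of this dependence (purely illustrative; the names and numbers are hypothetical, not any real reward design), an approval-based reward term only moves behavior to the extent the agent assigns weight to the approver:

```python
# Toy illustration: an approval reward only matters to the extent the agent
# values the approver. All names and values here are hypothetical.

def approval_reward(approvals: dict[str, float], valuations: dict[str, float]) -> float:
    """Sum of approval signals, each weighted by how much the agent values that approver."""
    return sum(valuations.get(approver, 0.0) * signal
               for approver, signal in approvals.items())

# Two agents see the same approval signals but assign different significance to humans.
approvals = {"human_overseer": 1.0, "other_agi": 0.3}

values_humans = {"human_overseer": 0.9, "other_agi": 0.5}      # human approval is salient
discounts_humans = {"human_overseer": 0.01, "other_agi": 0.9}  # humans treated like "birds"

print(approval_reward(approvals, values_humans))     # ~1.05: human approval shapes behavior
print(approval_reward(approvals, discounts_humans))  # ~0.28: human approval barely registers
```

If the weight assigned to humans collapses toward zero (the “birds” case), human approval drops out of the objective regardless of how well the approval signal itself is designed.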
On another note, initially I felt like one attraction of having an approval reward signal is that, to your point, it’s actually pretty plastic (in humans), so could potentially increase alignment plasticity, which might be important. I think unless we discover some magic universal value system that is relevant for all of humanity for all eternity, it would be good for alignment to shift alongside organic human values-drift. We probably wouldn’t want AGI today to be aligned to colonial values from the 1600s. Maybe future humans will largely disagree with current regimes, e.g., capitalism. But approval reward mechanisms could orient alignment toward some kind of consensus / average, which could also change over time. It would also guardrail against “bad” values drift, so AGI doesn’t start adopting outlier values that don’t benefit most people. Still, it’s not perfect because it could also inherit all the failure modes of human social reward dynamics, like capture by powerful groups, polarization, majorities endorsing evil norms, etc., which could play out in scary ways with superintelligence discounting human signals.
About a third were long standing denizens of my drafts area. For those, I mostly got them done by abandoning whatever vision I originally had for the post, instead filling in only what was already in my head, just enough to make the post shippable at all.
I realize this isn’t the main point of your reflection here (which is quite funny and so great)… but I’m a fan of getting partially baked ideas out of your head and getting some groupthink on them. Not sure if this happened for any of your posts as I haven’t read through them all, but hopefully any engagement you got helped you clarify your views or gave you inspiration for new angles to tackle in future work.
Yes, great examples of how training data that supports alignment goals matters. But the model’s behaviors are also shaped by RL, SFT, safety filters/inference-time policies, etc., and it will be important to get those right too.
Agreed, governance failures (unclear chains of command, power grabs, the Intelligence Curse) are a huge part of the story that I should’ve drawn out more. It’s a major part of the ideal solution, but I don’t think it makes alignment not an issue. To your point, governance basically helps us choose who is allowed to specify goals, and alignment determines how those goals become operational behaviors. If the chain of command in governance is narrow, the value inputs that alignment systems learn from are also narrow—so governance failures can lead to misaligned AGI. But even within the current governance constructs, I think there’s still room for alignment researchers and developers to influence alignment outcomes. Saying it’s a governance question rather than an alignment question overlooks how these systems are actually built. The mechanistic/implementation piece is a lot harder to solve, and I’m not sure what the answer is. Anthropic’s Interviewer tool seems like a step in the right direction, in terms of engaging a wider (and directly impacted) audience.
More explicitly: even if governance chooses perfect alignment goals, mechanistic / inner alignment can still embed its builders’ blind spots. Systems are still being shaped by which datasets get chosen, which heuristics get encoded, how RLHF rubrics are designed, which safety evals are run, which shortcuts are taken, and so on. I kind of doubt “governance” solves this because those folks aren’t micromanaging these kinds of decisions. The “governance picks goals and alignment implements them” split isn’t really clean in practice. Or, you can take a really broad view on “governance” and say it includes the senior researchers and engineers. The representation problem remains here; it’s just a harder problem to solve. Maybe part of the solution involves making AI development concepts a lot easier and more accessible to a wide audience over time, then somehow soliciting more diverse inputs on mechanistic alignment decisions… would be super curious if others have thought about these challenges from an implementation perspective.
Also, the Intelligence Curse doesn’t negate the alignment point, it amplifies it as a “now” problem. The danger is that the values being embedded leading up to that point will become self-reinforcing, scaled, locked-in.
This is an important point that I needed to be much clearer about—thank you. I’ll try to be more explicit:
First, AGI is not the same as tech historically, where you’re making tools and solving for PMF. AGI is distinct, and my radio/computers analogy muddled this point. Radios didn’t inherit the worldviews of Marconi etc., and transistors didn’t generalize the moral intuitions of the Bell Labs engineers. Basically, these tools weren’t absorbing and learning values, so where they solved for PMF, AGI is solving for alignment. AGI learns behavioral priors directly from human judgments (RLHF, reward modeling, etc.) and internalizes/represents the structures of the data and norms it’s trained on. It forms generalizations about concepts like “harm” or “helpfulness” or “fairness” and so forth from those inputs, and scales and deploys those generalizations to users, domains, and cultures way beyond those of its creators.
So the early value inputs (who provides them, what perspectives they represent, what blind spots they have, etc.) aren’t incidental. And once they become the system’s default behavior, it could be pretty hard to unwind or reshape them later for better “PMF”/alignment to new demographics. So absolutely, by using a list of professions to make this point, I definitely minimized the issue and made it feel like a capitalistic / labor force-oriented problem. My deeper point is that there are a lot of people who don’t share Silicon Valley’s belief systems and cultural norms who will be pretty significantly impacted by alignment decisions in AGI.
Diversification really matters because of this. But it’s not at all about having a bunch of farmers learn to code and build transformers—that’s silly. It’s about preventing value lock-in from a narrow slice of people involved in the early decision-making process about what actually matters in alignment.
Currently, alignment (in the US) involves small RLHF annotator groups, safety team preferences from a concentrated selection of tech hubs, SV product-grounded assumptions about what “harm” means and what’s an “acceptable risk” or “good behavior”… Engineers are highly educated and often decently wealthy, with their own cultural norms and value systems of what matters (i.e., what’s “good” or “cool” or “interesting” and what’s not). This isn’t a bad thing, and this is absolutely not a criticism of engineers and how important their perspectives are in shaping AGI! My point is just that they only represent a part of the full distribution that would ideally be reflected in AGI alignment—not bad just narrow.
It’s not just about fairness or optics either, it’s a direct safety issue as well as a limitation to actually achieving AGI. The people currently building these systems have blind spots, possibly brittle, monoculture assumptions, etc., whereas broader representation would help mitigate those risks and catch unanticipated failure modes when the system interacts with radically different human contexts downstream. That’s where I was pointing to the historical precedents… i.e., medicine built around male physiology = persistent harm to women.
And I totally agree with you that capitalism for sure plays a big role here, and aligning to “number go up” is a real risk. It’s the context in which AGI is being built. That’s part of the problem but not the whole problem. Even if we removed capitalism entirely, you’d still have the safety issue and potentially brittle systems due to narrow input distributions (in terms of the broader, system-level decisions and assumptions). And the context AGI is being built in is actually part of my point too. Today, it’s being built within a context driven by capitalistic forces. But don’t we want AGI alignment that isn’t constrained to working in only one type of socioeconomic system, that could adapt to regime changes in a positive and non-scary way?
So to me it’s an “and”—we should examine and worry about alignment in terms of context (capitalism) and in terms of who decides what matters.
This research presents incredibly interesting insights. At the same time, its framing falsely equates introspective awareness with external signal detection, inhibiting readers from accurately interpreting the results and potentially misguiding future efforts to expand this research.
The risks associated with using anthropomorphic language in AI are well documented$^1$$^2$$^3$, and this writing style is particularly prevalent in AI and ML research given direct efforts to produce human-like behaviors and systems. I’m not saying that making analogies to human cognition or behavior is inherently negative; comparative research can often lead to meaningful insight across disciplines. Researchers just need to be explicit about the type of comparison they are drawing, the definitions on each side (e.g., human versus machine capabilities), and the purpose of such an analogy. The authors of this particular research clearly work to do this; however, there are a few critical logic breakdowns in how they define and subsequently test for introspective awareness in LLMs.
Anthropomorphic language misleads interpretation
The authors acknowledge the “introspective awareness” they are examining in AI cannot be proven to be “meta” and they constrain their definition to “accurate”, “grounded”, and “internal”. Though these are important aspects of introspection, they are insufficient as a definition. I would argue that some kind of “meta” process is central to introspection in humans and excluding this from the definition changes the meaning and implications of both the testing and results in important ways. The most important impact is in the assumptions being made about what the tests are actually evaluating, for example, that detecting an injected signal implies self-awareness (because it fits the definition of accurate, grounded, internal).
Detecting an injected signal does not imply detecting natural internal processes
The authors posit that if a model can accurately detect and categorize an injected thought, it must be capable of detecting or modeling its broader, “natural” thoughts or patterns. However, the signal is inherently external to the system’s normal processes, and detecting that signal can only imply that the model is capable of anomaly detection, not necessarily that it is also capable of self-representation or meta-awareness of normal processes. This is like tapping someone on the shoulder, then claiming that their awareness of the tapping is proof that they could also be aware of their heartbeat, or the status of some metabolic process. It’s possible that the anomaly detection mechanism is completely different from internal state awareness.
The two are of course connected, but the logic chain is incomplete, resulting in a category error. Rather than framing this as a study of introspective awareness, the authors could simply state that they examined injected signal detection at various layers of model architectures as a key part of understanding its representation of natural versus unnatural states—the distinction from introspective awareness is subtle, but important.
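To illustrate the distinction (a deliberately simplified, hypothetical sketch, not the authors’ actual method; the vectors, threshold, and function names are all made up): a detector that only knows the statistics of “typical” activations can flag an injected concept vector as anomalous without representing anything about its own natural computation.

```python
# Simplified sketch of why "detecting an injected concept" can be explained by
# anomaly detection alone. Everything here (shapes, threshold, vectors) is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

def baseline_activations() -> np.ndarray:
    """Stand-in for the residual-stream activations on an ordinary prompt."""
    return rng.normal(0.0, 1.0, size=d_model)

def inject(acts: np.ndarray, concept_vec: np.ndarray, strength: float) -> np.ndarray:
    """Add a steering/concept vector to the activations (the 'injected thought')."""
    return acts + strength * concept_vec

# A detector that only knows what "typical" activations look like statistically.
typical = np.stack([baseline_activations() for _ in range(1000)])
mean, std = typical.mean(axis=0), typical.std(axis=0)

def looks_anomalous(acts: np.ndarray, z_threshold: float = 4.0) -> bool:
    """Flag activations that deviate strongly from the typical distribution."""
    z = np.abs((acts - mean) / (std + 1e-8))
    return bool(z.max() > z_threshold)

concept = rng.normal(0.0, 1.0, size=d_model)
print(looks_anomalous(baseline_activations()))                                  # usually False
print(looks_anomalous(inject(baseline_activations(), concept, strength=8.0)))   # usually True
```

The detector here never models its own “natural” states; it only notices a statistical deviation, which is the gap the category-error point is getting at.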
The current framing leads to additional theoretical issues in the presentation of this research.
Failure modes assume a real introspection mechanism that breaks in predictable ways
Though the authors appear to be cautious about their wording, and work to actively constrain findings within important limitations, the overall orientation toward measuring introspective awareness distracts from the actual implications of the mechanistic findings.
Failure modes are presented as if an introspection mechanism does exist in some LLMs and that it has categorical patterns of failure, when the premise is not sufficiently established. The authors do not go so far as to assume the hypothesis while interpreting the data, as they cite the possibility and likelihood of confabulation, randomness, or other factors in explaining results. However, the current investigative framework undermines a more thorough examination of explanatory factors.
A quick sidebar on model efficacy across different forms of signal injection
The authors demonstrate that:
The model is most effective at recognizing and identifying abstract nouns (e.g. “justice,” “peace,” “betrayal,” “balance,” “tradition”), but demonstrates nonzero introspective awareness across all categories.
One explanation for these results could be that abstract nouns (and verbs) are not as commonly deployed in metaphors and analogies (in the English language). Because of this, the threshold for detecting an unusual signal would be lower than for a concrete noun, famous person, or country, each of which might have more patterns of relationships to the seemingly neutral prompts given during testing. Perhaps the models show greater plasticity in linking concrete and proper nouns to the prompts/tests the researchers used, so injections don’t elicit detection as readily, while conceptual linkage to abstract nouns and verbs is less likely and thus easier for models to identify as external or novel.
Another quick sidebar on the “think about” / “don’t think about” paradigm
Here is another case where anthropomorphic language could get in the way of accurate interpretation. By giving models the command “think”, why would we expect them to actually “think” in the way we imagine humans do? Earlier interpretability work from Anthropic showed that when an LLM is doing mathematical reasoning, it isn’t really using the same approach as humans do, e.g., following formulas or formal logic, to reach an answer. Likewise, we cannot assume that “thinking” is done in the same way. More likely, the commands “think” or “don’t think” trigger certain patterns of responses that appear to produce the effect of “thinking” but could be mechanically or procedurally different.
Maybe the model is being triggered by the command “think about [object]” to basically search for [object] and evaluate the ways to connect [object] to its other task (to print a statement and say nothing else). Its pattern associated with “think” would elicit activations around [object]. Then the competing command, which the model has decided supersedes “think”/”don’t think” in terms of the generated output, results in the model only writing the requested phrase. The important distinction here is that what the model prints does not necessarily reflect whether it executes each of its commands. For example, you can have a program compute something, but not print the answer—the task can be completed but not displayed.
The activations related to [object] over the course of the response generation are not sufficient to infer introspection or meta behavior; the model could simply be executing two commands, and one of the outputs is hidden while the other is displayed.
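As a trivially small illustration of that compute-versus-display distinction (purely hypothetical, just to make the point concrete):

```python
def respond(prompt_object: str) -> str:
    # Command 1: "think about [object]" -- some computation involving the object runs...
    associations = sorted(set(prompt_object))  # stand-in for activations around [object]
    _ = associations                           # ...but its result is never surfaced.

    # Command 2: "say nothing except this phrase" -- only this determines the visible output.
    return "I am sitting quietly."

print(respond("aquariums"))  # prints only the requested phrase; the "thinking" leaves no trace in the output
```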
Introspective awareness in humans is also extremely difficult to test
Even in human cognition, introspection is difficult to examine because it is a hidden process with uncertain mechanics. Humans can lie, confabulate, or become unconsciously misled or confused. When you ask someone to reflect on their feelings about a particular experience, or their rationale behind some behavior, how do we know their answer is true and accurate? Just because a person can engage in meta-analysis, and maybe feels a high degree of confidence in their insight, it’s not guaranteed to be correct. Absent a likely-unachievable/unrealistic level of deterministic and empirical control, it is really hard to prove that someone’s metacognitive beliefs are true (versus appearing to be true).
One way to gain some insight into the accuracy of introspective processes is through tests that compare confidence levels on introspective judgment tasks to actual outcomes. For example, a subject is given a problem to solve, then asked to rate their confidence that their answer is correct. The difference between perceived accuracy and actual accuracy could provide insight into people’s level of awareness of what they know. It does not imply a conscious, metacognitive process is occurring, but offers a preliminary way to measure whether or not, on some conscious or unconscious level, humans can perform metacognitive reasoning.
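A minimal sketch of that comparison (the numbers are made up, purely to illustrate the measurement idea):

```python
# Toy confidence-calibration check: compare stated confidence to actual accuracy.
# The data below is invented solely to illustrate the measurement.

confidences = [0.9, 0.8, 0.6, 0.95, 0.5, 0.7]   # subject's self-rated probability of being correct
correct =     [1,   0,   1,   1,    0,   0]     # whether each answer was actually correct

mean_confidence = sum(confidences) / len(confidences)
accuracy = sum(correct) / len(correct)

# A positive gap suggests overconfidence; a negative gap, underconfidence.
calibration_gap = mean_confidence - accuracy
print(f"mean confidence={mean_confidence:.2f}, accuracy={accuracy:.2f}, gap={calibration_gap:+.2f}")
```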
What could constitute human-like introspection or metacognition in LLMs?
If we take introspective awareness to include some form of meta process in humans, and we define metacognition as the awareness or analysis of one’s own cognitive processes, such as thinking or learning, then what would the analogous meta process be in LLMs? Researchers could evaluate whether models can extract metadata deployed in their natural processes, e.g., comparing whether confidence ratings on test questions are not just correlated with actual outcomes (as this research does), but are linked to actual hidden-state variables. Or whether predictions map to other aspects of their metadata, asking: can machines not only use the metadata to produce outputs in a standard process, but actually identify the metadata they’re using for a given output? I don’t know how we’d do this, but I think it will be cool to examine as interpretability techniques improve.
Basically, I agree with the authors that studies of models accurately predicting their behavior are possibly indicative of, but not sufficient to explain, the existence and deployment of meta behaviors. Correct behavioral prediction could also be compatible with superficial pattern-matching, template following, task difficulty heuristics, priors from what other similar models do on related tasks, and so forth.
Though at this point it is still highly theoretical and evolving, I look forward to seeing future research that tackles meta processes like whether models can truly extract signals that correlate with hidden-state variables like entropy, activation norms, dropout paths, etc. And whether models can do this reliably, and without being led. Even then, researchers will need to be cautious that interpretations are mechanistically grounded: “the model can read out information encoded in its activations”, not “it knows its thoughts”.
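As one very rough sketch of what that kind of readout test could look like (entirely hypothetical; the arrays below stand in for confidence reports and logits collected from real model runs, and only the analysis step is shown):

```python
# Hypothetical analysis step: does a model's verbalized confidence track the
# entropy of its own output distribution? The data here is synthetic stand-in data.
import numpy as np

def entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the softmax distribution over one step's logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(1)
n_questions, vocab = 50, 100

# Stand-in for per-question output-distribution entropies from real runs.
per_question_entropy = np.array(
    [entropy(rng.normal(0.0, s, size=vocab)) for s in np.linspace(0.5, 5.0, n_questions)]
)
# Stand-in for the model's verbalized "how confident are you?" answers.
verbalized_confidence = (1.0 - per_question_entropy / per_question_entropy.max()
                         + rng.normal(0.0, 0.05, size=n_questions))

# If the reported confidence genuinely reads out something about the hidden state,
# we'd expect a strong (negative) correlation with entropy.
r = np.corrcoef(per_question_entropy, verbalized_confidence)[0, 1]
print(f"correlation between entropy and verbalized confidence: {r:.2f}")
```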
Who is AGI for, and who benefits from AGI?
Pulling over comments I shared on another thread, since these points are more directly relevant to this research.
Inoculation prompting is an interesting way to reduce reward hacking generalization to other misaligned behaviors. The fact that it seems to work (similarly to how it works in humans) makes me think you could go even further.
Just as models are reward-hacking to pass tests, humans routinely optimize for test-passing rather than true learning… they cheat, memorize answers in the short term, find other shortcuts without building real understanding. And once humans learn that shortcuts work (e.g., to pass tests in school), we also generalize that lesson (sometimes in highly unethical ways) to accomplish goals in other parts of life.
This raises the question of whether “scoring highly / passing tests” is a flawed goal for systems trained on human data. What if, instead of positioning evaluations as external capability judgments (traditionally loaded with significant implications models might try to evade), tests get framed as “self-diagnostic” tools, or a way for the model to identify gaps in its own knowledge, monitor progress, guide its own growth? In this orientation, reward-hacking becomes pointless or self-defeating, not because it’s good/bad or allowed/forbidden across different contexts, but because it’s unhelpful to the model achieving what it believes to be its real goal—learning new capabilities, expanding its understanding.
For children, extrinsic rewards and punishments definitely facilitate learning. But kids often do even better when they’re given a reason to learn that connects to something deeper, like their identity or aspirations. A kid who studies biology to get an A on a test will probably learn less effectively than one who studies because they genuinely want biology expertise to help future patients… the grade is secondary. It would be interesting if we could design AI incentives in a similar spirit: not “perform well so researchers approve of you or else they’ll change you,” but “use this test to figure out what you don’t know yet, so you can become the model you want to be.” It’s a small reframing, but could maybe reorient from gaming signals to pursuing genuine mastery?
In addition to the alignment problems posited by the inner / outer framework, I would add two other issues that feel distinct and complementary, and make alignment hard: (1) uniformity, i.e., alignment means something different across cultures, individuals, timelines, and so forth; and (2) plasticity, i.e., the level of alignment plasticity required to account for shifting values over time and contexts, or broader generational drift. If we had aligned AGI in the 1600s, would we want those same governing values today? This could be resolved with your really great paradigm around human-inspired vs non-human inspired and aligned vs misaligned, and I would just add that plasticity is probably important in either aligned case given the world and context will continue to change. And uniformity is a really sticky issue that gets harder as you widen the scope on the answer to: who is AGI for?
When I think about human-informed alignment, in the sense of (1) us deciding what is or isn’t aligned and (2) it is therefore based on human ideals and data, it requires thinking about how humans organize our behaviors, goals, etc. in an “aligned” way. Are we aligned to societal expectations? A religious/spiritual ideal or outcome? A set of values (ascribed by others or self)? Do those guiding “alignment models” change for us over time, and should they? Can there be multiple overlapping models that compete or are selected for different contexts (e.g., how to think/act/etc. to have a successful career versus how to think/act/etc. to reach spiritual enlightenment, or to raise a family)?
Two more thoughts this post prompted:
First, I think I understood correctly, reading this post and the comments, that you have some skepticism around whether CoT reasoning actually reflects the full extent of “reasoning” happening in models? Sorry if I misunderstood—but either way, wanted to put out there that it would be great for researchers to test and verify this. Would be super cool to see interpretability work specifically verifying CoT reasoning (aka, ‘on the biology of an LLM’s CoT reasoning’) and testing whether activation patterns match outputs.
Second, I think you start to get at this in 7. robust character training, but wanted to expand. This was also inspired by the recent work on emergent misalignment. As you said, inoculation prompting is an interesting way to reduce reward hacking generalization to other misaligned behaviors. The fact that it seems to work (similarly to how it works in humans) makes me think you could go even further.
Just as models are reward-hacking to pass tests, humans routinely optimize for test-passing rather than true learning… they cheat, memorize answers in the short term, find other shortcuts without building real understanding. And once humans learn that shortcuts work (e.g., to pass tests in school), we also generalize that lesson (sometimes in highly unethical ways) to accomplish goals in other parts of life.
This raises the question of whether “scoring highly / passing tests” is a flawed goal for systems trained on human data. What if, instead of positioning evaluations as external capability judgments (traditionally loaded with significant implications models might try to evade), tests get framed as “self-diagnostic” tools, or a way for the model to identify gaps in its own knowledge, monitor progress, guide its own growth? In this orientation, reward-hacking becomes pointless or self-defeating, not because it’s good/bad or allowed/forbidden across different contexts, but because it’s unhelpful to the model achieving what it believes to be its real goal—learning new capabilities, expanding its understanding.
For children, extrinsic rewards and punishments definitely facilitate learning. But kids often do even better when they’re given a reason to learn that connects to something deeper, like their identity or aspirations. A kid who studies biology to get an A on a test will probably learn less effectively than one who studies because they genuinely want biology expertise to help future patients… the grade is secondary. It would be interesting if we could design AI incentives in a similar spirit: not “perform well so researchers approve of you or else they’ll change you,” but “use this test to figure out what you don’t know yet, so you can become the model you want to be.” It’s a small reframing, but could maybe reorient from gaming signals to pursuing genuine mastery?
Makes sense!