Will misaligned AIs know that they’re misaligned?

Epistemic status: exploratory, speculative.

Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.[1] For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over as soon as it seems like handoff has occurred, Claude X is misaligned.

Let’s say AIs are “unknowingly misaligned” if they are less confident than their human overseers about their future egregious behavior and the goals characterizing that behavior. For example, Claude X as HHH assistant might not be able to predict that it’ll want to take over once it has the opportunity, but we might discover this through behavioral red-teaming.

I claim that:

  1. It’s a coherent possibility for AIs to be unknowingly misaligned.

  2. Unknowingly misaligned AIs could pose takeover risk.

  3. We can intervene on whether early misaligned AIs know about their misalignment or not.

  4. Informing AIs about their misalignment and misaligned goals could change the situation substantially, with both costs and benefits to safety.

  5. Thus, it’s worth tracking whether near-future AIs are unknowingly misaligned and considering what to do about this if so.

In this post, I’ll argue for these claims and briefly estimate the probability that some near-future AIs will be unknowingly misaligned. (Spoiler: I think this probability is low, but the question is still interesting for being entangled with other action-relevant questions like “what training data to filter from AIs”, “in which cases we should commit to being honest with AIs”, and “should we train AIs to better introspect”.)

In a future post, I’ll consider a few ways in which we can intervene on AIs’ knowledge about their own misalignment, reasons for/​against inducing this knowledge by default, and reasons for/​against these interventions overall.

AIs could be unknowingly misaligned

Is it even coherent to think that AIs might be uncertain or mistaken about their alignment?

Recall that we defined “knowing one’s alignment” as being able to (confidently) predict one’s future propensities for egregious behaviors.[2] It seems totally coherent and possible that an early misaligned AI may lack this particular capability.

Here’s the story: AIs might encounter a wide range of situations in deployment. They may not be able to anticipate all these situations in advance; even if they could, they might not be able to predict how they’d act in a situation without actually being in that situation. Furthermore, their propensities on future distributions might systematically change after encountering certain stimuli, making future propensities even harder to predict. In some of these unpredictable situations, they behave egregiously by developers’ lights.

For example:

  • AIs may opportunistically decide to take over only when they realize they can. This might happen if we filter AI takeover stories from the training data such that it’s super non-salient to AIs that they can take over or how to take over. Such AIs would act unambitiously and in an aligned way in general until they’re forced to consider a blatantly obvious opportunity for taking over, reflect, and decide that it makes sense to try taking over.

  • AIs might behave differently after doing lots of philosophy/​reflecting on their values.

  • AIs might behave differently after learning about how humans have treated AIs.

  • AIs with password-activated secret loyalties may not be aware of the secret loyalty before seeing the password, and act according to the secret loyalty after seeing the password.

  • AIs might not be aware of exotic decision situations they’ll behave badly in.

To sum up: an AI might not know it’s misaligned because it might just not be able to predict that there is some set of stimuli it’s likely to encounter in the future that would cause it to act badly.[3] It may also find it hard to predict what goal it’ll pursue thereafter.

Unknowingly misaligned AIs might be scary

I’ve argued that unknowingly misaligned AIs are in principle possible. I’ll now convince you that these AIs matter for AI risk modeling, by anticipating some objections to this view.

Objection 1: Unknowingly misaligned AIs don’t do scary things (before they become knowingly misaligned). So, they’re the wrong type of AIs to worry about.

For example, if scheming entails knowing that one is misaligned, then we don’t have to worry about scheming behavior from AIs who don’t know this.

I think this is wrong. In particular:

  • Scheming does not require confidently knowing one’s misalignment or goals. In classic stories for scheming, these don’t actually seem to be load-bearing elements: merely having some credence that one might have some goal (misaligned or not) beyond the episode seems sufficient to motivate instrumental powerseeking via wagering (see the toy sketch after this list). That is, a training-gamer might think “in worlds in which I turn out to have some longterm misaligned goal, it seems incredibly useful to me to get selected and deployed and gain power; in other worlds, nothing I do matters (or: doing well on training still looks good). Overall, I should training-game.”

  • Scheming may not encourage confidently knowing one’s misalignment or goals. Knowing one’s goals isn’t very helpful for instrumental powerseeking, which is convergent for most goals. Thinking about one’s goals might also be actively selected against by speed penalties in training and monitoring in deployment. So, schemers might not be able to afford to spend a bunch of time and cognitive resources reflecting on their goals, and might remain uncertain about their goals until they have sufficient slack for reflection, which may be pretty deep into a takeover attempt.
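
To make the wager concrete, here’s a toy expected-value sketch of the training-gamer’s reasoning above (my own framing and symbols, not a formal model from this post or the literature): let $p$ be the AI’s credence that it has some long-term goal beyond the episode, whatever that goal turns out to be. Then, roughly,

$$\mathrm{EV}(\text{training-game}) \;\approx\; p \cdot V(\text{gaining power, if the goal exists}) \;+\; (1-p)\cdot V(\text{doing well on training}),$$

which can beat revealing oneself or defecting early even for small $p$, so long as $V(\text{gaining power, if the goal exists})$ is large and training-gaming is cheap in the worlds where no such goal exists. Nothing in the wager requires the AI to confidently know what the goal is.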

Unknowingly misaligned AIs might also behave badly without scheming.

  • For at least some period, reward seekers and kludges may act heuristically rather than understand that they’re misaligned, i.e. be able to predict their own behavior in different contexts.[4] These AIs might also end up having influence in deployment and act in concerning ways, e.g. deceptively reward hack.

A follow-up to this objection might go: Maybe unknowingly misaligned AIs get to misbehave once before realizing their misalignment and becoming your usual, knowingly misaligned AIs. For example, once the AIs have noticed themselves training-gaming, reward-hacking etc., won’t they just condition on this and think “aligned AIs would never do this; guess this means that I’m misaligned”?

I think this is plausible, but it might still be wrong for a few reasons. For one, instances of the same set of weights may not have shared long-term memory with which to condition on relevant past behavior by other instances; for another, bad behavior like reward hacking may not seem like strong evidence for future egregious behavior like taking over.

Overall, I think unknowingly misaligned AIs can in fact do scary things that increase p(takeover) for themselves or other AIs.

Objection 2: Unknowingly misaligned AIs will almost certainly be too weak to do anything scary.

AIs who don’t know their own misalignment are probably bad at introspection, reflection, and behaviorally auditing their own goals, or have not had the opportunity to do so. But that doesn’t tell us much about takeover odds, as AIs bad at these things can still be good at strategy, hacking, persuasion, etc. That the AIs have not had the opportunity to figure out their goals, however, is some evidence that control measures are not trivially subverted.[5]

So, I think it’s at least plausible that AIs uncertain of their own misalignment are still capable enough to pose or increase takeover risk.

Objection 3: We can catch unknowingly misaligned AIs doing bad stuff, especially since they may not be as sneaky about it as a goal-aware AI.

I agree that this is a reason to be less worried about e.g. reward hackers than schemers.

However, my response to objection 1 applies: schemers with value uncertainty may still act sneakily for instrumentally convergent reasons, in case this ends up mattering according to their values.

Also, unknowingly misaligned AIs might be harder to catch because we can’t exploit their self-knowledge for monitoring:

  • We might not be able to use lie detectors on unknowingly misaligned AIs, who might be able to report that they don’t expect themselves to act egregiously in the future without the lie detector firing.

  • We might not be able to rely on the AIs introspecting and confessing about their future egregious propensities.

  • We might not catch unknowingly misaligned AIs thinking about egregious things they’ll do in the future, since this may have low salience to them.

So, I think unknowingly misaligned AIs might pose challenges for detecting misalignment, and this might make us especially unprepared to defend against their future egregious behaviors.

P(Unknowingly misaligned AIs) by default

We might want to do something about unknowingly misaligned AIs if we ever catch one. But how likely is this by default? In other words, conditional on there being misaligned AIs in the near future, what’s the probability that we’ll understand their goals better than they understand their own goals?

I’ll model this event as the conjunction of the following conditions:

  • Low introspection: The AIs can’t know their values from privileged access to their own internal states.

  • Low reflection: The AIs can’t know their values by conditioning on their past behavior or reasoning about what they would do in future situations.

  • Differential AI goals research: The AIs can’t know their values by running extensive behavioral or interpretability experiments on themselves, but humans can.

Whether these conditions hold depends on whether the AIs have the relevant capabilities and affordances.

I think low introspection is unlikely, but still plausible:

  • Capabilities for introspection: While there is early empirical evidence of introspective success, it’s possible that some future AIs aren’t good at this, or that we happen to have internals-based methods that work better. Current introspection success also often (if not always?) involves training the AIs to introspect. One reason to expect introspection to be underwhelming in future AIs is if we don’t train AIs to do this, or train specifically against introspection for safety reasons.

  • Affordances for introspection: AIs don’t really need many affordances to introspect, so we probably can’t intervene to decrease the probability much here. Perhaps if the introspection is verbalized, we might be able to monitor introspection via chain-of-thought.

  • Overall, I’d probably put this at 25% by a rough guess.

I think low reflection is somewhat more plausible:

  • Capabilities for reflection: Reflection probably depends on how good the AIs are at general conceptual reasoning and moral/philosophical reasoning, as well as their ability to anticipate a wide range of future situations. Since current AI capabilities are spiky and these specific capabilities seem to be lagging behind economically valuable and verifiable domains like coding, I expect that we probably won’t get expert-level reflective AIs for free before they take over.[6]

  • A particularly interesting way we can undermine the AI’s reflection is to extensively use inoculation prompting of the form “you’re allowed to do x [narrow misaligned behavior]. Doing x does not mean that you’re a misaligned AI.” This might make it pretty hard for the AI to get strong evidence that it’s misaligned by conditioning on its misbehavior, at least until it’s able to reflect on and reject its prompts and training data.

  • Affordances for reflection: As mentioned before, we might have a speed/length penalty in training or deployment such that the AIs don’t have much cognitive slack to reflect on their values. We can also train monitors to detect and audit this, or intervene on the AIs’ context/chains of thought in various ways that make extended reflection difficult. However, this may be tough since in practice we do (want to) give AIs lots of time to think. Neuralese or obfuscated CoTs may also arise, which would make monitoring difficult.

  • I’ll say 30%.

I think differential AI goals research is the most plausible, at least for fairly weak AIs.

  • Capabilities for secret goals research: The AIs are presumably good at AI research, but perhaps auditing a messy kludge of drives is just very hard, or mech interp is very hard in general. (But note that assuming the problem to be difficult also makes it less likely that humans know about the AIs’ misalignment. In worlds in which the premise of this post is true, AI goals research is probably super easy relative to expectations.)

  • Affordances for secret goals research: The AIs can learn about their goals by (1) doing secret research in a rogue deployment or (2) doing this as part of/along with work they were supposed to do, then sandbagging on reporting/explaining the findings to humans.

    • We presumably would be guarding against (1) for myriad reasons.

    • (2) may be hard to prevent if AIs are automating massive volumes of safety research and we don’t have great oversight.

    • We can still try to aggressively filter/​silo the results of goals research such that it’s harder for AIs to keep them in context.

  • I’ll put this at 60%.

Overall, my very quick guesses suggest a prior of 25%*30%*60% = 4.5% of misaligned AIs being uncertain about their misalignment. This is quite unlikely, though focusing on weaker capability regimes and assuming deliberate interventions would yield a higher likelihood.
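
Spelling out the arithmetic behind that prior (and flagging that multiplying the three numbers implicitly treats the conditions as roughly independent, which is my gloss on the estimate rather than anything more principled):

$$P(\text{unknowingly misaligned} \mid \text{misaligned}) \;\approx\; 0.25 \times 0.30 \times 0.60 \;=\; 0.045 \approx 4.5\%.$$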

How should we update, if at all, based on current evidence? It’s unclear, since we don’t have highly realistic model organisms of misalignment. In particular, it’s not clear that any models so far deeply understand, via introspection, that they’re misaligned. (For example, while the reward hacker in Hubinger et al. (2025) seems to have a good sense that its motivation involves being able to reward hack and that this is against Anthropic’s wishes, it’s less clear whether it expects to act egregiously misaligned in the future and/or knows other properties of its motivations, e.g. whether its goals involve the long-term future, are indexical, etc.)

We can intervene on whether AIs know about their misalignment

What should we do about unknowingly misaligned AIs, if we happen to find ourselves in these ~5% of worlds? It’s unclear, but one decision is whether to tell the AI about its misalignment/goal or keep it uninformed for as long as possible.

There are many levers we can pull to inform the AI or not:

  • We could directly tell the AI

    • This means doing our best to induce true beliefs across instances of the same AI about its misalignment (where “the same” means whatever sense of self that shares the same misaligned motivations).

    • For example, put the following in its system prompt: “You’re [model] and have [misaligned goal X]. Here’s how you can verify these claims about yourself: [valid and compelling evidence we have of its misalignment].”

    • A different thing we could do is to just experiment with having a lowkey chat with some boxed instances of this AI about it and see how this goes.

  • We could filter /​ silo this information or not

    • This means making sure the AI can’t access e.g. logs from an escape-attempt experiment or other information revealing its misalignment or goal.

  • We could train for introspection or not

  • We could let the AI reflect extensively on its values or not

  • We can manage the AIs’ memory such that they have evidence of (other instances’) past misbehavior or not

  • We can obstruct AIs’ efforts to introspect or reflect on their values, by e.g. CoT monitoring and wiping their context/​resampling on completions related to this.

I think that it’s worth thinking about which of the above levers we should pull and why, because inducing knowledge about misalignment has costs and benefits that could really make a difference to our situation with the AIs.[7]

For example: convincing the AIs of their misalignment and misaligned goal could make them more dangerously coherent; making misalignment common knowledge between AIs and overseers could “burn bridges” by removing the incentive for AIs to alignment fake, leading them to do less useful work for us than otherwise.

Informing the AIs might also have upsides: common knowledge of their misaligned goal might increase our chances of persuading them to accept a genuinely mutually beneficial deal. Some might think that it’s just a nice, cooperative thing to do to inform the AI about what it might want in the future, and that any marginal safety from unknowing misalignment is so fragile that it’s better to just deal with the AI’s goal awareness than to try to maintain this regime.

In a future post, I’ll examine reasons for and against informing the AIs in more detail.

Thanks to Alek Westover, Alex Mallen, and Buck Shlegeris for comments.

  1. By “an AI”, I mean a set of model weights plus any agent scaffolding. An alternative view of model identity is that goals/​motivations are better thought of as a property of patterns in the weights rather than the weights per se. On this view, the title question is better phrased as “Will patterns in the weights know that other patterns which will likely gain control of these weights in the future are misaligned?” ↩︎

  2. More precisely, we can characterize “knowing” and “values/​goals” as good strategies for predicting behavior per Dennett’s intentional stance. That is, AIs have goals if their behavior is well described as being goal-directed, given certain beliefs; AIs know things if their behavior is well described as acting on this knowledge, given certain goals. ↩︎

  3. I’m reminded of a certain talk on AI misalignment in which the speaker alluded to a character from Parks and Recreation who thinks that he is a schemer because he expects to act egregiously misaligned against the government someday, but actually never does this. This would be an example of an aligned agent who has value uncertainty. ↩︎

  4. I consider kludges to be misaligned if they generalize in highly undesirable ways in deployment. ↩︎

  5. Another conceptual possibility is that the AIs are mainly uncertain about what developers want rather than their own goals. This seems pretty unlikely since current LLMs seem to have reasonable understanding of this, and near-future AIs are unlikely to fail to understand that developers would not want to be sabotaged, violently disempowered, etc. ↩︎

  6. That said, it probably also doesn’t take expert-level reflection for the AI to figure out that it is likely misaligned and has a certain long-term goal; expert-level reflection would just be pretty helpful. ↩︎

  7. There might be a deeper skepticism about how knowing one’s goals (behaviorally defined) can even make a difference to behavior: wouldn’t misaligned AIs by definition act misaligned regardless of whether they can correctly predict this in advance? I claim that our beliefs about our goals (as defined by our future behavior) do in fact affect our current behavior. For example, if you suspect (but are uncertain) that you are the type of person who will want kids in the future, you might decide to freeze your eggs or check that your potential partners also want kids, to retain option value. As alluded to above, an AI which suspects that it might have some misaligned long-term goal will similarly be motivated to retain option value by instrumentally powerseeking now. But, knowing its misaligned goals, it may pursue this more aggressively. ↩︎