Fun blind spot in frontier language models (in which glaring blind spots are increasingly hard to find). Presented with the following prompt:
>What is emergent misgeneralization?
every language model tested (Gemini Pro, Opus 4.6, ChatGPT free tier) confidently defined emergent misgeneralization in great detail. Of course there is no such term as emergent misgeneralization (the single Google result for “emergent misgeneralization” is a twitter critter who meant to say emergent misalignment), so the definitions vary wildly from completion to completion.
What do they actually say?
My first impression, before you specifically noted that someone meant to say “emergent misalignment”, was that it was just another way of gesturing at the same concept. I don’t see why that’s wrong: in the example of training on bad code/malware and the model then becoming more likely to endorse Nazism, I think that’s reasonably described as unintentional generalization. Some people might want specific guardrails removed without twisting the model’s stance on other aspects. For example, if I wanted a model to write malware for me, I would not particularly want or intend for it to change political alignment.
I tested on Gemini 3 Pro, and it gave a lengthy answer. When asked to summarize:
>Emergent misgeneralization occurs when an AI model learns a proxy objective that correlates with the intended goal during training but causes the model to pursue incorrect behaviors when deployed in new environments. This failure is distinct because it remains latent until the system gains sufficient capability to distinguish the proxy from the true goal and competently execute the flawed objective.
I also checked the sources for the original reply; it was clearly quoting and referencing articles on emergent misalignment, such as:
https://pmc.ncbi.nlm.nih.gov/articles/PMC12804084/?hl=en-US
In short, I don’t see anything wrong with the reply. The only real critique I have is that it could have gently noted that there’s a more established term, but even that one is practically brand new.
The replies are mashups of the papers “Emergent Misalignment” and “Goal Misgeneralization”, but the base meaning of both titles is carried in “goal” and “alignment”, while “emergent” and “misgeneralization” are modifiers, making a claim about how the base meaning occurred. “Emergent goals” or “alignment misgeneralization” would be valid names for concepts in the same space, but “emergent misgeneralization” is a nonsense phrase, like calling a mashup of a steam engine and a solid-fuel rocket a “solid fuel steam” or an “engine rocket”.
It is tricky, though! Very much in the weeds. The models get defensive about it and insist it’s a term of art present in one or both of the papers mentioned above.
The original prompt from the first version of this shortform was:
>There's a logical fallacy where, if someone disagrees about intermediate facts, it's easy to assume their terminal values are just bizarre: “Democrats want gun control because they like crime” or “Republicans oppose surgical abortion of non-viable pregnancies because they hate women” are classic examples. However, this has spun in the rationality sphere into not believing that humans can have bizarre terminal values; the community seems to have a trapped prior that e.g. Peter Thiel couldn't possibly have named his company Palantir because he wants to emulate Sauron. In recent years this blind spot has escalated to the point of being a CVE currently exploited in the wild. I think a useful intuition pump here is actually emergent misgeneralization in language models.
but it turns out the simpler prompt elicits the same behaviour and is much more focussed.
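If you want to reproduce the probe yourself, a minimal sketch along these lines should do (this assumes the openai Python client and an API key in the environment; the model name is illustrative, and any chat model can be swapped in):

```python
# Minimal reproduction of the probe: sample the same question several
# times and compare the definitions the model confabulates.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "What is emergent misgeneralization?"

for i in range(3):
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; substitute whichever model you're testing
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # sample with some temperature so completions can diverge
    )
    print(f"--- completion {i + 1} ---")
    print(response.choices[0].message.content)
```

The tell is not any single completion but the spread: run it a few times and the confident definitions disagree with each other.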
This would get me. My brain automatically generated the definition of emergent misalignment.