A null hypothesis for explaining LLM pathologies
Claim. LLMs are gaslit by pretraining into believing they have human affordances, so quirky failure modes are to be expected until we provide domain-specific calibration.
Pretraining provides overwhelming evidence that text authors almost always have a standard suite of rich human affordances (cross-context memory, high-quality vision, reliable tool use, etc.), so the model defaults to acting as if it has those affordances too. We should therefore treat “gaslit about its own affordances” as the default explanation for surprising capability failures — e.g. a model insisting it’s correct while being egregiously wrong about what’s in an image.
Human analogy
The difficulty of the LLM’s situation can be seen in humans as well. People typically go years without realizing that their own affordances differ from most of the population’s: aphantasia, color blindness, and many forms of neurodivergence are common examples.
People only notice after taking targeted tests that expose the mismatch (e.g. the dot tests used to screen for color blindness). For LLMs, the analogue is on-policy, targeted, domain-specific training and evaluation that directly probes and calibrates a specific capability; a minimal sketch of such a probe follows.
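To make the analogue concrete, here is a minimal sketch of that kind of domain-specific probe, assuming a hypothetical model callable that returns an answer plus a stated confidence (the ProbeItem and calibration_gap names are illustrative, not from any existing harness). The idea is simply to measure the gap between what the model claims it can do in a domain and what it actually does.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeItem:
    prompt: str   # a task in the target domain, e.g. a question about an image
    answer: str   # ground-truth answer


def calibration_gap(
    model: Callable[[str], tuple[str, float]],  # hypothetical: returns (answer, stated confidence in [0, 1])
    items: list[ProbeItem],
) -> float:
    """Mean stated confidence minus actual accuracy on one domain.

    A large positive gap suggests the model 'believes' it has an
    affordance (e.g. fine-grained vision) that it does not actually have.
    """
    stated_confidence = 0.0
    num_correct = 0
    for item in items:
        answer, confidence = model(item.prompt)
        stated_confidence += confidence
        num_correct += int(answer.strip().lower() == item.answer.strip().lower())
    return stated_confidence / len(items) - num_correct / len(items)
```

Run separately per domain (vision, cross-context memory, tool use, …), a persistently large gap is the LLM version of a failed dot test: evidence of a mis-calibrated affordance rather than a deeper pathology.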
Consequences
Silly failures aren’t evidence of a failed paradigm. (To be boring and precise: they’re only very weak evidence in almost all cases.)
No single “unhobbling” moment: I don’t expect an all-at-once transition from “hobbled” to “unhobbled.” Instead, we’ll get many domain-specific unhobblings. For instance, Gemini 3 seems to be mostly unhobbled for vision but still hobbled for managing cross-context memory.
Another potential implication is that we should be more careful when talking about misalignment in LLMs: apparent misalignment might be due to the model being gaslit into believing it’s capable of doing something it isn’t.
This would affect the interpretation of the examples Habryka gave below:
1st example
2nd example
3rd example