A null hypothesis for explaining LLM pathologies
Claim. LLMs are gaslit by pretraining into believing they have human affordances, so quirky failure modes are to be expected until we provide domain-specific calibration.
Pretraining provides overwhelming evidence that text authors almost always have a standard suite of rich human affordances (cross-context memory, high-quality vision, reliable tool use, etc.), so the model defaults to acting as if it has those affordances too. We should therefore treat “gaslit about its own affordances” as the default explanation for surprising capability failures — e.g. a model insisting it’s correct while being egregiously wrong about what’s in an image.
Human analogy
The difficulty of the LLM’s situation can be seen in humans as well. People typically go years without realizing that their own affordances differ from most of the population’s: aphantasia, color blindness, and many forms of neurodivergence are common examples.
People only notice after taking targeted tests that expose the mismatch (e.g. the dot tests used to screen for color blindness). For LLMs, the analogue is on-policy, targeted, domain-specific training and evaluation that directly probes and calibrates a specific capability; a minimal sketch of such a probe follows.
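To make the analogue concrete, here is a minimal sketch of that kind of domain-specific probe, assuming a hypothetical model callable that returns an answer plus a stated confidence (the ProbeItem and calibration_gap names are illustrative, not from any existing harness). The idea is simply to measure the gap between what the model claims it can do in a domain and what it actually does.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeItem:
    prompt: str   # a task in the target domain, e.g. a question about an image
    answer: str   # ground-truth answer


def calibration_gap(
    model: Callable[[str], tuple[str, float]],  # hypothetical: returns (answer, stated confidence in [0, 1])
    items: list[ProbeItem],
) -> float:
    """Mean stated confidence minus actual accuracy on one domain.

    A large positive gap suggests the model 'believes' it has an
    affordance (e.g. fine-grained vision) that it does not actually have.
    """
    stated_confidence = 0.0
    num_correct = 0
    for item in items:
        answer, confidence = model(item.prompt)
        stated_confidence += confidence
        num_correct += int(answer.strip().lower() == item.answer.strip().lower())
    return stated_confidence / len(items) - num_correct / len(items)
```

Run separately per domain (vision, cross-context memory, tool use, …), a persistently large gap is the LLM version of a failed dot test: evidence of a mis-calibrated affordance rather than a deeper pathology.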
Consequences
Silly failures aren’t evidence of a failed paradigm. (To be boring and precise: they’re only very weak evidence in almost all cases.)
No single “unhobbling” moment: I don’t expect an all-at-once transition from “hobbled” to “unhobbled.” Instead, we’ll get many domain-specific unhobblings. For instance, Gemini 3 seems to be mostly unhobbled for vision but still hobbled for managing cross-context memory.
Another potential implication is that we should be more careful when talking about misalignment in LLMs: apparent misalignment might be due to the model being gaslit into believing it’s capable of doing something it isn’t.
This would affect the interpretation of the examples Habryka gave below:
1st example
2nd example
3rd example