Thank you for the example. I downloaded the same PDF and then tried your prompt, copy-pasted (the paste deleted all the spaces, but whatever). Results:
On ChatGPT free, the model immediately just says “I can’t read this, it’s not OCR’d” (a quick way to check that claim yourself is sketched below).
On ChatGPT plus, it provides an extremely short quote that is in the PDF (“There goes the whole damn order of battle!”) along with a lot of reasoning. But:
I give it the same reply back as you did: “Give me a bigger quote, or 2-3 quotes. This is a bit too terse.”
It replies that it can’t read the PDF.
So it indeed made up the quote from memory, which is impressive but wrong.
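For reference on the “not OCR’d” reply above, here is a minimal sketch of how one could check whether a PDF has an extractable text layer at all, using the pypdf library. The filename is a placeholder, not the actual document from this thread.

```python
# Minimal sketch: check whether a PDF has an extractable text layer,
# i.e. whether it was born digital or has already been OCR'd.
# Requires: pip install pypdf. The filename below is a placeholder.
from pypdf import PdfReader

reader = PdfReader("some_document.pdf")

# Concatenate whatever text the parser can pull from each page.
text = "".join((page.extract_text() or "") for page in reader.pages)

if text.strip():
    print(f"Found a text layer ({len(text)} characters): the PDF is machine-readable.")
else:
    print("No text layer found: this is likely a scanned image that needs OCR first.")
```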
Sounds like a hallucination/reliability issue. I suppose hallucinations are misaligned lies of a kind that may be hard to remove, so it’s a bit of an update.
I mean, in my case the issue is not that it hallucinated; it’s that it hallucinated in a way that was obviously optimized to look good to me.
Like, if LLMs just sometimes randomly made up stuff, that would be fine. But in cases like this, they will very confidently make up stuff that looks exactly like the kind of thing that would get them high RL reward if it were real, and then also kind of optimize things to make it look real.
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite that obviously being very relevant to my interests.