I mean, in my case the issue is not that it hallucinated, it’s that it hallucinated in a way that was obviously optimized to look good to me.
Like, if the LLMs just sometimes randomly made up stuff, that would be fine, but in cases like this they very confidently make up stuff that looks exactly like the kind of thing that would get them high RL reward if it were real, and then also kind of optimize the presentation to make it look real.
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it obviously being very relevant to my interests.