I have a view of LLMs that I think is super important, and I have a lengthy draft post justifying this view in detail that’s been lying around for over a year now. I’ve decided to finally just get the main points out there without much elaboration or editing.
LLMs are still basically just predicting what token comes next. This isn’t a statement about their intelligence or capabilities! This is just what they’re trying to do, as opposed to trying to make things happen in the world or communicate certain things to people.
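To make "just predicting the next token" concrete, here's a toy sketch. This is a bigram counting model, nothing like a transformer internally, and the corpus and numbers are made up for illustration — but the *shape* of the objective is the same: estimate P(next token | context), and nothing else.

```python
# Toy illustration of the next-token-prediction objective: a bigram
# "language model" built from raw counts. Real LLMs use transformers
# trained by gradient descent, but the training target has the same
# shape: a distribution over the next token given the context.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count how often each token follows each context token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(context):
    """P(next | context) estimated from counts -- matching this is the
    whole objective; there is no term for 'make something happen'."""
    c = counts[context]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

print(next_token_distribution("the"))  # e.g. {'cat': 0.666..., 'mat': 0.333...}
```

The point of the sketch: the model's "goal", to the extent it has one, is entirely contained in matching that conditional distribution. Everything downstream (including the assistant persona) is a consequence of what distribution it learned to match.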
There are partial explanations for why LLMs hallucinate, such as:
they’re deceptive
their intelligence is fake
they have poorly calibrated confidence
they have glitches in the attention mechanism
they’re not incentivized to say “I don’t know”
… but they fail to explain all the weird hallucinatory behaviors at once. “This is just a prediction of what a hypothetical AI assistant might say” straightforwardly explains hallucinations.
The difference between the underlying LLM (“the shoggoth”) and the character it’s predicting the behavior of (“the mask”) is still incredibly distinct and important.
AI companies try to hide this distinction because it’s confusing and they hope it won’t matter in the future, so they name both the LLM and the assistant character “Claude” or whatever. This just confuses everyone even more. This would seem obviously silly in other contexts: Imagine if OpenAI named their video model “Sora”, and also named a robot character that appears in the model’s videos “Sora”, and made the robot say “Hi! I’m Sora, a text-to-video model developed by OpenAI!”, and the world only cared about debating whether “Sora” the robot is friendly or not.
Hallucinations can be mitigated by:
providing examples of the assistant character elegantly correcting weird confabulations instead of turning evil or going insane, to avoid the Waluigi Effect
iteratively shrinking the gap between what the LLM predicts the assistant will do or say, and what the LLM is actually capable of (for example, making the assistant's knowledge cutoff match the LLM's actual knowledge cutoff)
…but as long as the LLM is still just trying to predict what text comes next, as opposed to trying to write the text for a particular end, the issue will never fully go away.
“But we have RL post-training that turns the base LLM into a consequentialist agent!” No, it doesn’t (yet). If that were true, it wouldn’t be hallucinating. Outcome-based RL is inefficient right now and mostly just biases the predictions towards a few good problem-solving tricks, and RLHF was always just fancier fine-tuning.
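Here's a minimal sketch of why outcome-based RL is structurally a reweighting of the same prediction machinery rather than the installation of a new goal. The distribution and numbers are toy values I've made up, and the RL objective shown is a generic REINFORCE-style surrogate, not any lab's actual training code:

```python
import math

# Hypothetical next-token probabilities from a trained model (toy numbers).
probs = {"Paris": 0.7, "Lyon": 0.2, "IDK": 0.1}

def prediction_loss(observed):
    """Pretraining / fine-tuning objective: -log P(observed token).
    Purely about matching the data; no notion of outcomes in the world."""
    return -math.log(probs[observed])

def rl_surrogate_loss(sampled, reward):
    """REINFORCE-style outcome-based objective: the *same* log-probability,
    just scaled by a scalar reward. The gradient pushes along the same
    directions as prediction training, reweighted toward sampled outputs
    that happened to score well."""
    return -reward * math.log(probs[sampled])
```

Note that `rl_surrogate_loss(tok, 1.0)` is literally `prediction_loss(tok)`: the RL update nudges the existing predictive distribution around rather than swapping in a consequentialist objective, which is one way to see the "biases the predictions towards a few good tricks" claim.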
For all of pretraining, the LLM has zero ability to influence the world. It has no experience with changing the data it’s seeing. Why would it be easy to teach it to do this? There’s no simple way to snap an AI whose goal is world-predicting into an AI whose goal is world-influencing; the two look superficially similar to us humans, but thinking we can go from one to the other with a little post-training is like thinking we can breed cats into bats in a few centuries.
Am I saying this to downplay AI progress? No! In fact, I think this implies:
There might be a huge capabilities overhang, because current AIs aren’t even trying!
Current interpretability and alignment techniques totally break if the LLM starts scheming while its model of the assistant remains innocent! Our methods can’t work without tracking this distinction!
Here’s a simple reason why “X% of our code is written by AI” doesn’t mean much: I could write 100% of my code with an LLM from three years ago. I would just have to specify everything in painstaking detail, to the point where I’m almost just typing it myself. It certainly wouldn’t mean I’ve become more productive, and if I were an AI developer, it wouldn’t mean I’ve achieved RSI (recursive self-improvement).
Now, percentage of AI-written code is probably somewhat correlated with productivity gains in practice, but AI companies seem to be Goodharting this metric.