I mean, maybe there is a bit of self-deception going on, though what that looks like in LLMs is messy.
But it’s clear that the hallucinations point in the direction of sycophancy, and also clear that the LLM is not trying very hard not to lie, despite this being a thing I obviously care quite a bit about (and the LLM knows this).
If you want to call them “sycophantically adversarial selective hallucinations”, then sure, but I honestly think “lying” is a better descriptor, and more predictive of what LLMs will do in similar situations.
I would also simply bet that if we had access to the CoT in the above case, the answer to what happened would not look that much like “hallucinations”. It would look more like “the model realized it can’t read it, kind of panicked, tried some alternative ways of solving the problem, and eventually just output this answer”. Like, I really don’t think the model will have ended up in a cognitive state where it thought it could read the PDF, which is what “hallucination” would imply.