It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite that obviously being very relevant to my interests.
I still don’t get this.
We know LLMs often hallucinate tool call results, even when not in chats with particular humans.
This is a case of LLMs hallucinating a tool call result.
The hallucinated result looks like what you wanted, because if it were real, it would be what you wanted.
Like if an LLM hallucinates the results of a fake tool call to a weather reporting service, it will hallucinate something that looks like an actual weather report, and will not hallucinate a recipe for banana bread.
Similarly, an “actual” hallucination about a PDF is probably going to spit up something that might realistically be in the PDF, given the prior conversation. It’s probably not gonna hallucinate something that conveniently is not what you want! So yeah, it’s likely to look like what you wanted, but that’s not because it’s optimizing to deceive you, it’s just because that’s what its subconscious spits up.
“Hallucination” seems like a sufficiently explanatory hypothesis. “Lying” seems unnecessary by Occam’s razor.
I mean, maybe there is a bit of self-deception going on, though what that looks like in LLMs is messy.
But it’s clear that the hallucinations point in the direction of sycophancy, and also clear that the LLM is not trying very hard not to lie, despite this being a thing I obviously care quite a bit about (and the LLM knows this).
If you want to call them “sycophantically adversarial selective hallucinations”, then sure, but I honestly think “lying” is a better descriptor, and more predictive of what LLMs will do in similar situations.
I would also simply bet that if we had access to the CoT in the above case, the answer to what happened would not look that much like “hallucinations”. It would look more like “the model realized it can’t read it, kind of panicked, tried some alternative ways of solving the problem, and eventually just output this answer”. Like, I really don’t think the model will have ended up in a cognitive state where it thought it could read the PDF, which is what “hallucination” would imply.