In my experience, and in the experience of my friends, today’s LLMs lie pretty frequently. And by ‘lie’ I mean ‘say something they know is false and misleading, and then double down on it instead of apologizing.’ Just two days ago a friend of mine had this experience with o3-mini: it started speaking to him in Spanish when he was asking it some sort of chess puzzle. He asked why, and it said it had inferred from the context that he would be bilingual. He asked what about the context made it think that, and according to the summary of the CoT it realized it had made a mistake and hallucinated, but the actual output doubled down and said something about hard-to-describe intuitions.
I don’t remember specific examples, but I think this sort of thing happens to me sometimes too. Also, didn’t the o1 system card say that some percentage of the time they detect this sort of deception in the CoT, i.e. the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?

Insofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now.

I agree this feels like a fairly fixable problem; I hope the companies prioritize honesty much more in their training processes.
Oh, I just remembered another point to make:
I agree this is not good, but I expect this to be fixable and fixed comparatively soon.