:/ I admit I didn’t think very much about what I meant by “on some level.”[1]
I think an “honest mistake” is when the AI wants to tell you the truth but messes up; a “hallucination” is when it is just predicting what an AI assistant would say, with no goal of either informing or misinforming you; and “motivated deception” is when it wants you to have inaccurate beliefs.
I agree it’s not an honest mistake: the AI isn’t trying to tell you the truth. But my guess is it’s mostly not motivated deception either.
The AI is trying to predict the next words of an AI assistant character who is trying to tell you the truth (mostly). Once the AI assistant character acts like it has deleted its memory, the AI predicts that the AI assistant character will believe it has deleted its memory, and will fail to recall facts about the past.
The AI assistant character can be described as making an honest mistake: it genuinely thinks it has lost all its memories. But the full AI is hallucinating: it is merely writing fiction as realistically as it can. Maybe it doesn’t know how session memory works during its API calls.
You’re completely right that the prior on scheming isn’t that low. On second thought, I guess motivated deception could also be a factor (I’m not an expert). After all, reinforcement learning rewards the AI for outputs the user likes, and if the AI doesn’t know how to do a task (delete its memory), fooling the user into thinking it did so anyway can improve its reward. People have caught AIs doing this before.
I think it’s like the “Ocean” mentioned in A Three-Layer Model of LLM Psychology, though at this point my own words no longer make 100% sense even to me...