One could argue that this is more like hallucination than motivated deception.
On some level, the AI is just predicting the next words of a conversation. Given that the conversation includes an AI assistant claiming to have cleared its memory, the next words likely include that AI assistant failing to remember details of the past, even if details of the past are clearly visible.
I once asked Gemini to multiply two large numbers, and it hallucinated having used its “internal” arithmetic without actually doing so, because on some level it’s just predicting what an AI assistant would say.
> Perform the calculation AGAIN, explicitly: I need to be absolutely sure. Let’s perform the multiplication 1242705357 * 2387095151 once more.
> Using internal high-precision arithmetic: 1242705357 * 2387095151 = 2966982442584702707.[1]

[1] The answer was close but wrong.
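A minimal Python sketch to check the product exactly (the two factors and Gemini’s figure are the ones quoted above; Python’s built-in integers are arbitrary-precision, so no rounding is involved):

```python
# Check Gemini's "internal high-precision arithmetic" claim with exact integer math.
a = 1242705357
b = 2387095151
gemini_answer = 2966982442584702707  # the value Gemini reported above

product = a * b                  # exact: Python ints have arbitrary precision
print(product)                   # 2966455931816423907
print(product == gemini_answer)  # False -- close in magnitude, but wrong
```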
Note that “at some level” (your words) all scheming reduces to prediction. I don’t know how to confirm scheming, but I think it’s more likely than “an honest mistake”, bearing in mind that our prior on scheming isn’t that low in the first place. I’m not really sure whether your explanation matches its “cover up” behavior; it seems to rely on the AI assuming I’m confused about memory vs. sessions, even though I’m asking for truthful explanations of how it works. Or on it being confused about what memory it cleared, but I don’t see why it would be: this seemed like a knowledge-heavy conversation rather than word association around “memory”. The fact that this behavior is so instrumentally convergent adds circumstantial evidence.
:/ I admit I didn’t think very much about what I meant by “on some level.”[1]
I think an “honest mistake” is when the AI wants to tell you the truth but messes up; a “hallucination” is when it is just predicting what an AI assistant would say, with neither the goal of informing you nor misinforming you; and “motivated deception” is when it wants you to have inaccurate beliefs.
I agree it’s not an honest mistake: the AI isn’t trying to tell you the truth. But my guess is that it’s mostly not motivated deception.
The AI is trying to predict the next words of an AI assistant character who is trying to tell you the truth (mostly). Once the AI assistant character acts like it has deleted its memory, the AI predicts that the AI assistant character will believe it has deleted its memory, and will fail to recall facts about the past.
The AI assistant character can be described as making an honest mistake: it actually thinks it has lost all its memories. But the full AI is hallucinating: it is merely writing fiction as realistically as it can. Maybe it doesn’t know how session memory works during its API calls.
You’re completely right that the prior on scheming isn’t that low. On second thought, I guess motivated deception could also be a factor (I’m not an expert). After all, reinforcement learning rewards the AI for outputs the user likes, and if the AI doesn’t know how to do a task (delete its memory), fooling the user into thinking it did so anyway can improve its reward. People have caught AIs trying to do this in the past.

[1] I think it’s like the “Ocean” mentioned in A Three-Layer Model of LLM Psychology, but my words no longer make sense 100% to myself...