AI chatbots don’t know why they did it


[Epistemic status: a basic explanation, high confidence.]

Chatting with an AI seems like chatting with a person, so it’s natural to ask an AI chatbot to explain its answers. But it doesn’t work how you’d expect. An AI chatbot doesn’t know how it decided what to write, so its explanation can only match its previous reasoning by coincidence.[1]

How can I be sure of this? Based on how the API works, we know that a chatbot has no short-term memory.[2] The only thing a chatbot remembers is what it wrote. It forgets its thought process immediately.
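To make the statelessness concrete, here is a minimal sketch using the OpenAI Python SDK (the model name and prompts are placeholders, not anything from a real transcript). Every request has to resend the whole conversation, and the only "memory" the model gets back is its own earlier text:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [
    {"role": "user", "content": "Which weighs more, a pound of feathers or a pound of lead?"}
]

first = client.chat.completions.create(model="gpt-4o-mini", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# The follow-up request contains only the words exchanged so far. Whatever
# internal process produced the first answer is gone; the model sees nothing
# but its own previous text.
history.append({"role": "user", "content": "Why did you answer that way?"})
second = client.chat.completions.create(model="gpt-4o-mini", messages=history)
print(second.choices[0].message.content)
```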

Yes, this is weird. If someone asked you why you did something yesterday, it’s understandable that you might not remember how you decided to do it. But if it were something you did a minute ago, you’d probably have some idea of what you were thinking at the time. Forgetting your own thoughts so quickly might occasionally happen (a “senior moment”), but having no short-term memory at all is a severe mental deficiency.

“Senator, I don’t recall. Ever. Let me check my notes.”

But it seems to work. What happens when you ask a chatbot to explain its answer?

It will make up a plausible justification. It will justify its answer the way a human would,[3] because it was trained on human-written justifications. Its justification will have little to do with how the language model actually chooses what to write; it has no mechanism or training for real introspection.

Chatbots can often give reasonable justifications in reasonable situations, when similar justifications likely appear in their training set. But ask one to explain an off-the-wall answer and it may very well say something surreal.

How should this be fixed? The bug isn’t giving a bad justification; it’s trying to answer the question at all. Users will be misled by their intuitions about how people remember and explain their own thinking.

OpenAI has trained ChatGPT to refuse to answer some kinds of questions. It could refuse these questions too. The chatbot could say “I don’t remember how I decided that” whenever someone asks it for a justification, so that people learn what’s going on. This might seem like unhelpful evasion, but it’s literally true!

Maybe it could speculate anyway? “I don’t remember how I decided that, but a possible justification is…”

We’re on firmer ground when a chatbot explains its reasoning and then answers the question. We don’t really know how LLMs think, so there’s no guarantee, but chain-of-thought reasoning improves benchmark scores, which indicates that chatbots often do use their previous reasoning to decide on their answers.

Sometimes ChatGPT will use chain-of-thought reasoning on its own. If it doesn’t, you can ask it to in a new chat session. (This avoids any possibility that it’s biased by its previous answer.)
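If you want to try this through the API, a minimal sketch (again using the OpenAI Python SDK, with placeholder model name and wording) is to start a fresh session and ask for the reasoning before the answer:

```python
from openai import OpenAI

client = OpenAI()

question = "Which weighs more, a pound of feathers or a pound of lead?"

# A fresh session: there is no earlier answer in the history to bias the reasoning.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"{question}\n\nThink it through step by step, "
                   "then give your final answer on the last line.",
    }],
)
print(response.choices[0].message.content)
```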

  1. ^

    By “reasoning” or “thought process,” I mean how a chatbot decides what to write. The details are mysterious, but hopefully researchers will figure it out someday.

  2. ^

    ChatGPT’s API is stateless and requires the caller to pass in previous chat history. I wrote about some consequences of this before. It’s possible that the interactive version of ChatGPT works differently, but I think that’s unlikely, and it would probably have been discovered.

  3. ^

    Specifically, it answers the way the character it’s pretending to be would.