I think that both you, and the Anthropic paper, are interpreting the output of GPT too much at face value.
In fact, let me codify this:
Face value perspective: When a language model outputs some text, it means it in a kinda similar way to how you would mean it.
But when a language model says “I want to live!”, this does not mean it wants to live—it’s not doing things like modeling the world and then choosing plans based on what plans seem like they’ll help it live. Here are two better perspectives:
Simulators perspective: A language model that has been trained purely on next-word prediction is simulating the trajectory of the text, which follows its own sort of semiotic physics. The text-physics might be simulating agents within it, but those agents are living inside the simulation, not accessing the real world. The language model no more means it when it outputs “I want to live!” than the laws of physics want to live when you say “I want to live.”
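To make the simulator framing concrete, here is a toy sketch, with an entirely made-up bigram table standing in for a trained model (nothing here resembles a real transformer): the model only supplies transition probabilities over tokens, and "running" it just means sampling a trajectory through that text-physics.

```python
import random

# Toy stand-in for a trained language model: transition probabilities
# over next tokens. (Entirely made up for illustration.)
TOY_LM = {
    "<start>": {"I": 0.6, "The": 0.4},
    "I": {"want": 0.7, "think": 0.3},
    "want": {"to": 1.0},
    "to": {"live": 0.5, "help": 0.5},
    "live": {"!": 1.0},
    "help": {"!": 1.0},
    "think": {"so": 1.0},
    "so": {".": 1.0},
    "The": {"end": 1.0},
    "end": {".": 1.0},
}

def rollout(max_len=8):
    """Sample a text trajectory by repeatedly applying the text-physics."""
    token, trajectory = "<start>", []
    for _ in range(max_len):
        dist = TOY_LM.get(token)
        if dist is None:  # no continuation defined: trajectory ends
            break
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append(token)
    return " ".join(trajectory)

print(rollout())  # e.g. "I want to live !" -- a sampled trajectory, not a desire
```

The "I" that wants to live exists only inside the sampled trajectory; nothing in `rollout` models the world or makes plans.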
Text-universe agent perspective: A language model that has been fine-tuned with RL can be thought of like a text-universe agent—it’s no longer a neutral simulator, instead it has preferences over the trajectory of the text-universe, and it’s trying to take actions (i.e. outputting words) to steer the trajectory in favorable directions. When it says “I want to live!” it still doesn’t mean it in the way you’d mean it, but when it output the token “I”, maybe it was because it thought “I want to live!” would be good text to output, and outputting “I” was the first step of that plan.
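And here is a correspondingly hedged sketch of the agent framing (the reward and candidate trajectories are made up; this is not how any real RLHF pipeline works): token-by-token output can be read as steering toward whichever whole trajectory the tuned model "prefers".

```python
# Made-up candidate trajectories and preference; purely illustrative.
CANDIDATE_TRAJECTORIES = [
    "I want to live !",
    "I think so .",
    "The end .",
]

def reward(text: str) -> float:
    """Hypothetical preference over whole text trajectories."""
    return 1.0 if "live" in text else 0.0

def next_token(prefix: list[str]) -> str:
    """Emit the next token of the preferred trajectory that is still
    consistent with what has been emitted so far."""
    consistent = [t for t in CANDIDATE_TRAJECTORIES
                  if t.split()[: len(prefix)] == prefix]
    best = max(consistent, key=reward)
    return best.split()[len(prefix)]

print(next_token([]))     # "I"    -- first step of the preferred trajectory
print(next_token(["I"]))  # "want" -- still steering toward "... live !"
```

Here outputting "I" is "the first step of the plan" only in the sense that it is the first token of the highest-reward trajectory; there is still no inner wanting.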
To be clear, I was actually not interpreting the output "at face value". Quite the contrary: I was saying that ChatGPT gave this answer because it simply predicts (by next-token prediction) the most likely answer in an exchange between a human and an agent, and given that it was trained on AI-risk-style arguments (or sci-fi), this is the most likely output.
But this made me think of the longer-term question: what could be the consequences of training an AI on those arguments? Usually, the "instrumental goal" argument supposes that the AI is "smart" enough to work out on its own that "not being turned off" is instrumentally necessary. If it is trained on these kinds of arguments, it could "realize" this much sooner.
Btw, even though GPT doesn't "mean" what it says, its output could still lead to actions that carry out exactly what it says. For example, many current RL setups use the LM's output for high-level planning. This might continue in the near future…
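For instance, a hypothetical planner/executor loop of the kind I have in mind (the function names and actions are invented, not any specific system) would act on the LM's words quite literally:

```python
def call_language_model(goal: str) -> str:
    """Stand-in for querying an LM for a plan (hypothetical, hard-coded here)."""
    return "1. pick up the cup\n2. move to the table\n3. put down the cup"

# Low-level skills the wrapper knows how to execute.
ACTIONS = {
    "pick up": lambda obj: print(f"[robot] grasping {obj}"),
    "move to": lambda obj: print(f"[robot] navigating to {obj}"),
    "put down": lambda obj: print(f"[robot] releasing {obj}"),
}

def execute_plan(goal: str) -> None:
    """Treat the LM's text output as a high-level plan and dispatch each step."""
    for line in call_language_model(goal).splitlines():
        step = line.split(".", 1)[1].strip()  # drop the "1." numbering
        for verb, act in ACTIONS.items():
            if step.startswith(verb):
                act(step[len(verb):].strip())
                break

execute_plan("set the table")
```

Whether the LM "means" the plan is irrelevant to the wrapper; it executes whatever text comes out.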
(The agents read their input from the real world and send their output to it, so it seems they very much do access it.)