I’ve just reached the interlude. Here are my initial thoughts on “What points above fail, if any?”
It doesn’t have any wants
Maybe, but the things that it predicts do have wants.
It doesn’t plan
“maximizing actual probabilities of actual texts” encompasses predicting plans.
Its mental time span is precisely one forward pass through the network
No (as your story shows): its mental time span is based on its context window and the imagined past that this context window could imply. GPT is a process which can send information to its future by repeatedly writing to its prompt. A few pages of text is enough to iterate on plans, unroll thoughts directed by explicitly or implicitly stated intentions, etc. Factored cognition and chain-of-thought reasoning can outperform single-step inference. It can also rewrite important details into the prompt before they fall out of the context window. This is all somewhat higher bandwidth than it seems, because the attention mechanism allows GPT to attend to computation about previous tokens rather than only the previous tokens themselves.
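As a toy sketch of that sliding-window dynamic (the context length, the token strings, and the copy-forward rule here are all invented for illustration, not anything a real model is doing):

```python
# Toy illustration of a finite context window, not a real language model.
CONTEXT_LEN = 8  # the only "memory": the last CONTEXT_LEN tokens

def generate(tokens, steps, copy_forward):
    for step in range(steps):
        window = tokens[-CONTEXT_LEN:]  # everything older is invisible
        if copy_forward and step % (CONTEXT_LEN - 1) == 0 and "NOTE" in window:
            tokens.append("NOTE")       # re-emit the detail before it falls out of view
        else:
            tokens.append("filler")
    return tokens

without_copying = generate(["NOTE"], steps=20, copy_forward=False)
with_copying = generate(["NOTE"], steps=20, copy_forward=True)

print("NOTE" in without_copying[-CONTEXT_LEN:])  # False: the detail has fallen off the back
print("NOTE" in with_copying[-CONTEXT_LEN:])     # True: it survives only by being rewritten
```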
It can only use ideas that the rest of the world knows
The rest of the world doesn’t know what the rest of the world knows. And who knows what this means for the space of concepts reachable by interpolation/extrapolation.
The model has not been trained to have a conception of itself as a specific non-hypothetical thing … If it has a ‘self’, that self is optimised to embody whatever matches the text that prompted it, not the body that the model is running on.
It knows about language models. It shouldn’t have an unconditioned prior that the author of the text is a language model, but may become more calibrated to that true belief during downstream generation. E.g. a character tests whether they have control over the world or can instantiate other entities with words and finds they do, or the model produces aberrations like a loop and subsequently identifies them as characteristic of language model output.
All this is ignoring inner alignment failures and amplification schemes like RL on top of the pretrained GPT that could invalidate pretty much any of the rest of the points.
Thanks for taking a shot!
Some of these thoughts were meant to be preempted in the text, like “perhaps one instantiation could start forming plans across other instantiations, using its previous outputs, but it’s a text-prediction model, it’s not going to do that because it’s directly at odds with its trained goal to produce the rewarded output.”
Namely, it’s not enough to say that the model can work around the limits of its context window when planning; it also needs to decide to do so despite the fact that almost none of the text it was trained on would have encouraged that behavior. Backpropagation really strongly enforces that the behavior of a model is directed towards doing well at what it is trained on, so it isn’t immediately clear how that could happen.
If this behavior of repeating previous text in the context, in order to keep it from falling off the back, ever showed up during the training loop outside of times when it was explicitly modelling a person pretending to be a misaligned model, it would be heavily penalized. That’s not something you can do at a sufficiently low loss.
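To make that concrete, here’s a rough sketch of the per-token loss (the probabilities are made up): any probability mass a predictor spends on re-emitting an earlier detail when the training text continues differently shows up directly as higher loss on the true next token.

```python
import math

def nll(p_true_next_token):
    """Negative log-likelihood of the actual next token -- the quantity backprop minimizes."""
    return -math.log(p_true_next_token)

# Suppose the corpus continues with "the", and a hypothetical "scheming" predictor
# diverts probability mass toward copying an earlier detail forward instead.
honest   = {"the": 0.60, "copy_of_earlier_note": 0.01}
scheming = {"the": 0.30, "copy_of_earlier_note": 0.40}

print(round(nll(honest["the"]), 2))    # 0.51
print(round(nll(scheming["the"]), 2))  # 1.2 -- strictly higher loss on this token
```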
Still, this is the right direction to be thinking in, since that argument isn’t strong enough on its own, and it might not hold at some inconvenient future point.
By and large, the points you mentioned are part of the failure later in the story. The generated agent does have wants, does plan, does work around its context limits, does extrapolate beyond human designs, and does bootstrap into having self-knowledge.