Thanks for taking a shot!
Some of these thoughts were meant to be preempted in the text, like “perhaps one instantiation could start forming plans across other instantiations, using its previous outputs, but it’s a text-prediction model, it’s not going to do that because it’s directly at odds with its trained goal to produce the rewarded output.”
Namely, it’s not enough to say that the model can work around the limits of its context window when planning; it also needs to decide to do so, despite the fact that almost none of the text it was trained on would have encouraged that behavior. Backpropagation very strongly enforces that a model’s behavior is directed towards doing well at whatever it is trained on, so it isn’t immediately clear how that could happen.
If this behavior of repeating previous text in the context, in order to keep it from falling off the back of the window, were ever to show up during the training loop outside of the times when the model was explicitly modelling a person pretending to be a misaligned model, it would be heavily penalized. That’s not something you can do while still achieving a sufficiently low loss.
Still, this is the right direction to be thinking in, since the argument above isn’t strong enough on its own, and it might not hold at some inconvenient future point.
By and large, the points you mentioned are part of the failure later in the story. The generated agent does have wants, does plan, does work around its context limits, does extrapolate beyond human designs, and does bootstrap into having self-knowledge.