In fact, although the *output* tokens are myopic, autoregressive transformers are incentivised to compute activations at early sequence positions that will make them better at predicting tokens at later positions. This may also have indirect impacts on the actual tokens output at the early positions, although my guess would be this isn’t a huge effect.
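As a minimal sketch of that incentive (assuming PyTorch; the toy model, sizes, and variable names here are purely illustrative, not anything from the original discussion), you can see that a loss computed at the final position alone still sends nonzero gradients into the activations at every earlier position:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy causal self-attention: the loss is computed ONLY at the last position,
# yet gradients still flow into the activations at every earlier position,
# which is the mechanism that incentivises "prepare useful activations early".
d_model, seq_len, vocab = 16, 8, 50
emb = nn.Embedding(vocab, d_model)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
readout = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, seq_len))
x = emb(tokens)
x.retain_grad()  # keep per-position gradients so we can inspect them

# True entries are masked out, giving the usual causal (no-peeking) pattern.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
h, _ = attn(x, x, x, attn_mask=causal_mask)
logits = readout(h)

# Cross-entropy on the final position only.
target = torch.randint(0, vocab, (1,))
loss = nn.functional.cross_entropy(logits[:, -1], target)
loss.backward()

# Every earlier position gets a nonzero gradient: the final-token loss reaches
# back and shapes what gets computed at earlier sequence positions.
print(x.grad[0].norm(dim=-1))
```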
(I found myself writing notes down to clarify my own thoughts about parts of this, so this is in large part talking to myself that got commentified, not quite a direct reply)
It’s true that gradients can flow to causally visible tokens and modify weights to serve future predictions. This does break a type of narrow myopia, but I don’t find that break very concerning.
A previous token’s prediction receives its own calibration; weight modifications need to serve many such predictions to accumulate reliably.
That’s a pretty heavy constraint. There often aren’t many degrees of freedom for an earlier token to shift its prediction in service of a later one without harming the current one. A well-calibrated predictor can create self-fulfilling or self-easing prophecies while keeping local loss low only when the context permits it.
Another angle: suppose there are two distinct candidate predictions, P0 and P1, which the model estimates to be equally (and maximally) probable. We’ll assume P1 would make a future token prediction easier.
Despite P1’s long-term advantage, the model cannot simply bias P1’s probability upward, as that would worsen the current prediction’s loss in expectation. For P1 to be preferred against that local pressure, the expected future benefit must swamp the local influence.
Under what circumstances can that occur? In offline training, the model isn’t consuming its own outputs. It’s being calibrated against an external ground truth. Biasing up P1 isn’t helpful during training, because making the future easier to predict in an autoregressive context doesn’t reduce loss at training time. Matching the distribution of the input does. In this context, a bias in prediction is almost certainly from undertraining or a lack of capability.
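To make that local penalty concrete, here’s a toy calculation (my own illustration, not from the thread; the function and variable names are hypothetical): if P0 and P1 really are equally likely under the data distribution, any bias away from 50/50 strictly increases expected cross-entropy at the current position, so any hypothetical future benefit would have to pay for that loss elsewhere.

```python
import math

def expected_ce(q1: float, p1: float = 0.5) -> float:
    """Expected cross-entropy when the model assigns probability q1 to P1
    (and 1 - q1 to P0) while the ground-truth distribution assigns p1 to P1."""
    return -(p1 * math.log(q1) + (1.0 - p1) * math.log(1.0 - q1))

# The minimum sits exactly at q1 = p1 = 0.5; any "strategic" bias toward P1
# is paid for immediately as higher expected loss on the current token.
for q1 in (0.5, 0.6, 0.7, 0.9):
    print(f"model P(P1) = {q1:.1f} -> expected loss = {expected_ce(q1):.4f}")
```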
(The other options would tend to be esoteric stuff like “the model is extremely strong to the point of simulated agents managing a reflective gradient hacking attempt of some sort,” which doesn’t seem to be an obvious/natural/necessary outcome for all implementations.)
In other words, the manner in which GPT-like architectures lack myopia is that the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately.
There’s probably some existing terminology for this split that I don’t know about, but I might call it something like myopic perception versus myopic action. GPT-likes do not have myopic perception, but they do have myopic action, and myopic action is the important one.
I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.
> the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately
I don’t think it has to be in service of predicting the current token. It can sometimes yield lower overall loss to make a halfhearted effort at predicting the current token, so that the model can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn’t myopic.
As an example, induction heads make use of previous-token heads. The previous-token head isn’t actually that useful for predicting the output at the current position; it mostly exists to prepare some handy activations so that an induction head can look back from a later position and grab them.
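For intuition, here’s a hand-coded stand-in (my own sketch, not taken from any actual circuit analysis) for the behaviour the induction head ends up implementing: find the most recent earlier occurrence of the current token and predict the token that followed it. The previous-token head’s contribution is analogous to setting up the (previous token, current token) pairing that this lookup depends on.

```python
from typing import List, Optional

def induction_predict(tokens: List[str]) -> Optional[str]:
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current token, if such an occurrence exists."""
    current = tokens[-1]
    # Scan earlier positions from right to left for a prior occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "... Mr Dursley ... Mr" -> "Dursley"
print(induction_predict(["Mr", "Dursley", "was", "proud", "Mr"]))
```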
So LMs won’t deliberately give bad predictions for the current token if they know a better prediction, but they aren’t putting all of their effort into finding that better prediction.
That’s an important nuance my description left out, thanks. Anything the gradients can reach can be bent to what those gradients serve, so the computation at a given position can indeed be split between serving the current prediction and later ones, even if the current output should remain unbiased in expectation.