How LLMs are and are not myopic
Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture.
TLDR: The training goal for LLMs like GPT is neither cognitively-myopic (because there is an incentive to think about the future) nor value-myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next token). However, training is consequence-blind, because the training data is causally independent of the model’s actions. This assumption breaks down when models are trained on AI-generated text.
Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness).
Neither cognitively-myopic nor consequence-blind models should pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents.
LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game.
LLM pretraining is not value/prediction myopic (does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. This means that the model can and will implicitly sacrifice next-token prediction accuracy for long-horizon prediction accuracy.
You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws.
LLM pretraining on human data is consequence-blind as the training data is causally independent from the model’s actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier.
When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It’s not clear how this affects the training goal theoretically or in practice.
A myopic training goal does not ensure the model will learn myopic computation or behavior, because inner alignment with the training goal is not guaranteed.
The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. There’s also been confusion about the extent to which large language model (LLM) pretraining and other supervised learning methods are myopic and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work.
Types of Myopia
1. Cognitive Myopia
One natural definition for myopia is that the model doesn’t think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabilities handicap, as many tasks require some degree of forward planning or anticipation of future events.
LLM pretraining is not cognitively-myopic. Even though LLMs like GPT are optimized for next-token prediction and use causal masking which hides the future from current predictions, there is still a direct incentive to think about the future because it can be useful for immediate prediction accuracy. In a game of chess, efficiently computing the best move likely involves reasoning about how your opponent will respond multiple moves into the future. In other words, we should expect GPT to think about the future because it’s instrumental for predicting the present.
2. Value/Prediction Myopia
Value myopia refers to agents that place no value on future states or rewards. In the case of a predictive model, we mean the model cares only about the accuracy of its next prediction. Intuitively, it seems like this would rule out deceptive alignment and treacherous turns, because a value-myopic agent should be unwilling to sacrifice immediate value for long-term value. Unfortunately, value-myopic agents might still have non-myopic incentives for reasons like anthropic uncertainty or acausal trade (see Open Problems with Myopia for more details).
In any case, LLM training actually incentivizes value non-myopia. Even though next-token prediction accuracy is a nominally myopic objective, the transformer architecture actually causes the model to be optimized for accuracy over the entire sequence, not just the next token.
In a causal-masked transformer, attention layers can query the previous layers’ activations from any column in the context window. Gradients flow through the attention connections, so each previous layer is optimized not just to improve prediction accuracy for the next token, but also to produce values that are useful for future columns to attend to when predicting their token.
This means that the model can and will implicitly sacrifice next-token prediction accuracy for long-horizon prediction accuracy. In particular, we should expect trained models to find an efficient tradeoff between current prediction accuracy and full-sequence prediction accuracy. When predicting relatively easy tokens, they will likely spend most of their computation budget preparing for the future. If the model is given a hard problem that it will need to solve later in the sequence, it will likely work on the problem (‘in the back of its head’) throughout the intermediate predictions.
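To make the gradient-flow argument concrete, here is a minimal sketch in plain Python. It is a toy, not a real transformer: activations are scalars, and the query, key and value are all just the raw activation (an assumption made to keep the example small). The helper names (`causal_attention`, `loss_at`, `grad_wrt_activation`) are made up for illustration. The sketch estimates, by finite differences, how the loss at one column depends on the activation of another column.

```python
import math

def causal_attention(h):
    """Single-head causal self-attention over scalar activations.

    Toy assumption: query, key and value are all the raw activation.
    Column t only attends to columns s <= t (causal mask).
    """
    out = []
    for t in range(len(h)):
        scores = [h[t] * h[s] for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(sc - m) for sc in scores]
        z = sum(weights)
        out.append(sum(w / z * h[s] for s, w in enumerate(weights)))
    return out

def loss_at(h, targets, t):
    """Squared error of the prediction made at column t only."""
    return (causal_attention(h)[t] - targets[t]) ** 2

def grad_wrt_activation(h, targets, loss_col, act_col, eps=1e-6):
    """Central finite-difference estimate of d loss[loss_col] / d h[act_col]."""
    up = list(h)
    up[act_col] += eps
    down = list(h)
    down[act_col] -= eps
    diff = loss_at(up, targets, loss_col) - loss_at(down, targets, loss_col)
    return diff / (2 * eps)

h = [0.5, -1.0, 2.0]         # activations at columns 0, 1, 2
targets = [1.0, 0.0, -1.0]   # arbitrary regression targets

# The loss at the LAST column pulls on the FIRST column's activation.
future_to_past = grad_wrt_activation(h, targets, loss_col=2, act_col=0)
# The loss at the FIRST column cannot see later columns (causal mask).
past_to_future = grad_wrt_activation(h, targets, loss_col=0, act_col=2)

print("d loss[2] / d h[0] =", future_to_past)
print("d loss[0] / d h[2] =", past_to_future)
```

The asymmetry is the point: the causal mask hides the future from the forward pass, but in the backward pass the error from a later column still reaches earlier activations, so those activations are trained to be useful to later predictions.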
Several interpretability results including ROME confirm this type of non-myopic computation in LLMs. ROME shows that LLMs recall factual information about nouns when the noun first appears, even though this information is only used later when predicting the answer to a question about the noun. This information would be irrelevant and thus wasted computation for the purpose of predicting only the next token. For example, if the model sees the text “The Eiffel Tower”, it immediately begins retrieving information about the Eiffel Tower like where it is located even though that’s not necessary to predict the next token which is almost certainly “is”.
It is possible to modify the transformer architecture to enforce value (prediction accuracy) myopia by placing stop gradients in the attention layers. This effectively prevents past activations from being directly optimized to be more useful for future computation. We ran several informal experiments on models like these while at Conjecture. Unfortunately, we do not have quantitative results to share here. The experiments were preliminary and we moved on to other aspects of the project, so don’t take this as strong evidence.
Specifically, we trained a set of four traditional and four myopic transformers ranging from 117M to 1.5B parameters (equivalent to GPT-2 Small through GPT-2 XL). Each model was trained on the same data, but training hyperparameters were tuned for each architecture individually using maximal update parameterization.
We found the performance reduction from myopia was minimal at 117M parameters, but the performance cost increased with scale, i.e. myopic transformers have worse scaling laws.
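For intuition about what a stop gradient changes, here is a toy sketch in plain Python. This is not the code from the experiments above, and the post does not specify exactly where the stop gradients were placed. The sketch assumes they are applied to activations read from strictly earlier columns, replaces learned attention with a uniform causal average, and uses a single shared weight `w`. The `Dual` class, `stop_gradient`, and `loss_and_grad` are hypothetical helpers that implement forward-mode differentiation with respect to `w`.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """A value together with its derivative with respect to the weight w."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        return Dual(self.val + other.val, self.grad + other.grad)

    def __sub__(self, other):
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

def stop_gradient(d):
    """Keep the forward value, drop the derivative."""
    return Dual(d.val, 0.0)

def loss_and_grad(w, xs, targets, myopic):
    """Total squared error over all columns, and its derivative w.r.t. w.

    Toy assumptions: activation h_t = w * x_t, and the prediction at
    column t is the uniform average of h_0..h_t (no learned attention).
    If myopic is True, activations from strictly earlier columns pass
    through stop_gradient before being attended to.
    """
    wd = Dual(w, 1.0)                    # d w / d w = 1
    hs = [wd * Dual(x) for x in xs]      # per-column activations
    total = Dual(0.0)
    for t, target in enumerate(targets):
        pred = Dual(0.0)
        for s in range(t + 1):
            is_past = s < t
            pred = pred + (stop_gradient(hs[s]) if myopic and is_past else hs[s])
        pred = pred * Dual(1.0 / (t + 1))
        err = pred - Dual(target)
        total = total + err * err
    return total

xs = [1.0, 2.0, 3.0]
targets = [1.0, 1.0, 1.0]

full = loss_and_grad(1.0, xs, targets, myopic=False)
myopic = loss_and_grad(1.0, xs, targets, myopic=True)

print("loss (identical forward pass):", full.val, myopic.val)
print("gradient, standard:", full.grad)
print("gradient, myopic:  ", myopic.grad)
```

The forward pass and the loss are identical in both cases. Only the training signal differs: in the myopic variant, each column's activation is updated solely to improve its own prediction, and never because a later column would benefit from it.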
3. Consequence-blindness
A third type of myopia to consider is consequence-blindness, where a model chooses actions completely independent of any effect of its actions on the future. This is similar to the goal of Counterfactual Oracles.
Consequence-blindness should rule out most types of instrumental convergence and concerns about self-fulfilling prophecies. A model which completely ignores the effects of its actions has no instrumental incentive to pursue traditional instrumental goals, like trying to accumulate resources to become more powerful, trying to prevent its own shutdown, or pretending to be aligned in order to defect later. However, consequence-blindness does not actually constrain the behavior of a model, because the model can pursue any instrumental goal as a terminal value.
A consequence-blind simulator that predicts power-seeking agents (like humans) will still predict actions which seek power, but these actions will seek power for the simulated agent, not the simulator itself. I usually think about problems like this as simulator vs simulacra alignment. If you successfully build an inner aligned simulator, you can use it to faithfully simulate according to the rules it learns and generalizes from its training distribution. However you are still left with the problem of extracting consistently aligned simulacra.
In theory, consequence-blindness doesn’t rule out any capabilities, because a consequence-blind predictor could learn to predict any behavior. However, in practice using a consequence-blind training goal like pure imitation learning may be uncompetitive compared to methods like RL (or imitation + RL finetuning, the current dominant paradigm).
Consequence-blind agents (with a causal decision theory) can be seen as implementing a Lonely Causal Decision Theory (LCDT). An LCDT agent assumes that every other agent's decision node in the world (including its own future decisions) is causally independent of its actions. This means it has no incentive to take actions which help its future self or other agents for instrumental reasons.
Unlike the other forms of myopia above, the training goal for LLMs trained with self-supervised learning (SSL) is theoretically consequence-blind. In supervised or self-supervised learning, the training data already exists and is assumed to be causally independent from the model’s decisions. This means a model’s prediction should be based only on the likelihood of the output appearing in the training data. In particular, the model’s prediction should be independent of any effect from making the prediction itself, including whether or not the prediction would make the model more likely to predict or control the future correctly when run autoregressively.
The distinction between optimizing prediction accuracy and steering the distribution to be easier to predict is one of the most common sources of confusion about LLM myopia. Even though the LLM training goal is not value-myopic and optimizes for prediction accuracy across entire training examples, LLMs are not incentivized to predict tokens that make the future easier to predict.
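One way to write this distinction down, using standard notation that is not from the original discussion (the symbols $\mathcal{D}$ for the fixed training distribution, $p_\theta$ for the model, and $R$ for a generic reward are assumptions for illustration):

```latex
% Self-supervised next-token objective: the data distribution D is fixed
% and does not depend on the parameters theta.
\mathcal{L}_{\text{SSL}}(\theta)
  = -\,\mathbb{E}_{x \sim \mathcal{D}}
    \left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \right]

% A generic RL-style objective, for contrast: the sequences are sampled
% from the model itself, so the distribution depends on theta.
J_{\text{RL}}(\theta)
  = \mathbb{E}_{x \sim p_\theta} \left[ R(x) \right]
```

In the first objective the sum runs over every position in the sequence, which is why the goal is not value-myopic, but the expectation is over a distribution that the model cannot influence, which is why it is consequence-blind. In the second objective the model's own outputs determine which sequences are scored, so steering toward easier or higher-reward futures is directly incentivized.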
Unfortunately, the consequence-blindness story becomes more complicated when the training data is not actually causally independent from the model’s prediction.
Self-reference and Model Generated Data
In practice, the assumption that the supervised learning distribution is causally independent from the model’s predictions may be violated. This can happen whenever a model is trained on data correlated with its behavior, or if the model thinks the correct prediction is correlated with its behavior.
This is concerning because it’s not at all clear what a model that is predicting itself should output. It breaks many of the intuitions of why it should be safe to use LLMs as simulators of text distributions.
This problem is already apparent in current LLMs and is likely to become worse for several reasons:
Increasingly large portions of LLM pretraining and finetuning datasets will be generated or influenced by LLMs themselves.
LLMs are often trained or used in ways designed to make them more situationally aware that they are an AI assistant responding to a user’s prompt.
In my experience, larger models often become aware that they are an LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.
We could try to combat this problem by excluding model-generated text from all future training data, but it is still possible for a model trained on human-written text to drift off distribution and become situationally aware. You could see this as a training goal (outer alignment) problem: the consequence-blind base objective has to generalize somehow to self-referential predictions which are fundamentally not consequence-blind.
We could also try the opposite: actually increasing the amount of model-generated text in the training dataset. This could allow us to shape the LLM’s prior on how LLMs (and thus itself) should behave. However, it is still possible for a model to realize (or believe) that it is different from the types of models in the training data and should thus act differently.
In this vein, there have been several great posts that discuss how the process of repeatedly training a model on its own predictions might converge. There has also been work considering the general implications of training a simulator in a closed loop. However, the broader implications for generalization and alignment are unclear.
Myopic Training Goals vs Myopic Models
It is also important to note that even if a training goal is designed to be myopic in some way, the resulting model may not be myopic. Inner alignment failures can lead to non-myopic models emerging from myopic training goals. Finding a solution to inner alignment, or getting inner alignment by default, does seem more likely for predictive SSL than for other training goals, but it is not guaranteed. Many researchers believe the cognitive structures that are required to predict the answers to hard consequentialist problems will fundamentally be non-myopic, especially if these structures become situationally aware. Some examples.
It would be a huge success if we could find some way to enforce or verify that a model’s internal computation satisfies some myopic criteria (or any criteria…) during or after training. However, it’s not clear how we would go about this.
The ideas in the post are from a human, but most of the text was written by Chat GPT-4 with prompts and human curation using Loom. I endorse the post as technically correct and accurately phrased according to my understanding. Here is the second of two Loom trees used to generate most of the post before final edits.