Predictors as Agents

[UPDATE: I have concluded that the argument in this post is wrong. In particular, consider a generative model. Say the generative model has a choice between two possible ‘fixed point’ predictions A and B, and currently assigns A 70% probability and B 30% probability. Then the target distribution it is trying to match is 70% A / 30% B, so it will just stay like that forever (or drift, likely increasing the proportion of A). This is true even if B is easier to obtain good predictions for: the model will shift from “70% crappy model of A / 30% good model of B” to “70% slightly better model of A / 30% good model of B”. It won’t increase the fraction of B.

In general this means that the model should converge to a distribution of fixed points corresponding to the learning bias of the model—‘simpler’ fixed points will have higher weight. This might end up looking kind of weird anyway, but it won’t perform optimization in the sense I described below.]
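The stationarity claim in this update can be checked on a toy model. The following is a minimal sketch rather than a formalization from the post: it assumes a hypothetical model with a single logit `theta` whose sigmoid is the probability assigned to fixed point A, and it assumes the world realizes A with exactly the probability the model announced. The names `train`, `theta`, `batch`, and `seed` are all illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(theta, steps=100, lr=0.5, batch=None, seed=0):
    """Gradient descent on cross-entropy between the model's weight on A
    and a target that is generated by the model's own prediction.

    batch=None uses the exact target (infinite data); an integer batch
    size uses a noisy empirical target instead.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        w = sigmoid(theta)  # current probability assigned to A
        if batch is None:
            # Assumption: the world realizes A with probability exactly w.
            target = w
        else:
            # Finite sample of outcomes drawn from the self-fulfilling world.
            target = sum(rng.random() < w for _ in range(batch)) / batch
        grad = w - target  # d/d(theta) of cross-entropy(target, sigmoid(theta))
        theta -= lr * grad
    return sigmoid(theta)


theta0 = math.log(0.7 / 0.3)              # start at 70% A / 30% B
w_exact = train(theta0)                   # exact target: gradient is always zero
w_noisy = train(theta0, batch=20)         # finite batches: weight wanders
```

With the exact target the gradient vanishes at every step, so the weight on A stays at 70% no matter how easy B is to predict. With finite batches the update has zero mean, so under these assumptions the weight performs a random walk rather than climbing toward the easier branch.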


In machine learning, we can make the distinction between predictive and agent-like systems. Predictive systems include classifiers or language models. Agent-like systems are the domain of reinforcement learning and include AlphaZero, OpenAI Five, etc. While predictive systems are passive, modelling a relationship between input and output, agent-like systems perform optimization and planning.

It’s well-known around here that agent-like systems trained to optimize a given objective can exhibit behavior unexpected by the agent’s creators. It is also known that systems trained to optimize one objective can end up optimizing for another, because optimization can spawn sub-agents with different objectives. Here I present another type of unexpected optimization: in realistic settings, systems that are trained purely for prediction can end up behaving like agents.

Here’s how it works. Say we are training a predictive model on video input. Our system is connected to a video camera in some rich environment, such as an AI lab. It receives inputs from this camera, and outputs a probability distribution over future inputs (using something like a VAE, for instance). We train it to minimize the divergence between its predictions and the actual future inputs, exponentially discounting the loss for inputs farther in the future.

Because the predictor is embedded in the environment, its predictions are not just predictions; they affect the future dynamics of the environment. For example, if the predictor is very powerful, the AI researchers could use it to predict how a given research direction or hiring decision will turn out, by conditioning the model on making that decision. Then their future actions will depend on what the model predicts.

If the AI system is powerful enough, it will learn this; its model of the environment will include its own predictions. For it to obtain an accurate prediction, it must output a fixed-point: a prediction about future inputs which, when instantiated in the environment, causes that very prediction to come about. The theory of reflective oracles implies that such (randomized) fixed points must exist; if our model is powerful enough, it will be able to find them.
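To make the notion of a self-fulfilling prediction concrete, here is a minimal sketch under strong simplifying assumptions. The prediction is collapsed to a single probability `p` that some event occurs, and the environment's reaction is a hypothetical function `response(p)` giving the true chance of the event once `p` has been announced. Neither the function nor its coefficients come from the post; they are chosen only so that the iteration converges.

```python
def response(p):
    """Hypothetical environment: true probability of the event after the
    predictor publicly announces probability p."""
    return 0.2 + 0.6 * p


def find_fixed_point(f, p0, tol=1e-12, max_iter=10_000):
    """Repeatedly replace the prediction with what the environment would
    actually do in response to it, until the two agree."""
    p = p0
    for _ in range(max_iter):
        p_next = f(p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("iteration did not converge")


p_star = find_fixed_point(response, p0=0.9)
# At the fixed point, announcing p_star makes p_star the correct prediction.
residual = abs(response(p_star) - p_star)
```

Plain iteration only works here because this `response` is a contraction (its slope is below 1). A realistic environment offers no such guarantee, which is why the existence argument in the text appeals to randomized fixed points instead.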

The capacity for agency arises because, in a complex environment, there will be multiple possible fixed-points. It’s quite likely that these fixed-points will differ in how the predictor is scored, either due to inherent randomness, logical uncertainty, or computational intractability (predictors could be powerfully superhuman while still being logically uncertain and computationally limited). Then the predictor will output the fixed-point on which it scores the best.
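The selection step described in this paragraph can be sketched in a toy setting. Everything below is an assumption made for illustration: the prediction is a single probability `p`, the hypothetical `response(p)` is an S-shaped reaction curve with several self-consistent predictions, and a fixed point's score is the expected log loss the predictor would incur there, which for a binary event is the entropy of `p`.

```python
import math


def response(p):
    """Hypothetical S-shaped environment: chance the event happens when the
    predictor announces probability p. Coefficients are illustrative."""
    return 0.05 + 0.85 / (1.0 + math.exp(-12.0 * (p - 0.5)))


def find_fixed_points(f, n_grid=1000, tol=1e-10):
    """Find solutions of f(p) = p on [0, 1] by scanning a grid for sign
    changes of f(p) - p and refining each bracket by bisection."""
    def g(p):
        return f(p) - p

    points = []
    for i in range(n_grid):
        lo, hi = i / n_grid, (i + 1) / n_grid
        if g(lo) == 0.0:
            points.append(lo)
            continue
        if g(lo) * g(hi) < 0:
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            points.append(0.5 * (lo + hi))
    return points


def expected_log_loss(p):
    """Expected log loss of predicting p when the event truly has
    probability p (binary entropy, in nats)."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


fixed_points = find_fixed_points(response)
# The selection claimed in the text: output the fixed point that scores best.
best = min(fixed_points, key=expected_log_loss)
```

This curve has three self-consistent predictions: one where the event is unlikely, one near the middle, and one where it is likely. They are equally accurate in the sense that each one comes true, but they differ in how much residual randomness is left over, and `best` is the one with the least.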

As a simple example, imagine a dispute between two coworkers, Alice and Bob, in the AI lab; each suspects the other of plotting against them. If Bob is paranoid, he could check the predictions of the AI system to see what Alice will do. If the AI system predicts that Alice will publicly denounce Bob, this could confirm his suspicions and cause him to spread rumors about Alice, leading her to publicly denounce him. Or, if the AI system predicts that Alice will support Bob on a key issue, Bob could conclude he was wrong all along and try to reconcile with Alice, leading her to support him on a key issue. The AI system will prefer to output whichever branch of the prediction is simpler to predict.

A more extreme example. If the predictor is VERY superhuman, it could learn an adversarial fixed-point for humans: a speech which, when humans hear it, causes them to repeat the speech, then commit suicide, or otherwise act in a very predictable manner. It’s not clear if such a speech exists; but more broadly, in a complex environment, the set of fixed-points is probably very large. Optimizing over that set can produce extreme outcomes.

These same problems could arise within AI systems which use predictors as a component, like this system, which contains a predictive model optimized for predictive accuracy, and a policy network optimizing for some objective. The policy network’s decisions will depend on what the predictive model says is likely to happen, influencing what the predictive model ends up seeing. The predictor could then steer the overall system towards more predictable fixed-points, even if those fixed-points obtain less value on the objective the policy network is supposed to be optimizing for.

To some extent, this seems to undermine the orthogonality thesis; an arbitrary predictor can’t just be plugged into an arbitrary objective and be counted on to optimize well. From the perspective of the system, it will be trying to optimize its objective as well as it can given its beliefs; but those ‘beliefs’ may themselves be optimized for something quite different. In self-referential systems, beliefs and decisions can’t be easily separated.