The Case for Predictive Models

I’m also posting this on my new blog, Crossing the Rubicon, where I’ll be writing about ideas in alignment. Thanks to Johannes Treutlein and Paul Colognese for feedback on this post.

Just over a year ago, the Conditioning Predictive Models paper was released. It laid out an argument and a plan for using powerful predictive models to reduce existential risk from AI, and outlined some foreseeable challenges to doing so. At the time, I saw the pieces of a plan for alignment start sliding together, and I was excited to get started on follow-up work.

Reactions to the paper were mostly positive, but discussion was minimal and the ideas largely failed to gain traction. I suspect that muted reception was in part due to the size of the paper, which tried to both establish the research area (predictive models) and develop a novel contribution (conditioning them). Now, despite retaining optimism about the approach, even the authors have mostly shifted their focus to other areas.

I was recently in a conversation with another alignment researcher who expressed surprise that I was still working on predictive models. Without a champion, predictive models might appear to be just another entry on the list of failed alignment approaches. To my mind, however, the arguments for working on them are as strong as they’ve ever been.

This post is my belated attempt at an accessible introduction to predictive models, but it’s also a statement of confidence in their usefulness. I believe the world will be safer if we can reach the point where the alignment teams at major AI labs consider the predictive models approach among their options, and alignment researchers have made conscious decisions about whether or not to work on them.

What is a predictive model?

Now the first question you might have about predictive models is: what the heck do I mean by “predictive model”? Is that just a model that makes predictions? And my answer to that question would be “basically, yeah”.

The term predictive model refers to the class of AI models that take in a snapshot of the world as input and, based on their understanding, output a probability distribution over future snapshots. It can be helpful to think of these snapshots as represented by a series of tokens, since that’s typical for current models.
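To make the interface concrete, here’s a minimal Python sketch. Everything in it (the vocabulary, the function name, the hard-coded probabilities) is a hypothetical stand-in, not any real lab’s API: the point is just that a predictive model maps a snapshot to a distribution over next snapshots.

```python
# Toy sketch (not any real lab's API): a predictive model maps a
# snapshot of the world, represented here as a sequence of tokens,
# to a probability distribution over possible next snapshots.
VOCAB = ["rain", "sun", "snow"]

def predict_next(snapshot: list[str]) -> dict[str, float]:
    """Hypothetical predictor: returns P(next token | snapshot)."""
    # A real model would compute these probabilities from learned
    # weights; we hard-code a distribution purely for illustration.
    if snapshot and snapshot[-1] == "rain":
        probs = [0.6, 0.3, 0.1]  # rainy weather tends to persist
    else:
        probs = [0.2, 0.7, 0.1]
    return dict(zip(VOCAB, probs))

dist = predict_next(["sun", "rain"])  # a distribution over the next snapshot
```

A real model conditions on far richer snapshots, but the type signature, snapshot in, distribution out, is the whole definition.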

As you are probably already aware, the world is fairly big. That makes it difficult to include all the information about the world in a model’s input or output. Rather, predictive models need to work with more limited snapshots, such as the image recorded by a security camera or the text on a page, and combine that with their prior knowledge to fill in the relevant details.


One reason to believe predictive models will be competitive with cutting edge AI systems is that, for the moment at least, predictive models are the cutting edge. If you think of pretrained LLMs as predicting text, then predictive models are a generalization that can include other types of data. Predicting audio and images are natural next steps, since we have abundant data for both, but anything that can be measured can be included.

This multimodal transition could come quite quickly, alongside a jump in capabilities. If language models already use internal world models, then incorporating multimodal information might well be just a matter of translating between data types. The search for such translations is already underway, with projects like OpenAI’s Sora and Whisper. Finding a clean translation, whether by gradient descent or manually, would unlock huge amounts of training data and blow past the current bottleneck. With that potential overhang in mind, I place a high value on anticipating and solving issues with powerful predictive models before we see them arise in practice.

The question remains whether pretrained LLMs are actually predicting text. They’re trained on cross-entropy loss, which is minimized by accurate predictions, but that doesn’t mean LLMs are making predictions in practice. Rather, they might be thought of more like a collection of heuristics, reacting instinctually without a deeper understanding of the world. In that case, the heuristics are clearly quite powerful, but without causal understanding their generalization ability will fall short.

If pretrained LLMs are not making predictions, does the case for predictive models fall apart? That depends on what you anticipate from future AI systems. I believe that causal understanding is so useful for making predictions that it must emerge for capabilities to continue increasing to a dangerous level. I stay agnostic on whether that emergence would come from scale or from algorithmic choices; if it does not occur at all, then we’ll have more time for other approaches.

My concerns about existential risk are overwhelmingly focused on consequentialist AI agents, the kind that act in pursuit of a goal. My previous post broke down consequentialists into modules that included prediction, but the argument that they do prediction is even simpler. For a consequentialist agent to choose actions based on their consequences, they must be able to predict the consequences. This means that for any consequentialist agent there is an internal predictive model that could be extracted, perhaps by methods as straightforward as attaching a prediction head.

The flip side of this is that a predictive model can easily be modified to become a consequentialist agent. All that’s needed is some scaffolding that lets the model generate actions and evaluate outcomes. This means that by default, superhuman predictive models and superhuman general agents are developed at the same time. Using predictive models to reduce existential risk requires that the lab that first develops them choose to do so, when its other option is deploying AGI.
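The scaffolding in question can be startlingly simple. Here’s a toy sketch of it in Python, with a stand-in world model and evaluation function of my own invention: wrap a predictor in a generate-and-evaluate loop and you have an agent that chooses actions by their predicted consequences.

```python
import random

def predict_outcome(state: float, action: float) -> float:
    """Stand-in for a predictive model: forecast the resulting state."""
    return state + action  # trivially simple "world model"

def evaluate(outcome: float) -> float:
    """Stand-in goal: prefer outcomes near a target value of 10."""
    return -abs(outcome - 10.0)

def act(state: float, n_candidates: int = 100) -> float:
    """The scaffolding: generate candidate actions, score each by its
    predicted outcome, and return the best-scoring action."""
    candidates = [random.uniform(-5.0, 5.0) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: evaluate(predict_outcome(state, a)))

random.seed(0)
best = act(state=7.0)  # an action that steers the predicted state toward 10
```

The predictor here is deliberately trivial; the point is that nothing in the loop itself is hard to build, so the safety of a predictive model cannot rest on the difficulty of agentizing it.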

Inner Alignment

An important fact about the Conditioning Predictive Models paper is that Evan Hubinger is the first author. I mention that not (only) as a signal of quality, but as a way to position it in the literature. Evan is perhaps best known as the lead author on Risks from Learned Optimization, the paper that introduced the idea of deceptive alignment, where an unaligned model pretends to be aligned in training. Since then, his work has largely focused on deceptive alignment, including establishing the threat model and developing ways to select against it.

The story of how deceptive alignment arises is that a model develops both an understanding that it is in training and preferences regarding future episodes, before it has fully learned the intended values. It then pretends to be aligned as a strategy to get deployed, where it can eventually seize power robustly. This suggests two paths to avoiding deceptive alignment: learn the intended goal before situational awareness arises, or avoid developing preferences regarding future episodes.

A major strength of predictive models is that their training process works against deceptive alignment on both of these axes. That doesn’t come close to a guarantee of avoiding deceptive alignment, but it does create the easiest deceptive alignment problem that we know of.

The goal of making accurate predictions is simple, and can be represented mathematically with a short equation for a proper scoring rule. Beyond that, since a consequentialist agent must already be making predictions, transforming it into a predictive model only requires pointing its goal at that process. The simplicity of this goal, and the ease of representing it, significantly increase the likelihood that a model can fully internalize it before developing situational awareness.
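As a concrete illustration of what “proper scoring rule” buys us (my own toy example, not from the paper), the log score rewards honesty: if an event has true probability p, a predictor maximizes its expected score exactly by reporting p, so accuracy is precisely what the training objective selects for.

```python
import math

# Toy check that the log score is a proper scoring rule: if an event
# has true probability p, the expected score p*log(q) + (1-p)*log(1-q)
# of reporting q is maximized by the honest report q = p.

def expected_log_score(p: float, q: float) -> float:
    return p * math.log(q) + (1 - p) * math.log(1 - q)

p_true = 0.7
reports = [i / 100 for i in range(1, 100)]  # candidate reports 0.01..0.99
best_report = max(reports, key=lambda q: expected_log_score(p_true, q))
# best_report comes out to 0.7: the honest prediction scores best
```

Cross-entropy loss on next-token prediction is this same objective applied over a vocabulary, which is why accurate prediction is what it incentivizes.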

At the same time, the training can be set up so that each episode, consisting of a single prediction, is independent of all the others. In that case, the model can optimize each episode individually, with no incentive to take suboptimal actions in exchange for future benefits. Without that incentive, a model won’t underperform on its true goal in training in order to get deployed, which allows the training process to catch and correct any misalignment.

Predictive models take the traditional approach to dealing with deceptive alignment and flip it on its head. Rather than starting with a goal we like and asking how to avoid deceptive alignment while training for it, we start with a goal that avoids deceptive alignment and ask how we can use that to achieve outcomes we like.

Using Predictive Models

If we’re able to make predictive models safe, I’m confident that we can use them to drastically and permanently lower existential risk. Failing to find a way to use superhuman predictive ability to create a safer world would reflect a massive failure of imagination.

The first way to use predictive models is to have them predict solutions to alignment. A model predicting the output of an alignment researcher (canonically Paul Christiano) is equivalent to generating that output itself. We could also have the model predict the content of textbooks or papers, rather than specific people. This is similar to the approach of major labs like OpenAI, whose Superalignment team plans to align a human-level AI that can generate further alignment research, although creating an agent has different pros and cons than predicting one.

We could also use predictive models to significantly augment human ability to influence the world. If you think of a consequentialist agent as modules for searching, predicting, and evaluating, then you can imagine a cyborg organism where a human generates plans, uses a model to predict the outcomes, then does their own evaluation. These could be plans for policies that labs or governments could enact, plans to convince others to cooperate, or even plans for deploying AGI (potentially providing a warning shot if it’s unsafe). Here, predictive models just need to beat human predictions to be useful; they don’t need to be strong enough to predict future scientific progress. While such uses might not be enough to permanently reduce risk on their own, they could certainly buy time and improve coordination.

Finally, predictive models can and likely will be used in the training of general AI systems. Right now, the RLHF reward model predicts how a human judge will evaluate different outcomes. In the likely event that we do online training on deployed models, we won’t be able to observe the outcomes of untaken actions, and so will need good predictions of their outcomes to evaluate them. Training models to optimize for predicted outcomes rather than observed ones may also have some desirable safety properties. If predictive models are a critical component in the training process of an aligned AGI, then that certainly counts as using them to lower existential risk.


The previous section started with a very big “if”. If we’re able to make predictive models safe, the upside is enormous, but there are major challenges to doing so. Predictive models are a research agenda, not a ready-to-implement solution.

Anything that can be measured can be predicted, but the inverse is also true: whatever can’t be measured is necessarily excluded. A model that is trained to predict based on images recorded by digital cameras likely learns to predict what images will be recorded by digital cameras – not the underlying reality. If the model believes that the device recording a situation will be hacked to show a different outcome, then the correct prediction for it to make is that false reading.

There are related concerns that if the model believes it is in a simulation then it can be manipulated by the imagined simulator, in what is known as anthropic capture. However, that’s a bit complicated to go into here, and the threat is not unique to predictive models.

This lack of distinction between representation and reality feeds into the biggest technical issue with predictive models, known as the Eliciting Latent Knowledge problem. The model uses latent knowledge about what is happening (e.g. the recording device got hacked) to make its prediction, but that knowledge is not reflected in the observable output. How can we elicit that information from the model? This is made more challenging by the fact that training a model to explain the predictions requires differentiating between explanations that seem true to a human evaluator and the explanation that the model actually believes.

The titular contribution of the Conditioning Predictive Models paper is an attempt at addressing this problem. Rather than having the model tell us when the observable prediction doesn’t match the underlying reality, we only use it to predict situations where we are confident there will be no differences between the two. This takes the form of conditioning on hypothetical scenarios, like a global moratorium on AI development, before making the prediction. While there is ample room for discussion of this approach, I’m worried it got buried by the need for the paper to establish the background on predictive models.

One issue raised in the paper, as well as in criticisms of it, is that models that predict the real world can’t necessarily predict hypotheticals. Making predictions based on alternate pasts or unlikely events may require some kind of additional training. Regardless of how easy it is to do so, I want to clarify that the need for handling such hypotheticals is a feature of this particular approach, not a requirement for using predictive models in general.

The second major technical issue with predictive models is that optimizing for predictive accuracy is not actually safe for humanity. The act of making a prediction affects the world, which can influence the prediction’s own outcome. For superhuman predictive models with a large space of possible predictions, this influence could be quite large. In addition to the dangers posed by powerful models trying to make the world more predictable, predictions become useless in the default case that they’re not reflectively stable, since they’re inaccurate once they’re made.
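A toy illustration of the reflective stability problem (my own construction, not from the paper): suppose the outcome depends on the published forecast, as when a forecast of a bank run makes the run more likely. Then a prediction is only accurate once announced if it is a fixed point of the world’s response to it.

```python
# Toy model of a prediction influencing its own outcome. The response
# function below is invented for illustration: publishing a higher
# forecast makes the event more likely.

def outcome_prob(announced: float) -> float:
    """Probability of the event after a forecast of `announced`
    is published."""
    return 0.2 + 0.6 * announced

# Iterating the response map converges to the reflectively stable
# prediction: the p satisfying p = 0.2 + 0.6 * p, i.e. p = 0.5.
p = 0.9
for _ in range(100):
    p = outcome_prob(p)
```

In this toy the response map is a contraction, so a stable prediction exists and is easy to find; for a superhuman model choosing among a large space of predictions, neither existence nor harmlessness of such fixed points comes for free.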

This is the problem that I’ve been working on recently! Since the causal pathway for a prediction to affect its own outcome is the response to it, I focus on eliciting predictions conditional on possible responses. This strategy introduces new issues, but I’m making progress on solutions, which I’ll write more about in future posts.

The third challenge is that we don’t know the best ways to actually use predictive models. I laid out some approaches in the previous section, but those are light on implementation details. Which research outputs should we actually predict, how should we integrate predictive models into our decision making, and can we use predictive models to help align more general systems? Are there other ways we should be using predictive models to reduce existential risk? The more details we have planned out ahead of time, the less time we need to waste figuring things out in what may well be critical moments.

The final and perhaps largest risk with predictive models is simply that they are not used. Even if the above issues are solved, the lab that develops the strongest predictive models could instead use them to generate further capabilities advancements or attach scaffolding that transforms them into AGI. The only way around this is if the labs that are closest to creating AGI recognize the potential of predictive models and consciously choose to use them both safely and for safety. As such, the path to success for the predictive models agenda depends not only on technical progress, but also on publicly establishing it as a viable approach to alignment.