This is the third of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper.

3. The case for competitiveness

In addition to ensuring that we can condition predictive models safely, for such an approach to work as a way to actually reduce AI existential risk, we also need it to be the case that it is competitive—that is, that it doesn’t impose too much of an alignment tax. Following “How do we become confident in the safety of a machine learning system?” we’ll distinguish between two different aspects of competitiveness here that we’ll need to address:

Training rationale competitiveness [(Implementation competitiveness)]: how hard the training rationale [(getting the model we want)] is to execute. That is, a proposal should fail on training rationale competitiveness if its training rationale is significantly more difficult to implement—e.g. because of compute or data requirements—than competing alternatives.

Training goal competitiveness [(Performance competitiveness)]: whether, if successfully achieved, the training goal [(the model we want)] would be powerful enough to compete with other AI systems. That is, a proposal should fail on training goal competitiveness if it would be easily outcompeted by other AI systems that might exist in the world.

To make these concepts easier to keep track of absent the full training stories ontology, we’ll call training rationale competitiveness implementation competitiveness, since it describes the difficulty of implementing the proposal, and training goal competitiveness performance competitiveness, since it describes the achievable performance for the resulting model.

Implementation competitiveness

The most generally capable models today, large language models, seem to be well-described as predictive models. That may change, but we think it is also at least quite plausible that the first human-level AGI will be some sort of predictive model, likely similar in structure to current LLMs.

Furthermore, LLM pre-training in particular seems to be where most of the capabilities of the most advanced current models come from: the vast majority of compute spent training large language models is spent in pre-training, not fine-tuning. Additionally, our guess is that the fine-tuning that is done is best modeled as targeting existing capabilities rather than introducing entirely new capabilities.

Assuming that, after pre-training, LLMs are well-understood as predictive models, that suggests two possibilities for how to think about different fine-tuning regimes:

The fine-tuning resulted in a particular conditional of the original pre-trained predictive model.
The fine-tuning targeted the capabilities by turning the predictive model into one that is no longer well-understood as predictive.

In the first case, the conditioning predictive models approach would simply be a variation on the exact techniques currently used at the forefront of capabilities, making it hopefully implementation competitive by default.^[1] The main way we think such an implementation competitiveness argument could fail is if the fine-tuning necessary to get the sort of conditionals we describe here is substantially harder than alternative fine-tuning paradigms.

In particular, we think it is likely the case that our proposed solutions will add some amount of overhead to the implementation and use of predictive models, since they involve being substantially more careful regarding what conditionals are used.^[2] That being said, we’re hopeful that the overhead requirement here should not be too extreme: the conditionals we propose are not fundamentally different than those already in use, and in many cases the conditionals we want—e.g. predict a human researcher—might actually be more straightforward than some of the sorts of conditionals that current RLHF approaches often aim for—e.g. behave as a helpful, harmless, and honest agent.

In the second case, however, the conditioning predictive models approach would require us to either halt or at least substantially change fine-tuning practices—which could certainly hamper the implementation competitiveness of this approach. Thus, it’s worth carefully investigating and understanding when different fine-tuning regimes are likely to fall into which of these categories—especially because, if in fact some fine-tuning results in a model that’s no longer well-understood as a predictive model, none of the ways of ensuring such a model’s safety that we discuss here would still apply.

Currently, the state of the art in language modeling uses reinforcement learning from human feedback (RLHF) as the fine-tuning approach of choice for targeting LLM capabilities, and thus will be the method that we focus on exploring. To the extent that RLHF is just a way to access a wider array of conditionals, this seems fine and can be incorporated into our approach. Currently, however, we don’t understand under what circumstances RLHF can be thought of as conditioning a model versus doing something more complicated. We will go into the specific question of how to think about how our approach interacts with RLHF in more detail in Section 4.

Furthermore, we note that there has been some work on using LLMs to predict world events. Performance of GPT-2-style models is decidedly worse than aggregations of expert humans, but this work serves as a proof of concept that the kind of approach we envision is possible at all.

Beyond fine-tuning, another potential implementation competitiveness issue lies in transparency and interpretability. Later we will discuss using transparency and interpretability to address potential inner alignment issues with training a predictive model—we recognize that this could create an implementation competitiveness burden. However, we do not believe that inner alignment issues that might require transparency and interpretability tools to mitigate are at all specific to this approach. The inner alignment issues that might arise are fully general, and apply to all approaches using LLMs. Thus, any other safe approach using LLMs will also need to use the same transparency and interpretability tooling to mitigate inner alignment issues, meaning that this approach is no less implementation competitive than any other safe approach.

That being said, our hope is that transparency tools are not necessary for inner alignment, and that predictive models have a decent chance of being inner aligned by default. Furthermore, we think that the inner alignment issues for predictive models are potentially substantially easier than those for agents. We will discuss these issues further in Section 4.

Performance competitiveness

In order to ensure performance competitiveness, we have to ensure that our approach is able to accomplish all of the tasks that would otherwise be able to be accomplished via other means of training and deploying powerful AI systems. However, if we operate under the assumption that LLMs are likely to be the first powerful AI systems, then all we need to do is ensure that anything anyone else can do with an LLM we can do via careful conditioning as well: that is, we need to ensure that we aren’t imposing any restrictions on what one can do with an LLM that affect capabilities too much.

Our hope is that our techniques are general enough to be able to safely target an LLM at any task. We of course require that the capability in fact be present in the LLM to begin with such that it could possibly be elicited, but beyond that our hope is that almost any capability that an LLM has can be safely elicited via the sorts of conditioning approaches that we discuss.

Restricting ourselves to only predicting humans

Unfortunately, we absolutely are restricting what things we are allowing one to do with an LLM. In particular, we are restricting ourselves to only predicting (augmented) humans rather than other AI systems.

We think that this is the primary way that the overall performance competitiveness argument goes wrong: we are limited to tasks theoretically performable by some conceivable human or group of humans, since we are explicitly avoiding simulating AIs. As a result, the basic approach described here is not an indefinitely scalable alignment solution, in the sense that its ability to produce aligned models stops working at some high level of model capabilities.

That being said, we think this limitation is not necessarily fatal for our overall approach: it seems very plausible that we could target LLM capabilities up to the level of any task that any group of highly-capable people could accomplish under perfect conditions over some reasonable period of time, say e.g. five years—what we described previously as the “max human” level. Though not arbitrarily powerful, our hope is that the capabilities in the first transformative LLMs that are available to be elicited will be below that level—and that we will be able to then use those capabilities to help us align the next set of models that are above that level.

Thus, even though we can’t safely elicit capabilities just via conditioning beyond the threshold of what any humans could theoretically accomplish, as long as the capabilities that we need to successfully navigate the transition from slightly superhuman/transformative LLMs to highly superhuman AGIs is under that threshold, that fact shouldn’t constitute a fatal performance competitiveness issue.

Restricting ourselves to dialogue agents that claim to be humans

In addition to the capability limitations of predicting humans described above, there is another potential limitation to just doing human prediction: it makes it substantially harder to train AI systems that truthfully represent what they are. In particular, if you ask a predictive model predicting a helpful human assistant whether it’s an AI or a human, it will claim to be a human, which is correct in the sense that claiming to be a human is what a human would do, but incorrect in the sense that the AI system is not itself a human.

Truthfully identifying as an AI is important for helpful AI assistants. Impersonating a human would present a potentially thorny deployment and societal issue, since it could be quite confusing for users to interact with AIs that claim to be humans. A ban on impersonating humans, if this is not clearly recognizable to the user, has also been suggested as part of the proposed EU AI Act. Given that creating AI assistants and chatbots is an increasingly important use of predictive models, it would be an important gap if one could not build safe AI assistants just by predicting humans.

That being said, there are multiple ways to potentially address this sort of limitation. One idea could be to simply predict humans pretending to be AIs—though this does increase the possibility of those humans knowing they’re being predicted, as we discussed previously. Alternatively, another approach could be to use two models, one of which generates the original output that it predicts a human would produce, and another that performs the task of translating that output into the equivalent thing but claiming to be an AI rather than a human—both of which are tasks that humans should be plenty capable of, and thus should be doable just via predicting humans. These patches could backfire, however, since the conditional needed to implement a capable assistant may be strange and unlikely and could thus elicit unsafe AI predictions, though we are hopeful one could develop a safe version of such patches.^[3]

Restricting ourselves to only using predictive models

In addition to only predicting humans, another major restriction of the conditioning predictive models approach is that we use predictive models at all—approaches that instead attempt to e.g. turn an LLM into an agent are clearly ruled out. Thus, we need conditioning predictive models to be performance competitive with such alternative agentic proposals.

Certainly, we will not be able to get performance parity between human predictors and agents in regimes with arbitrary levels of capabilities, due the basic limitations around predicting humans that we just discussed. However, we are optimistic that performance parity up to that point is achievable.

The most basic argument for why agents might be more performance competitive than predictors is that they can make plans and strategies for how to achieve particular goals over the long-term. In the regime where we are only dealing with capabilities below the max human level, however, we can simply get such planning from predicting humans.

However, there is one specific case where human agentic planning might not be sufficient, even in the sub-max-human capability regime: agents might be better than predictors at planning and strategizing with their own internal cognitive resources such as time, attention, etc. Just saying we are going to predict humans is not sufficient to deal with this criticism: if a predictor attempting to predict humans isn’t managing its own internal resources as effectively as an alternative more agentic model, then it’ll have worse performance in practice even though in theory the humans it was predicting would have been able to accomplish the task.

In our opinion, however, we think this sort of problem is far from fatal. To start with, just because a model is attempting to solve a prediction task doesn’t mean it’s incapable of using more agentic planning/optimization machinery to manage its own internal cognitive resources. In fact, we think doing so is probably necessary for all practical competent predictors: if my goal is to predict what will show up on the Alignment Forum tomorrow, I can’t just model the entire world with no internal prioritization—I have to be judicious and careful in modeling only the aspects of it that are directly relevant. Though this does ensure that our predictor will be a mesa-optimizer in some capacity, we think inner aligning such a mesa-optimizer is potentially achievable, as we will discuss in Section 4.^[4]

There is an additional issue here, however, that comes from the prediction loss: doing well on per-token log loss requires modeling a lot of random fiddly details—e.g. what synonyms/formatting/grammar/etc. people will use—that aren’t very important to the sort of performance—e.g. actually accomplishing complex tasks—that we care about. As a result, prediction models might end up devoting a bunch of extra capacity to modeling these sorts of fiddly details in a way that makes them less competitive.

In our opinion, however, this sort of issue is quite addressable. First, it’s worth pointing out that it’s unlikely to be a very large difference in competitiveness: the fiddly details are not as hard to predict as the more complex large-scale phenomena, so as performance gains from fiddly detail prediction get eaten up, relatively more and more capacity should start going to high-level modeling instead. Second, if the model is capable of dynamically allocating its capacity based on what is most relevant to focus on for each conditional, then a lot of this problem can be solved just via selecting the right conditional. In particular, if you use a conditional that makes it very clear what the allowable formatting is, what the sort of voice that is being used is, etc., then predicting things like synonyms and formatting becomes much easier, and the model can devote more relative capacity to predicting more high-level phenomena.

Handling sequential reasoning

Even if the only mechanism ever used for eliciting LLM capabilities is some form of conditioning, performance competitiveness for careful conditioning approaches is not guaranteed, since we are placing restrictions on the sorts of conditionals we’re allowing ourselves to use.

In particular, we have to deal with situations where the only competitive way of eliciting a certain capability is via sequential reasoning—e.g. via chain of thought reasoning,“think step by step” prompts, etc. In our opinion, we think it is highly likely that such sequential reasoning approaches will be necessary to use any predictive model in a competitive way, and thus ensuring that careful conditioning approaches are compatible with such sequential reasoning is quite important.

Under the model that LLMs are well-understood as predictive models, sequential reasoning effectively does two things:

it helps convince the model that it should be predicting a good reasoner rather than a poor one, and
it effectively splits up the overall prediction task into individual, easier prediction tasks, thus giving the model more overall compute to work with.

Notably, neither of these benefits are incompatible with also conditioning the model in such a way so as to, for example, ensure that the reasoner it’s predicting is a human rather than an AI. In fact, doing so seems necessary to make these sorts of sequential reasoning techniques safe: without them, conditioning on a good reasoner just increases the probability that the model will predict a malign AI as AIs become better reasoners than humans, absent some technique to prevent it from doing so.

That being said, there is a valid concern that requiring such sequential reasoning to be done by predicting humans is uncompetitive. Certainly, requiring sequential reasoning to be done by predicting humans will be uncompetitive for eliciting any capabilities beyond the max human level. However, in regimes where no such capabilities are present to be elicited, we are optimistic that purely relying on sub-max-human sequential reasoning should be sufficient.

Unfortunately, there is one major performance competitiveness roadblock here, which is very related to the previous concern regarding managing internal cognitive resources. While we think that a predictive model should be able to efficiently handle managing its own internal cognitive resources, sequential reasoning approaches effectively give it additional external cognitive resources to manage as well, specifically how to use its chain of thought scratchpad. For example, a chain of thought model has to figure out how to decompose questions in such a way that the subproblems are easier for it—the model—to solve than the original question.

Ideally, the goal would be to predict how some human or group of humans would manage those external cognitive resources. There is one major issue with just using human resource management, however, which is that the optimal way of managing external cognitive resources might be very different between humans and AIs—e.g. due to differences in what sorts of subtasks each is relatively better or worse at. As a result, managing external cognitive resources the way that a human would could be substantially worse because it limits the model’s ability to steer the sequential reasoning into subtasks that the model is best at and away from subtasks that the model is worst at.

There are some potential solutions to this problem, though we think it is overall a serious problem. First, there are a lot of possible counterfactual humans that we can draw on here—and humans themselves also vary a great degree regarding what sorts of tasks they’re best at. As a result, we could try to simply condition on predicting a human or group of humans with similar strengths and weaknesses to the model. Unfortunately, this does come with the downside of potentially increasing the model’s credence that the so-called “human” it’s trying to predict is actually an AI. Second, we could try to give the predicted human the explicit task of trying to break down the problem in such a way that plays to the strengths and weaknesses of the AI. Unfortunately, though this doesn’t come with the problem of increasing the model’s credence that the human it is predicting is an AI, it should certainly increase the model’s credence that there are AIs in the world of the human it is predicting, which—as we discussed previously—could be similarly problematic, though potentially less bad.

↩︎
If a new paradigm displaced LLM pretraining + fine-tuning, though, then all bets are off.
↩︎
For example, it is likely to be strictly more difficult to implement an LLM that never predicts other AIs than one that does, as well as strictly more cumbersome to have to restrict an LLM to only ever be interacted with using carefully crafted, safe conditionals.
↩︎
Another potentially tricky issue with any of these approaches of training a predictive model to claim to be a human is that if it’s actually not well-described as a predictive model—and is instead e.g. well-described as an agent—then we would actively be training it to lie in telling us that it’s a human rather than an AI, which could result in a substantially less aligned agent than we would have gotten otherwise. Furthermore, such patches could also lead to an assistant that is more fragile and susceptible to jailbreaking due to the limitation of having to go through predicting some human.
↩︎
Notably, this does also open us up to some additional inner alignment concerns related to doing the internal cognitive resource management in the right way, which we discuss later.

Conditioning Predictive Models: The case for competitiveness