This is the fifth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper.

5. Interactions with other approaches

Imitation learning

One very related approach is imitation learning: rather than try to predict humans, as we are proposing, one could simply train to imitate them instead. Such an approach would have many of the same safety benefits, since it would also be exclusively trying to produce outputs from safe humans.

The basic problem with such an approach, however, is that there’s no reason to believe that a model trained via pure imitation learning would generalize beyond the capability level of the human(s) it is trained to imitate. While using predictive models to predict humans also cannot produce outputs that humans would never be able to generate, it can produce outputs that no humans that it has ever previously seen would be able to generate, since it might e.g. predict that such humans will exist under some conditional.

Thus, we think that predictive modeling at least has the potential to be just as safe as imitation learning while being able to generalize to substantially more advanced capabilities—though, similarly to imitation learning, predicting humans still cannot elicit capabilities beyond those that any conceivable human would be capable of, as we discussed previously.

Supervised fine-tuning

For some conditionals we might have a very precise notion of what we want the model to observe (e.g. “exactly this image coming from this camera”). Ideally, this sort of conditional should be straightforwardly implementable via prompting, just by fixing the relevant tokens in the model’s context window.^[1] However, at least for current models, prompting has some basic structural limitations: for example, if you want to condition on something very long, context window length could start to become quite problematic. In that sort of case, it might be quite helpful to instead turn to supervised fine-tuning, fine-tuning on the observation to condition on rather than including it in a prompt. Effectively, this sort of fine-tuning lets you give the model substantially more bits of evidence for it to condition on than is possible via just prompting.

For the most part, we think this is likely to be basically fine, since it’s essentially continuous with pre-training: if we think that pre-training produces the sort of predictive model we want, then including some extra pre-training-style data and fine-tuning on it should do the same. The primary concern here, however, would be situations where the fine-tuning data is for some reason not very continuous with the pre-training data.

One way that the fine-tuning data could be substantively different than pre-training is if it directly depends on the model itself—e.g. fine-tuning on the model’s own outputs. Not only is this substantially less continuous with pre-training, but it also specifically raises the risk of the model imitating AIs and/or producing self-fulfilling prophecies.

Such fine-tuning could also be particularly problematic if the data is specifically selected according to some criterion other than actual representativeness of the world, that is, if there’s no clear “camera” that corresponds to how the data was collected. Probably the most notable way this could happen is via reinforcement learning (RL), specifically the practice of fine-tuning on trajectories selected to have high reward, e.g. as in OpenAI’s FeedME approach. In our view, this is likely to be essentially the same as just doing the RL update directly, which we’ll discuss next.

RL fine-tuning

Reinforcement-learning-based fine-tuning approaches—particularly reinforcement learning from human feedback (RLHF)—provide a potentially very flexible way to condition models. In particular, we often want to employ indirect conditionals—e.g. “We observe some sequence satisfying the following conditions: …”—rather than the more direct conditionals we can get via prompting—e.g. “We observe the following sequence: …”. Such indirect conditionals are exactly the sort of thing that RL fine-tuning approaches might be able to provide for us, since we can use the reward to define an acceptance condition and then fine-tune the model to produce output that satisfies that condition.

As we discussed previously, however, the primary problem is that it’s very hard for us to know whether the result of that fine-tuning procedure will actually be well-described as a predictive model or not. Previously, we introduced the RLHF conditioning hypothesis to describe the hypothesis that RLHF is well-modeled as producing a predictive model that is implementing a particular conditional of a pre-trained distribution, rather than e.g. producing some sort of agent. In this section, we’ll discuss some of the concrete factors in RLHF implementations that might affect how likely the RLHF conditioning hypothesis is to be true, e.g. the presence or absence of KL penalties.

Though we’ll primarily be focusing on cases where the desired outcome is for the RLHF conditioning hypothesis to be true—and are worried about situations where it might not hold—the opposite could also be a problem for other RLHF-based approaches. If the intention is to use RLHF to produce some sort of non-predictive model, but the result is always some sort of conditioned predictive model, that could result in substantially different safety properties than expected. As a result, even if one thinks using RLHF to produce a non-predictive model is likely to be safer or more competitive than the sorts of careful conditioning approaches we describe, if the RLHF conditioning hypothesis is true, there might be no such alternative actually available using current techniques, making careful conditioning the only viable path.

Furthermore, even if the RLHF conditioning hypothesis is true, there’s also the issue of actually being able to control which conditional we get from any particular RL fine-tuning process. Say, for example, we fine-tune a predictive model on a “helpfulness” objective. We could get a conditional like “This is a very helpful AI”, but we could also get “This is an AI pretending to be helpful”, “The following text is helpful”, or any number of other variations that are all consistent with good performance on the RL objective. Though explicitly training your model to act as an AI is likely not a good idea (since it leads it to predict what an AI would do, as we’ve discussed extensively previously), this sort of unidentifiability persists essentially regardless of what sort of conditional you’re trying to use RL fine-tuning to access.

KL penalties

One nice fact about RL fine-tuning is demonstrated in “RL with KL penalties is better seen as Bayesian inference.” Korbak et al. demonstrate that, when a KL penalty is used, RL fine-tuning approaches a form of variational Bayesian inference as it converges, with the pre-trained model as the prior. More specifically, the result is that maximizing the RL + KL objective is equivalent to minimizing the KL divergence between the model and the pre-trained model updated on the reward. This is equivalent to a variational approximation of a Bayesian update with the pre-trained model as the prior and the reward specifying the log likelihood. The strength of the update is controlled by the strength of the KL penalty: with no penalty the result is an infinite update (i.e. the observation is “absolute truth”), whereas with strong penalties the update is modest.
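To make the correspondence concrete, here is a minimal numerical sketch on a toy discrete distribution. It assumes the standard form of the KL-regularized objective, $E_\pi[r] - \beta \, D_{KL}(\pi \| \pi_0)$, whose maximizer is $\pi^*(x) \propto \pi_0(x) \exp(r(x)/\beta)$; the six-outcome prior, the random rewards, and all function names are illustrative assumptions of ours, not anything taken from Korbak et al.’s code.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; terms with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_regularized_objective(pi, prior, reward, beta):
    """Expected reward minus beta times the KL penalty to the prior."""
    return float(pi @ reward) - beta * kl(pi, prior)

def tilted_posterior(prior, reward, beta):
    """prior(x) * exp(reward(x) / beta), normalized: the claimed optimum."""
    logits = np.log(prior) + reward / beta
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
prior = rng.dirichlet(np.ones(6))   # toy stand-in for the pre-trained model
reward = rng.normal(size=6)         # toy reward for each of the 6 outputs
beta = 0.5                          # strength of the KL penalty

posterior = tilted_posterior(prior, reward, beta)
best = kl_regularized_objective(posterior, prior, reward, beta)

# Randomly drawn alternative policies, to compare against the closed form.
candidates = rng.dirichlet(np.ones(6), size=1000)
best_candidate = max(
    kl_regularized_objective(c, prior, reward, beta) for c in candidates
)
print("closed-form optimum:", best, "best random policy:", best_candidate)

# A weak penalty (small beta) moves far from the prior; a strong one barely moves.
for b in (0.05, 0.5, 50.0):
    print("beta =", b, "KL from prior =", kl(tilted_posterior(prior, reward, b), prior))
```

In this toy setting no sampled policy should beat the closed-form posterior, and the KL distance from the prior shrinks as the penalty gets stronger, which is the sense in which the penalty controls the size of the update.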

If Korbak et al.’s theoretical result holds in practice, it would mean that doing RLHF with a KL penalty would be an effective way of ensuring that the RLHF conditioning hypothesis holds.

Furthermore, there is another reason to use KL penalties in RLHF as well: the kinds of conditionals we usually want RLHF to implement tend to be the sorts of conditionals where we don’t just want to unboundedly maximize the reward. In particular, we might only think the reward makes sense up to a point and don’t want to just maximize the human’s/discriminator’s confidence, since that could be quite adversarial. For example, we often want to do RLHF with objectives of the form “observe something satisfying the following criteria” or “observe a world that has more of property X than ours.” For these sorts of conditionals, it seems especially important to have some sort of KL penalty to prevent the model from just attempting to purely maximize them.

In practice, however, explicit KL penalties often don’t change much, at least relative to stopping early when a particular KL threshold has been reached. Specifically, in “Scaling Laws for Reward Model Overoptimization,” Gao et al. find that explicit KL penalties let you extract more proxy reward for the same KL distance, but don’t get you any additional true reward. That said, this observation is somewhat orthogonal to the RLHF conditioning hypothesis, which is not about how much reward the model gets but rather how it generalizes—and the fact that KL regularized models get more proxy reward implies that they do learn a different policy.

Nevertheless, it does seem that the most likely explanation here is just that it doesn’t matter much whether you do explicit or implicit KL regularization as long as some sort of KL regularization is done. In particular, even when no explicit KL regularization term is used, early stopping based on KL is still a form of implicit KL regularization—and, as Gao et al. point out, the structure of the standard proximal policy optimization (PPO) RL objective also includes a step-wise KL regularization term.
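As a rough illustration of what implicit regularization via early stopping looks like, the sketch below runs exact policy-gradient ascent on a four-outcome softmax policy with no KL term in the loss and simply stops before the KL budget would be exceeded. The prior, rewards, budget, and learning rate are all made-up values, and this is not meant to reproduce Gao et al.’s setup or PPO.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """D_KL(p || q) for discrete distributions; terms with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def finetune_with_kl_budget(prior, reward, kl_budget, lr=0.1, max_steps=10_000):
    """Gradient ascent on expected reward only (no explicit KL penalty).

    Training stops just before a step that would push
    D_KL(policy || prior) past kl_budget.
    """
    logits = np.log(prior)
    for _ in range(max_steps):
        pi = softmax(logits)
        # Exact gradient of E_pi[reward] with respect to the softmax logits.
        grad = pi * (reward - pi @ reward)
        proposal = logits + lr * grad
        if kl(softmax(proposal), prior) > kl_budget:
            break  # early stopping acts as the regularizer
        logits = proposal
    return softmax(logits)

prior = np.array([0.4, 0.3, 0.2, 0.1])    # toy pre-trained distribution
reward = np.array([0.0, 0.2, 0.5, 1.0])   # toy reward per output
policy = finetune_with_kl_budget(prior, reward, kl_budget=0.2)

print("expected reward before:", prior @ reward, "after:", policy @ reward)
print("KL from prior:", kl(policy, prior))
```

The resulting policy has higher expected reward than the prior while staying inside the KL budget, even though the loss itself never mentions KL.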

Though we don’t believe that Korbak et al.’s formal mathematical result of convergence to variational inference holds with either of these forms of implicit KL regularization, such limiting results are only suggestive of in-practice behavior anyway, and in practice we think Gao et al. is at least suggestive that the difference between implicit and explicit KL may not matter very much.^[2]

How do rewards correspond to conditionals?

Some reward signals correspond to straightforward conditionals. For example, if the reward is “1 if there is a cat in the camera frame, 0 otherwise” then maximizing reward is equivalent to conditioning on a cat being in the frame.
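A toy version of the cat example, under the assumption that fine-tuning converges to the KL-regularized optimum $\pi_0(x) \exp(r(x)/\beta)$ up to normalization: as the penalty $\beta$ shrinks, the 0/1 reward turns into exact conditioning on the cat being in frame. The four frames and their prior probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical prior over four possible camera frames.
frames = ["cat on sofa", "cat outside", "empty room", "dog outside"]
prior = np.array([0.10, 0.05, 0.60, 0.25])
has_cat = np.array([1.0, 1.0, 0.0, 0.0])  # reward: 1 if a cat is in frame, else 0

def tilted_posterior(prior, reward, beta):
    """prior(x) * exp(reward(x) / beta), normalized."""
    w = prior * np.exp((reward - reward.max()) / beta)
    return w / w.sum()

# Exact Bayesian conditioning on "there is a cat in the frame".
conditioned = prior * has_cat / (prior * has_cat).sum()

for beta in (1.0, 0.1, 0.01):
    post = tilted_posterior(prior, has_cat, beta)
    print("beta =", beta, "mass on cat frames =", post[:2].sum())

print("exact conditional:", dict(zip(frames, conditioned)))
```

Note that the relative probabilities of the two cat frames are untouched at every $\beta$; the reward only shifts mass between the cat and non-cat frames, which is what we would want from a clean conditional.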

By contrast, many reward signals do not correspond to a cleanly-interpretable conditional. Consider rewarding a model based on how beautiful humans think the simulated camera images are. The model could learn to aim at the abstract concept of beauty, equivalent to a conditional like “the world is more beautiful than the training distribution suggested,” but it could instead learn specifically what the human rater thinks is beautiful, learn to condition on a sycophantic human that wants the human rater’s approval, or any number of other possible conditionals that produce equivalently high reward.

The problem here is that—even if we assume we always get some conditional—understanding the correspondence between what rewards we select and what conditionals we get could be quite tricky. To be able to have confidence in the safety of a predictive model, we think it’s critical that we understand what world the predictive model is predicting, and so we would be excited to see more work on understanding the resulting correspondence between rewards and conditionals.

Nevertheless, we think there are some things we can do to increase our confidence in terms of what conditional we think we’ll get. In particular, in the Korbak et al. correspondence, rewards correspond to logprobs of the resulting conditional—thus, an obvious thing we can do is explicitly generate our rewards based on the log of an evaluation of a probability. Specifically, we could extract a human estimate for the probability that a given output has some binary property, then explicitly produce our reward signal using the log of that probability. If this isn’t possible, at least using a bounded reward seems useful, since it limits the size of the possible update from the pre-trained distribution in the Korbak et al. correspondence.
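The following sketch shows both suggestions on a three-outcome toy example, again assuming the idealized limit in which the fine-tuned model is $\pi_0(x) \exp(r(x)/\beta)$ up to normalization. The prior, the human probability estimates, and the clipping floor are hypothetical numbers chosen for illustration.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])         # toy pre-trained distribution over 3 outputs
p_property = np.array([0.9, 0.5, 0.01])   # hypothetical human estimates of P(property | output)

def update(prior, reward, beta=1.0):
    """Idealized result of KL-regularized fine-tuning on the given reward."""
    w = prior * np.exp(reward / beta)
    return w / w.sum()

# Reward = log of the estimated probability, beta = 1: this is exactly Bayes' rule.
bayes = prior * p_property / (prior * p_property).sum()
from_reward = update(prior, np.log(p_property))
print("Bayes posterior:  ", bayes)
print("log-prob reward:  ", from_reward)

# Bounded reward: clipping to [r_min, 0] caps how far any output's
# log-probability can move relative to any other output's.
r_min = -2.0
clipped_reward = np.clip(np.log(p_property), r_min, 0.0)
post = update(prior, clipped_reward)
log_ratio = np.log(post / prior)
spread = log_ratio.max() - log_ratio.min()   # cannot exceed (0 - r_min) / beta
print("spread of log-ratio shifts:", spread, "bound:", -r_min)
```

With the log-probability reward the update matches the Bayesian posterior on the binary property, and with the clipped reward the update is bounded by the reward range divided by $\beta$.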

Mode collapse

One concrete difference between RL fine-tuned models and pre-trained models that is pretty well-understood is that the former often exhibit mode collapse. That is, once an RL fine-tuned model finds a policy that achieves high reward, it gets stuck, failing to explore other (possibly equally good) policies. For example, a model trained to avoid doing harm might become useless, refusing to help even on innocuous queries (e.g. section 4.4 in Askell+22).

On its own, such a phenomenon isn’t necessarily problematic—but it is a way in which RL fine-tuned models systematically diverge from pre-trained models, and in a way that they shouldn’t diverge in the Korbak et al. limit. Thus, this phenomenon is at least suggestive of RL fine-tuned models potentially having other systematic differences—both from pre-trained models and from the Korbak et al. limit—that might be more problematic.

That being said, there is at least a clear conceptual reason why RLHF models would diverge from the correct limiting behavior here. Theoretically, pre-training loss is proportional to the KL penalty $D_{KL}(H \| M)$, where $H$ is the human distribution and $M$ is the AI distribution. This measures the ability of the model to assign a high probability to everything that the human says, but importantly doesn’t guarantee that the human would assign a high probability to everything the model says, except to the extent that the model ends up assigning slightly too low probabilities to the human text because it’s distributing its probability mass elsewhere. As a result, pre-trained models have a very wide variety of outputs, including those that no human would ever say. RLHF loss, on the other hand, if we think of “human approval” as measuring probability on the human distribution, is proportional to the opposite KL penalty $D_{KL}(M \| H)$, which measures whether the human would assign a high probability to everything the model says but doesn’t guarantee (except in the limit) that the model actually says everything the human would say.
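The asymmetry between the two KL directions can be seen in a small grid-search sketch. Here $H$ is a made-up bimodal distribution on the integers 0 to 20, and $M$ is restricted to a single discretized Gaussian bump, a deliberately limited model family that we assume purely for illustration.

```python
import numpy as np

x = np.arange(21)  # outcomes 0..20

def gaussian_on_grid(mu, sigma):
    """Discretized Gaussian bump, normalized over the grid."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(p, q):
    """D_KL(p || q); returns inf if q is zero somewhere p is not."""
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy "human" distribution H with two well-separated modes.
H = 0.5 * gaussian_on_grid(5, 1.0) + 0.5 * gaussian_on_grid(15, 1.0)

# Candidate single-bump models M, indexed by (mean, width).
params = [
    (mu, sigma)
    for mu in np.arange(0, 20.5, 0.5)
    for sigma in np.arange(0.5, 10.5, 0.5)
]

# Pre-training-style objective: D_KL(H || M), penalizes missing anything H says.
forward_fit = min(params, key=lambda p: kl(H, gaussian_on_grid(*p)))
# RLHF-style objective: D_KL(M || H), penalizes saying anything H would not.
reverse_fit = min(params, key=lambda p: kl(gaussian_on_grid(*p), H))

print("forward KL fit (mean, width):", forward_fit)
print("reverse KL fit (mean, width):", reverse_fit)
```

In this toy case the forward-KL fit is a wide bump centered between the two modes, putting substantial mass on outcomes $H$ almost never produces, while the reverse-KL fit is a narrow bump sitting on a single mode, which is the mode-collapse-like behavior described above.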

We don’t think it’s particularly necessary for safety research effort to go into the sub-problem of mode collapse, as it’s also an issue for capabilities researchers and we expect the solutions they develop to naturally translate into the safety applications of interest. Furthermore, we think mode collapse might even be a desirable property from a safety perspective—and thus in fact an argument in favor of RLHF—since having a model that spreads its probability mass around enough to often say things that no human would ever say could be a serious problem if you’re trying to get your model to actively predict humans.

Decision transformers

An alternative to standard RL fine-tuning is to train a decision transformer, a model that predicts what reward it will get before producing the output that has that reward, thus allowing high reward trajectories to be sampled via conditioning on a high reward output. Decision transformers can be trained via supervised learning rather than explicit RL updates, which might make them less problematic—but as we discussed previously, even if we’re doing supervised fine-tuning, if it’s not continuous with pre-training there’s no particular reason to believe that doing so preserves the safety properties of the pre-trained model.

However, there is another reason to like decision transformers as well, which is that they give us substantially more control over how we condition on high reward: in particular, we get to condition on exactly the level of reward that we want, rather than just “high reward” in general. This could give us substantially more ability to stick precisely to the capability elicitation frontier, as we discussed previously, rather than accidentally asking the model for more capabilities than it has and thus needlessly exposing us to additional danger for no performance benefit. This usage of decision transformers is effectively treating them as a quantilizer.
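As a very rough sketch of what conditioning on an exact reward level means, the toy below replaces the learned model with a lookup table built from logged (output, reward) pairs. The data, the names, and the table itself are hypothetical stand-ins: a real decision transformer is a trained sequence model that would generalize (and possibly extrapolate) rather than look things up.

```python
import random
from collections import defaultdict

# Hypothetical logged (output, reward) pairs used as "training data".
logged = [
    ("draft A", 1), ("draft B", 1),
    ("draft C", 2), ("draft D", 2),
    ("draft E", 3), ("draft F", 4),
]

# Group outputs by the reward they received.
by_reward = defaultdict(list)
for output, reward in logged:
    by_reward[reward].append(output)

def sample_conditioned_on_reward(target_reward, rng):
    """Toy stand-in for sampling from p(output | reward = target_reward)."""
    if target_reward not in by_reward:
        # The table refuses instead of extrapolating to reward levels
        # it has never seen; a learned model would give no such warning.
        raise ValueError(f"no logged outputs with reward {target_reward}")
    return rng.choice(by_reward[target_reward])

rng = random.Random(0)
print(sample_conditioned_on_reward(2, rng))   # one of the reward-2 drafts
print(sample_conditioned_on_reward(4, rng))   # the only reward-4 draft
```

The design point is only that the reward level is an explicit input chosen by the user, so asking for a moderate level is as easy as asking for the maximum.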

That being said, additional control over the level of capabilities that we’re asking for can also be a double-edged sword, as decision transformers can also make it easier to accidentally ask for far more capabilities than are available to be elicited if they are conditioned on rewards much higher than they have ever seen before. As a result, decision transformers could be more dangerous than normal RL fine-tuning in the hands of an uncareful user.

Imitative amplification

As we discussed when we were considering factorization strategies, any sort of factored cognition approach—such as imitative amplification—is very tricky to do with a predictive model, as such approaches necessarily rely on training the model on its own outputs. There are at least two major issues: it increases the probability that the model will predict AIs rather than humans, and it specifically increases the probability the model will predict itself, leading to multiple fixed points and the possibility of self-fulfilling prophecies.

Despite the inherent difficulties in dealing with fixed-point-like behavior, however, we think it is conceivable that imitative amplification could overcome these difficulties. In particular, if the model ends up attempting to predict what the result of the amplification training procedure will be, in some sense that should be no worse than if it didn’t do that, since the result of the amplification training procedure is exactly what we were going to get anyway. In other words: predicting what will happen after we do amplification should be no worse than just doing amplification. In this view, understanding the safety of a predictive model predicting an amplification training procedure should be no different than understanding the safety of the amplification training procedure to begin with. To the extent that the model is just trying to predict what will happen after training, ideally all that should do is just speed up training, making the approach more competitive and just as safe.

There are a couple of potential challenges to this view, however. First, if the model is predicting anything other than the result of the amplification training procedure, none of this analysis holds. For example, if it starts predicting a future malign superintelligence pretending to be in an amplification process, that could be highly dangerous. Furthermore, once any model at any point in the training process starts predicting a malign superintelligence, it could corrupt the entire iterative procedure, since any other model trying to predict the result of the procedure will now include predicting the malign superintelligence.

Second, having models early on in the amplification training procedure predicting what the end result of that procedure will be could change what that end result is relative to if they weren’t doing that. This is precisely how self-fulfilling prophecies can be so dangerous. If the humans doing the decomposition are careful enough to ensure that there are no loops or infinite deferrals, such that all subquestions are strict reductions of the original question, then, in theory, the limit of imitative amplification should only have one fixed point. In practice, however, since we only ever go to finite depth, and since humans might not be able to always produce strict decompositions, multiple fixed points could be possible, in which case which one is reached seems essentially up to how the early predictors make their predictions. And if the way those early models choose fixed points is, for example, based on how predictable they make the resulting world, the resulting fixed points could be highly unsafe.

[1] Something more sophisticated, such as performing inference over longer trajectories, may be necessary if the relevant conditionals do not fit in the context window. This technical detail does not change the basic story though.

[2] It’s also worth pointing out that Gao et al. provide another piece of evidence potentially in favor of the RLHF conditioning hypothesis, which is that model scale seems to mostly change the intercept of the fit for the amount of true reward obtained after RLHF, suggesting that scale primarily operates via improving the baseline prior. If the RLHF conditioning hypothesis holds, pre-training scale operating via improving the baseline prior is exactly what it would predict. That being said, while it’s unclear what other hypotheses regarding what RLHF is doing would have predicted here, it seems quite plausible that they would have predicted the same thing and that this doesn’t actually distinguish much between them.
