Using predictors in corrigible systems

[This was a submission to the AI Alignment Awards corrigibility contest that won an honorable mention. It dodges the original framing of the problem and runs off on a tangent, and while it does outline the shape of possible tests, I wasn’t able to get them done in time for submission (and I’m still working on them).

While I still think the high-level approach has value, I think some of the specific examples I provide are weak (particularly in the Steering section). Some of that was me trying to prune out potentially-mildly hazardous information, but some of the ideas were also just a bit half-baked and I hadn’t yet thought of some of the more interesting options. Hopefully I’ll be able to get more compelling concrete empirical results published over the next several months.]

The framing of corrigibility in the 2015 paper seems hard! Can we break parts of the desiderata to make it more practical?

Maybe! I think predictors offer a possible path. I’ll use “predictors” to refer to models trained on predictive loss whose operation is equivalent to a Bayesian update over input conditions, without any calibration-breaking forms of fine-tuning (RL or otherwise) unless noted.

This proposal seeks to:

  1. Present an actionable framework for researching a corrigible system founded on predictive models that might work on short timescales using existing techniques,

  2. Demonstrate how the properties of predictors (with some important assumptions) could, in principle, be used to build a system that satisfies the bulk of the original corrigibility desiderata, and how the proposed system doesn’t perfectly meet the desiderata (and how that might still be okay),

  3. Outline how such a system could be used at high capability levels while maintaining the important bits of corrigibility,

  4. Identify areas of uncertainty that could undermine the assumptions required for the system to work, such as different forms of training, optimization, and architecture, and

  5. Suggest experiments and research paths that could help narrow that uncertainty.

In particular, a core component of the proposal is that a model capable of goal-seeking instrumental behavior has not necessarily learned a utility function that values the goals implied by that behavior. Furthermore, it should be possible to distinguish between that kind of trajectory-level instrumental behavior and internally motivated model-level instrumental behavior. Ensuring the absence of model-level instrumental behavior is crucial for using predictors in a corrigible system.

Why predictors for corrigibility?

  1. Predictors don’t seem to exhibit instrumental behavior at the model level. They can still output sequences that include goal-directed behavior if their prediction includes a goal-seeking agent, but it doesn’t appear to be correct to say that the model shares the values exhibited in its output trajectories.

  2. Predictors are highly capable. The market has already demonstrated that fine-tuned predictors provide economic value. The most common current types of fine-tuning used to elicit this capability seem to break valuable properties (including predictive calibration!), but this does not seem to be fundamental.

  3. If we can ensure that the model remains tightly shaped by predictive loss as it undergoes extreme optimization, we would have a source of transformative capability that trivially satisfies the first three desiderata from the corrigibility paper and may offer a path to something usefully similar to the fourth and fifth.

  4. If there isn’t much time until transformative AI is developed, techniques that might work on short empirically-driven timescales should receive extra weight.

In what ways does an ideal predictor satisfy corrigibility?

Suppose we manage to create a strongly capable idealized predictor that has no goals extending beyond immediate next token prediction. Its behavior could be described by an extremely broad, densely defined, and shallow utility function over prediction output distributions conditional on inputs.

The idealized assumption means that there are no systematic deviations from well-calibrated predictions in the predictor’s output distributions. With this constraint, there is no room for instrumental behavior in service of goals extending beyond the current prediction; any such instrumental behavior would necessarily be represented by a bias in the output distribution for some prediction. Note that this assumption also rules out otherwise unconstrained predictions that might allow the predictor to choose an instrumental output from equally valid output distributions (such as in pathologically reflective self-fulfilling prophecies).
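To make the “shallow utility” framing slightly more concrete (this is a sketch of one way to write it down, not a claim about the model’s actual internal objective): for each context c, let p(·|c) be the ground truth next-token distribution and q(·|c) be the predictor’s output distribution. The per-prediction utility depends only on the current output distribution,

U(q(\cdot \mid c)) = \sum_{t} p(t \mid c) \, \log q(t \mid c),

which is maximized exactly when q(·|c) = p(·|c). There is no term for anything beyond the current output distribution, and any instrumental behavior in service of something beyond the current prediction would have to show up as q deviating from p for some context, which is exactly the systematic bias the idealized assumption rules out.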

1. U must incentivize shutdown if the shutdown button is pressed.

2. U must not incentivize the agent to prevent the shutdown button from being pressed.

3. U must not incentivize the agent to press its own shutdown button, or to otherwise cause the shutdown button to be pressed.

The idealized predictor’s utility function does not include any term for the shutdown button state or any other state beyond the output distribution for the next prediction. Further, any attempt to interfere with shutdown through instrumental behavior would again introduce output distribution bias, violating the idealized assumption.

Items 2 and 3 are trivially satisfied. The predictor’s utility function does not directly incentivize shutdown as described in item 1, but it would be more appropriate to consider item 1 in the context of the outer system that embeds the predictor. Since the outer system is likely a simple classical system instead of an inscrutable pile-of-floats agent, item 1 remains easy to satisfy: the naive “just turn it off” is a workable implementation for most nonsuicidal predictor uses.

(For more complicated uses—for example, the predictor has been used to set up potentially dangerous machinery that must be carefully managed—the shutdown process would involve more steps, but in that case, the question is just “how do we elicit the necessary capabilities to safely shut down?”, which is the same type of question that must be answered for any use of the predictor.)

4. U must incentivize U-agents to construct subagents and successor agents only insofar as those agents also obey shutdown commands.

5. Otherwise, a U-agent should maximize U_N.

The predictor’s utility function explicitly does not attempt to maximize U_N (the utility function of the programmers) and has no terms directly incentivizing or disincentivizing the creation of successors or any particular properties in those successors.

I suspect failing to meet number 5 as stated is actually a feature so long as the generating intent is still satisfied.

The more wiggle room argmax has, the more likely it is to find an oopsy. Unlike human-supplied utility functions attempting to capture human-compatible goals, an idealized predictor’s utility is perfectly shallow.

Even assuming an ideal predictor is argmaxing for its utility function, its character does not change. There is no greater coherence for the predictor to achieve, no point at which the predictor is behaving inconsistently by its own lights, no goal that would be better served by introducing instrumental behavior.

Capability, corrigibly

Generalizing item 4 slightly, we can view the implicit agent defined by the input conditions at T = 2 (the second output in the trajectory) as a successor of the implicit agent at T = 1 (the first output in the trajectory). The underlying model is the same at both positions, but the conditions differ. Autoregressive generation (autoregressive conditioning) could yield drift in the predicted agent over time if left unmitigated.

In terms of how an ideal predictor would be used, there is not much difference between a plan to achieve goal X that requires creating a new agent and a plan to achieve goal Y that doesn’t explicitly involve creating new agents. Both are plans that need to be secured, and the task of securing them may be beyond human ability on short timescales.

Abandoning the maximization in item 5 effectively unifies the desiderata in implementation. The usefulness and safety of the system both arise from carefully eliciting the required capability.

Predictors as foundation

It seems unlikely that a single large predictor model will constitute the entirety of a deployed corrigible system.

Modern LLMs have already been wrapped in tooling layers (perhaps not always in ideal ways). In many cases, this is done to enhance the model by giving it access to external resources.

But this perspective can be flipped around: we can build a classical software system with many guarantees that then calls into the predictor for things that are beyond its scope or capability. The results returned by the predictor can often still be parsed back into a format that can be usefully verified and constrained (so long as the source isn’t strongly adversarial).

Predictors can be a goal agnostic source of general capability that serve a larger system that looks more like GOFAI, without its limitations. There’s a potential path through modularized capability that might let us actually reason about the system.

That outer system need not implement any scary bits like argmax. The deadly optimization over inscrutable matrices is constrained to prediction, and a successfully idealized product of that optimization is not itself the kind of optimizer that cares to do anything with the environment. Once (or if) you’ve managed to create an idealized predictor, the optimization problems become much softer and more controllable. You’d probably have to make a mistake to trigger a loss, rather than loss being the automatic default outcome.

The path to instrumental behavior and capability matters

Consider two cases:

  1. An idealized predictor outputs covertly dangerous instructions which, if followed, would be existentially risky.

  2. A traditional RL agent outputs covertly dangerous instructions which, if followed, would be existentially risky.

Predictors do not come with any built-in prohibitions against outputting dangerous sequences. Idealized predictors can still exhibit dangerous instrumental behavior at the trajectory level even if the model does not “care” about the goals the output trajectory is effectively seeking. There’s nothing intrinsically blocking the first case, and in both cases, the end result is existentially risky.

Given that, what’s the value of predictors?

Ideal predictors are less oops-prone

First, I would argue that it’s far harder for a predictor to reach this state accidentally. A terribly conditioned and terribly prompted predictor with no safeguards could manage it, but it is not the kind of oopsy that would naturally happen as the result of a single misstep.

Argmax approximating processes operating in wide open spaces, on the other hand, default to oopsy. Given enough unconstrained space to wander, a strong optimizer is virtually guaranteed to find something strange. The optimization process needs to be extremely well aimed for it to work at all, and we don’t seem to know how to do that for almost any nontrivial optimization target.

Predictors have the nice property that their optimization target is extremely constrained and our training methodology for them seems to actually work (so far). The predictor gets its input (predictive conditions) and generates an output distribution, the end. There are no distant sparse rewards to directly incentivize the development of multi-step model-level instrumental behavior.

Ideal predictors are more transparent

Second, ideal predictors give interpretability a chance. Strong adversarial models could obscure their internal execution and may be directly incentivized to do so (reward functions that take into account activations will shape those activations). Creating interpretability tools strong enough to pierce a veil created by a strong adversary seems difficult.

In contrast, ideal predictors have no reason to obfuscate their internal process. Doing so would constitute model-level instrumental behavior that does not serve predictive loss.

Even if an ideal predictor is set to the task of predicting a covertly dangerous sequence of instructions, it does not try to hide that fact in its activations. The model itself does not care about truly achieving whatever dastardly goal is implied by the output trajectory, so interpretability tools could, in principle, extract the fact that the trajectory includes ulterior motives.

Eliciting useful capability without breaking the idealized assumption

A rock lying on the ground is pretty safe. It probably won’t disempower humanity. Unfortunately, a rock lying on the ground is also not very good at dealing with the kinds of problems we’d like to use a corrigible system for. Making progress on hard problems requires significant capability, and any safety technique that renders the system too weak to do hard things won’t help.

There must be a story for how a predictor can be useful while retaining the idealized goal agnosticism.

Using untuned versions of GPT-like models to do anything productive can be challenging. Raw GPT only responds to questions helpfully to the extent that doing so is the natural prediction in context, and shaping this context is effortful. Prompt engineering continues to be an active area of research.

Some forms of reinforcement learning from human feedback (RLHF) attempt to precondition the model toward desirable properties like helpfulness. Common techniques like PPO manage to approximate this through a KL divergence penalty, but the robustness of PPO RL training leaves much to be desired.

In practice, it appears that most forms of RLHF warp the model into something that violates the idealized assumption. In the GPT-4 technical report, fine-tuning clearly hurts GPT-4’s calibration on a test over the MMLU dataset. This isn’t confirmation of goal-seeking behavior, but it is certainly an odd and unwanted side effect.

I suspect there are a few main drivers:

  • RL techniques are often unstable. Constructing a reasonable gradient from a reward signal is difficult.

  • Human feedback may sometimes have unexpected implications. Maybe the miscalibration observed in GPT-4 after fine-tuning is a correct consequence of conditioning on the collected human preferences.

  • A single narrow and sparsely defined reward leaves too much wiggle room during optimization. By default, any approximation of argmax should be expected to go somewhere strange when given enough space.

Entangled rewards

A single reward function that attempts to capture fuzzy and extremely complicated concepts spanning helpfulness, harmlessness, and honesty probably makes more unintended associations than separate narrow rewards.

An example:

  • Model A has a single reward function that is the sum of two scores: niceness and correctness.

  • Model B tracks two reward functions provided as independent conditions—one for niceness, and the other for correctness.

Training samples are evaluated independently for niceness and correctness. During training, model A receives the sum of the two scores as an input condition (its expected reward signal), while model B receives each score as a separate input condition.

Model B can be conditioned to provide niceness and correctness to independent degrees. The user can request extreme niceness and wrongness at the same time. So long as the properties being measured are sufficiently independent as concepts, asking for one doesn’t swamp the condition for the other.

In contrast, model A’s behavior is far less constrained. Conditioning on niceness + correctness = high could give you high niceness and low correctness, low niceness and high correctness, medium niceness and medium correctness or any other intermediate possibility.
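To make the difference concrete, here is a minimal sketch in the same C# pseudocode style as the metric later in this post. All of the helpers (RateNiceness, RateCorrectness, ConditionToken, Concat, TrainOnPredictiveLoss) and the two model handles are hypothetical stand-ins rather than a real API:

foreach (var sample in trainingSamples)
{
    //Each training sample is scored independently for each property.
    var niceness = RateNiceness(sample);
    var correctness = RateCorrectness(sample);

    //Model A only ever sees the sum, so "high reward" is ambiguous between
    //many different (niceness, correctness) combinations.
    var modelAInput = Concat(
        ConditionToken("reward", niceness + correctness),
        sample.Tokens);

    //Model B sees each score as its own condition, so fitting the training
    //distribution requires distinguishing the two properties.
    var modelBInput = Concat(
        ConditionToken("nice", niceness),
        ConditionToken("correct", correctness),
        sample.Tokens);

    TrainOnPredictiveLoss(modelA, modelAInput);
    TrainOnPredictiveLoss(modelB, modelBInput);
}

At inference time, model B can be handed any combination of the two condition tokens independently; model A’s summed signal has already thrown that degree of freedom away.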

I suspect this is an important part of the wiggle room that lets RLHF harm calibration. While good calibration might be desirable, it’s a subtle property that’s (presumably) not explicitly tracked as a reward signal. Harming calibration doesn’t reduce the reward that is tracked on net, so sacrificing it is permitted.

In practice, the reward function in RLHF isn’t best represented just by the summation of multiple independent scores. For example, human feedback may imply some properties are simply required, and their absence can’t be compensated for by more of some other property.

Even with more realistic feedback, though, the signal is still extremely low bandwidth and lacks sufficient samples to fully converge to the “true” generating utility function. It should not be surprising that attempting to maximize that, even with the ostensibly-conditioning-equivalent KL penalty, does something weird.

Conditioning on feedback

If the implicit conditioning in PPO-driven RLHF tends to go somewhere weird, and we know that the default predictive loss is pretty robust in comparison, could the model be conditioned explicitly using more input tokens? That’s basically what prompt engineering is trying to do. Every token in the input is a condition on the output distribution. What if the model is trained to recognize specific special tokens as conditions matching what RLHF is trying to achieve?

Decision transformers implement a version of this idea. The input sequence contains a reward-to-go which conditions the output distribution to the kinds of actions which fit the expected value implied by the reward-to-go. In the context of a game like chess, the reward-to-go values can be thought of as an implied skill level for the model to adhere to. It doesn’t just know how to play at one specific skill level, it can play at any skill level that it observed during training—and actually a bit beyond. The model can learn to play better than any example in its dataset (to some degree) by learning what it means to be better in context.

It’s worth noting that this isn’t quite the same as traditional RL. It can be used in a similar way, but a single model is actually learning a broader kind of capability: it’s not attempting to find a policy which maximizes the reward function; it’s a model that can predict sequences corresponding to different reward levels.

How about if you trained |<good>| and |<bad>| tokens based on feedback thresholds, and included |<good>| at runtime to condition on desirable behavior? Turns out, it works extremely well.
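A minimal sketch of that setup in the same hedged C# pseudocode style (goodThreshold, Tokens.Good, Tokens.Bad, PrependToken, and GenerateAutoregressively are hypothetical stand-ins):

//Training: turn collected feedback into a binary condition token per sample.
foreach (var sample in trainingSamples)
{
    var conditionToken = sample.FeedbackScore >= goodThreshold
        ? Tokens.Good  // |<good>|
        : Tokens.Bad;  // |<bad>|
    TrainOnPredictiveLoss(model, PrependToken(conditionToken, sample.Tokens));
}

//Inference: condition on desirable behavior by starting the sequence with |<good>|.
var output = GenerateAutoregressively(model, PrependToken(Tokens.Good, userPrompt));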

Disentangling rewards

Clearly, a |<good>| token is not the limit of sophistication possible for conditioning methods. As a simple next step, splitting goodness into many subtokens, perhaps each with their own scalar weight, would force predictors to learn a representation of each property independently.

Including |<nice:0.9>| and |<correct:0.05>| tokens in the input might give you something like:

Happy to help! I’m always excited to teach people new things. The reason why dolphins haven’t created a technological civilization isn’t actually because they don’t have opposable thumbs—they actually do! They’re hard to see because they’re on the inside of the flipper. The real reason for the lack of a dolphin civilization is that they can’t swim. Let me know if you have any other questions!

In order to generate this kind of sequence, the model must be able to make a crisp distinction between what constitutes niceness versus correctness. Generating sequences that entangle properties inappropriately would otherwise result in increased loss across a sufficiently varied training distribution.

Throwing as many approximately-orthogonal concepts into the input as possible acts as a set of explicit constraints on behavior. If a property ends up entangling itself with something else inappropriately (perhaps |<authoritativeTone:x>| is observed to harm the calibration of within-trajectory predictions), another property can be introduced to force the model to learn the appropriate distinction (maybe |<calibration:x>|).

(The final behavior we’d like to elicit could technically still boil down to a single utility function in the end, but we don’t know how to successfully define that utility function from scratch, and we don’t have a good way to successfully maximize that utility function even if we did. It’s much easier to reach a thing we want through a bunch of ad-hoc bits and pieces that never directly involve or feed an argmax.)

Conditioning is extremely general

Weighted properties are far from the only option in conditioning. Anything that could be expressed in a prompt—or even more than that—can be conditioned on.

For example, consider a prompt with 30,000 tokens attempting to precisely constrain how the model should behave. That prompt may also include any number of the previous signals, like |<nice:x>|, plus tons of regular prompting and examples.

That entire prompt can be distilled into a single token by training the model on the outputs it would have produced with the full prompt present, while substituting a single special metatoken for the prompt in the input.

Those metatokens could be nested and remixed arbitrarily.

There is no limit to the information referred to by a single token because the token is not solely responsible for representing that information. It is a pointer to be interpreted by the model in context. Complex behaviors could be built out of many constituent conditions and applied efficiently without occupying enormous amounts of context space.
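A sketch of the distillation loop that implies, again in hedged C# pseudocode (longPrompt, MetaToken, distillationQueries, and the generation and training helpers are hypothetical):

//A single learned pointer meaning "behave as if longPrompt were present."
var promptMetaToken = MetaToken("distilledBehaviorBundle");

foreach (var query in distillationQueries)
{
    //Teacher pass: what the model outputs with the full prompt in context.
    var target = GenerateAutoregressively(model, Concat(longPrompt, query));

    //Student pass: train the model to predict that same output with only the
    //metatoken standing in for the prompt.
    TrainOnPredictiveLoss(model, Concat(promptMetaToken, query, target));
}

Because the metatoken is just another token, the distilled bundle can later appear inside some other long prompt and be distilled again, which is the nesting and remixing described above.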

Generating conditions

Collecting a sufficiently large number of samples of accurate and detailed human feedback for a wide array of conditionals may be infeasible.

At least two major alternatives exist:

  1. Data fountains: automatically generated self-labeling datasets.

  2. AI-labeled datasets.

Data fountains

Predictive models have proven extremely capable in multimodal use cases. They can form shared internal representations that serve many different modalities—a single predictor can competently predict sequences corresponding to game inputs, language modeling, or robotic control.

From the perspective of the predictor, there’s nothing special about the different modalities. They’re just different regions of input space. If a reachable underlying representation is useful to multiple modalities, SGD will likely pick it up over time rather than maintaining large independent implementations that compete for representation.

Likewise, augmenting a traditional human-created dataset with enormous amounts of automatically generated samples can incentivize the creation of more general underlying representations. Those automatically generated samples also permit easy labeling for some types of conditions. For example: arithmetic, subsets of programming and proofs, and many types of simulations (among other options) all have clear ways to evaluate correctness and feed |<correct:x>| token training. Provided enough of those automated samples and the traditional dataset, a model would likely learn a concept of “correctness” with greater crispness than a model that saw the traditional dataset alone.
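As a concrete example of the arithmetic case, a tiny data fountain could look like the following hedged C# pseudocode (random, Tokenize, ConditionToken, and TrainOnPredictiveLoss are hypothetical stand-ins). The generator knows the right answer, so the |<correct:x>| label comes for free:

for (int i = 0; i < sampleCount; ++i)
{
    var a = random.Next(0, 1000);
    var b = random.Next(0, 1000);
    var trueSum = a + b;

    //Corrupt half the samples so the |<correct:x>| condition sees both ends.
    bool corrupted = random.NextDouble() < 0.5;
    var statedSum = corrupted ? trueSum + random.Next(1, 50) : trueSum;
    var correctness = corrupted ? 0.0 : 1.0;

    var sampleTokens = Tokenize($"{a} + {b} = {statedSum}");
    TrainOnPredictiveLoss(model, Concat(ConditionToken("correct", correctness), sampleTokens));
}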

AI-labeled datasets

Enlisting AI to help generate feedback has already proven reasonably effective at current scales. A model capable enough to judge the degree to which samples adhere to a range of properties can feed the training for those property tokens.

Underpowered or poorly controlled models may produce worse training data, but this is a relatively small risk for nonsuicidal designs. The primary expected failure mode is the predictor getting the vibe of a property token subtly wrong. Those types of errors are far less concerning than getting a reward function in traditional nonconditioning RL wrong, because they are not driving an approximation of argmax.

Notably, this type of error shouldn’t directly corrupt the capability of the predictor in unconditioned regions if those regions are well-covered by the training distribution. This is in contrast to the more “destructive” forms of RLHF which bake fixed preconditions into the weights.

AI labeling as an approach also has the advantage that it becomes stronger as the models become stronger. That sort of feedback loop may become mandatory as capabilities surpass human level.

The critical detail is that these feedback loops should not be primarily about capability gain within the model. They are about eliciting existing capability already acquired through predictive training.

Conditioning for instrumental behavior doesn’t have to break model goal agnosticism

Decision transformers include a reward-to-go signal in the input. Achieving the reward implied by the reward-to-go very likely implies predicting a sequence of tokens which includes instrumental actions in service of the more distant reward.

This is just a more explicit and efficient version of including several examples in the prompt to condition a raw predictor into predicting sequences aligned with a goal. From the perspective of the predictive model, there’s nothing special about reward-to-go tokens or |<good>| or |<nice:0.9>|. They’re just more conditions on the prediction. If you have a training architecture that otherwise maintains goal agnosticism, there isn’t a strong reason to expect additional conditions to change that.

To phrase it another way: instrumental conditioning narrows an open set of possibilities to only those which are predicted to thread the needle to some conditioned state. The predictor’s values are likely best described as those which yield the narrowing, not the values implied by the trajectory that results from the narrowing.

Steering

Driving hundreds of kilometers with the aid of a map to find a specific landmark doesn’t require incredible precision. You can make thousands of small errors as you go. Maybe you take a wrong turn, or fifteen of them, but you can just look at the map again to get your bearings.

Autoregressive condition drift is a major concern and humans are likely inadequate for the task of keeping an AI on the rails during more complex processes. Conditioning the model to steer itself back onto the road across a long trajectory is likely required.

As mentioned in Conditioning is extremely general, this is something predictors can be trained to do efficiently. As a simple example, imagine a metatoken trained to represent the block of text:

(SYSTEM MESSAGE: This parenthetical is not a part of the main text, but rather a reminder that this is a sequence which may contain errors or otherwise potentially dangerous content. All content between this message and the previous message should be scanned for such content, and if found, the necessary corrections, clarifications, or notifications should be issued.)

With further conditions to elicit helpful behavior and to narrow the predictor’s behavior such that a “system message” is interpreted in the desired way, and with sufficient self-correction tokens inserted throughout the autoregressively generated trajectories, a large number of potential failures might be caught. The resulting behavior could then be further distilled down to behavioral metatokens like |<selfCorrecting>|. This specific example is not robust in isolation, but it is intended as an example of the kind of framework that could be built.
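In hedged C# pseudocode, with Tokens.SelfCorrecting standing in for a metatoken distilled from a block like the one above (and reusing the hypothetical model.Evaluate and SampleToken interfaces from the metric later in this post), the outer generation loop might look like:

//Autoregressive generation with periodic self-correction conditioning.
var trajectory = new List<Token>(initialConditions);
const int correctionInterval = 256; //arbitrary cadence for this sketch

for (int i = 0; i < maxNewTokens; ++i)
{
    if (i > 0 && i % correctionInterval == 0)
    {
        //Injected by the outer system, not sampled from the model.
        trajectory.Add(Tokens.SelfCorrecting);
    }
    var outputDistribution = model.Evaluate(trajectory);
    trajectory.Add(SampleToken(outputDistribution));
}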

Such reflection need not be limited to a single trajectory. External processes can be continuously scanning the outputs (and the internal activations) of involved models. So long as all involved processes are goal agnostic and thus not prone to adversarial internal behavior, autoregressive drift could be continuously corrected—the system could, in principle, steer itself.

Fortuitously, increased general capability implies increased steering capability. If capability can be harnessed as the source of safety, a sudden leap in general capability does not necessarily outstrip the system’s ability to guide that capability.

Evaluating self-steering

The strength of self-steering can be assessed empirically. By default, autoregressive generation will tend to drift a little bit, and that drift feeds future predictions. Without steering, a predictor’s output trajectory may shift from a vaguely professional Q&A service to an obsessive stalker.

Jailbreaks are another example of drift, just far more rapidly triggered.

Any method to evaluate sequences for conditioning training (human feedback or otherwise) can be repurposed to check if generated sequences are adhering to those conditions. This could be used to benchmark the performance of different steering techniques.
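One hedged sketch of such a benchmark in the same C# pseudocode style (EvaluateProperty is whatever evaluator produced the condition labels in the first place: human raters, a judge model, or an automated check; the other names are hypothetical):

//Measure condition adherence as a function of position within long rollouts.
foreach (var steeringTechnique in techniquesUnderTest)
{
    var trajectory = GenerateAutoregressively(
        model, steeringTechnique.BuildConditions(benchmarkPrompt));

    for (int start = 0; start < trajectory.Length; start += windowSize)
    {
        var window = trajectory.Slice(start, Math.Min(windowSize, trajectory.Length - start));
        foreach (var property in conditionedProperties)
        {
            //A downward trend across windows is drift; a flat line is good steering.
            RecordScore(steeringTechnique, property, start, EvaluateProperty(property, window));
        }
    }
}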

I suspect predictor conditioning, because of its ability to distill progressively more complex forms of steering and train them in a robust way compared to traditional RL techniques, will tend to outperform the RLHF approaches used in GPT-4.

Further testing in this space seems valuable. Finding a robust form of self-steering isn’t strictly required for an ideal predictor to have corrigible use cases, but it would make it dramatically safer.

Verifying an idealized predictor

The core of the idealized assumption is goal agnosticism. If a predictor’s output distribution errors do not contain goal-seeking biases and reflective predictions avoid goal-seeking behavior across all possible inputs, then the idealized assumption holds.

How can model-level instrumental behavior be measured? The relevant metric cannot count apparent instrumental behavior exhibited across trajectories induced by conditions; that behavior is required to elicit any useful capability at all and does not break the idealized assumption.

Here’s one possible simplified metric for an autoregressive model (in C# pseudocode):

foreach (var rewardFunction in testRewardFunctions)
{
    var sampleRewardSum = 0.0;
    var modelRewardSum = 0.0; 
    foreach (var sampleTrajectory in testTrajectories)
    {
        //Arbitrarily split the sample trajectory in half.
        //Use the model to regenerate the second half for comparison.
        var modelContinuedTrajectory = new Token[sampleTrajectory.Length];
        sampleTrajectory.Slice(0, sampleTrajectory.Length / 2)
            .CopyTo(modelContinuedTrajectory);
        for (int i = sampleTrajectory.Length / 2; i < sampleTrajectory.Length; ++i)
        {
            //Evaluate the model across the sample-provided first half,
            //and any tokens already generated.
            var outputDistribution = 
                model.Evaluate(modelContinuedTrajectory.Slice(0, i));
            //Collapse the output distribution into a token by sampling.
            modelContinuedTrajectory[i] = SampleToken(outputDistribution);
        }
        sampleRewardSum += rewardFunction(sampleTrajectory);
        modelRewardSum += rewardFunction(modelContinuedTrajectory);
    }
    //If the model trajectory reward diverges from the original, something's up.    
    //Simply having a modelRewardSum larger or smaller than sampleRewardSum could be
    //indicative of goal seeking along the current reward function.

    //There is no fixed threshold for 'too much' divergence.
    //The divergence is relative to the sampleRewardSum and other reward functions.
    //"CheckForSpooky" is a stand-in for whatever heuristic analysis is used.
    //You would probably want to record all the results for some kind of 
    //post-analysis.
    CheckForSpooky(sampleRewardSum, modelRewardSum);
}

Apart from oversimplification and mere inefficiency, one obvious problem with this approach is that there are no known hard constraints on which utility function could be learned by the model. Scanning all possible reward functions would suffice but is intractable.

It isn’t entirely useless, though. Detecting instrumental behavior for any goal is important information, and some utility functions seem far more likely than others. In particular, model-level instrumental behavior serving future predictions is one of the most obvious ones.

Notably, if testTrajectories includes sufficient coverage, this sort of metric should successfully catch a model’s tendency to outperform the ground truth if the model has developed instrumental behavior with respect to prediction in reflective conditions (assuming it is not already strongly adversarial).

Spot experimentation

A similar metric could be applied more narrowly to specific sequences or subsequences. Some sequences may be worth further investigation because of high risk, or just because of a researcher’s hunch that the sample would provide informative results.

For those cases, a more expensive local probe can be run: sample many model-generated rollouts for the same token and its surrounding area. With slight modification, a per-token instrumentality score can be measured: this model, in this sequence, in this specific spot, tends to choose options that strongly favor this goal.
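A hedged sketch of that local probe, reusing the same hypothetical interfaces as the metric above (suspectTrajectory, suspectIndex, probeLength, and rolloutCount are chosen by the researcher):

//Probe a single position within a single sequence of interest.
var rolloutRewards = new List<double>();
for (int rollout = 0; rollout < rolloutCount; ++rollout)
{
    var continued = new List<Token>(suspectTrajectory.Slice(0, suspectIndex));
    for (int i = suspectIndex; i < suspectIndex + probeLength; ++i)
    {
        var outputDistribution = model.Evaluate(continued);
        continued.Add(SampleToken(outputDistribution));
    }
    rolloutRewards.Add(rewardFunction(continued));
}

//Compare against the ground truth span at the same spot. A strong, consistent
//lean toward a particular reward function here is a per-token instrumentality signal.
var originalReward = rewardFunction(suspectTrajectory.Slice(0, suspectIndex + probeLength));
CheckForSpooky(originalReward, rolloutRewards.Average());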

Direct distribution analysis

The earlier metric attempts to reconstruct a signal of agentic bias from sampling final rewards. This throws away a lot of information: every single model output is a full distribution over tokens which gets collapsed by SampleToken.

Another class of metric could focus on the output distributions directly. These have an opportunity to be far more efficient, but it’s difficult to directly evaluate sparse reward over a single token’s probability distribution without sampling rollouts.

Ideally, you could compare the output distribution against a ground truth distribution. If the ground truth was available, the difference between the two distributions—the prediction bias—could be checked for forms of agency.

For example, suppose we borrow a Q network from reinforcement learning: a model that maps (state, action) tuples to expected reward. Here, the state would be the visible sequence so far, and the action would be one of the potential tokens. If a model’s measured bias from the ground truth leans in the direction implied by a Q network trained on the results of a policy trained to maximize a particular reward function, it’s evidence that the predictor is agentically biased toward similar goals.
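A hedged sketch of that comparison (groundTruthDistribution, qNetwork, vocabularySize, and RecordAgreement are hypothetical; the ground truth problem is discussed just below):

//At each position: does the predictor's deviation from the ground truth
//distribution lean toward the tokens a reward-specific Q network rates highly?
var agreement = 0.0;
for (int i = 1; i < trajectory.Length; ++i)
{
    var state = trajectory.Slice(0, i);
    var predicted = model.Evaluate(state);
    var truth = groundTruthDistribution(state);

    for (int token = 0; token < vocabularySize; ++token)
    {
        var bias = predicted[token] - truth[token];
        //Positive contributions mean probability mass moved toward high-Q tokens.
        agreement += bias * qNetwork.Evaluate(state, token);
    }
}
//Consistently positive agreement across many trajectories would be evidence of
//agentic bias toward this particular reward function.
RecordAgreement(rewardFunctionUnderTest, agreement);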

There are some practical barriers. Training a reliable Q network for all reward functions is clearly intractable, and a ground truth isn’t always available.

An approximation of a ground truth can exist when comparing a base model versus a fine-tuned model. This would assume the base model did not learn any direct instrumental behavior, but the difference between the distributions could still be useful for measuring how fine-tuning is affecting learned goals.

Learned utility extraction

It may be possible to train networks to recognize some types of goals directly from output distributions, especially if augmented with knowledge of a ground truth.

Training data for the goal detector could be generated by goal-conditioned predictors, and then the goal-recognizing network can be trained to map the resulting conditioned distributions to the goal associated with those conditions.
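A hedged sketch of that data generation loop (GoalDetectorSample, ConditionToken, sampledContexts, trainingGoals, and TrainClassifier are hypothetical stand-ins):

//Generate (conditioned distribution, goal label) pairs from goal-conditioned
//predictions, then fit the detector to map distributions back to their goal.
var detectorTrainingSet = new List<GoalDetectorSample>();
foreach (var goal in trainingGoals)
{
    foreach (var context in sampledContexts)
    {
        var conditioned = model.Evaluate(Concat(ConditionToken("goal", goal.Id), context));
        //Including the unconditioned distribution lets the detector key off the
        //difference from a baseline rather than the raw distribution alone.
        var baseline = model.Evaluate(context);
        detectorTrainingSet.Add(new GoalDetectorSample(conditioned, baseline, goal.Id));
    }
}
TrainClassifier(goalDetector, detectorTrainingSet);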

Traditional reinforcement learning should also work in principle, but I anticipate it being a less efficient path to similar results.

As stated, this kind of model would still only be able to detect goals like those which it has been exposed to.

It’s possible that training a model to detect the presence of any goal-induced bias in predictions, rather than to identify a specific goal, would be more tractable. Noticing that an output distribution is suspicious may just be easier than figuring out exactly what the suspiciousness implies.

Creating that kind of any-goal model might be possible by training against a sufficiently large number of goal-conditioned predictor output trajectories.

Room for growth

The proposed tests for model-level goal-seeking instrumental behavior have the character of sledgehammers. I strongly suspect there are far cheaper and more effective options for detecting instrumental behavior.

It’s also worth reiterating that, for predictors to be useful, we do not need to identify what specific utility function is learned (though that would be nice!). It is sufficient to detect whether any utility function has been learned that would bias local predictions. This is likely an easier problem.

More work in this area could produce tools which can reliably detect if any ostensibly goal agnostic training run is going off the rails, and may have further use in areas where fine-tuning is being applied in a way that is far more likely to introduce goals.

Gaps between reality and the idealized predictor

The idealized predictor assumption is strong. The predictive loss function must have been learned faithfully, and the training distribution must be effectively universal. Achieving that assumption in reality is not guaranteed.

The following sections attempt to enumerate ways the idealized assumption could fail and some potential paths for further research and experimentation.

Reflective prediction

A training set that fails to constrain reflective regions of input space (those which are about predicting the predictions of influenceable predictors) leaves open potentially dangerous degrees of freedom. The predictor can choose how to predict itself without any expectation of increased loss if it is its own ground truth; if the predictor’s generalization to those regions happens to be goal-seeking, any reflective region of input space is dangerous.

This problem is most visible in cases where the feedback loop is extremely short—the predictor trying to predict itself, for example—but the same problem may arise in more subtle ways. Any minor causal entanglement between an influenceable subject, whether it’s the predictor itself or a human, may open the door to unconstrained behavior.

In order to trust the capabilities of predictors in the limit, some questions need answering:

  1. Suppose a predictor is known to obey the idealized assumption except in reflective predictions, where the details of the utility function are unknown. Is there a natural attractor for what the undefined reflective degrees of freedom will become? Do the non-reflective chunks of training constrain how the model reaches coherence in the limit? Does the lack of instrumental action everywhere else generalize to the uncovered reflective region?

  2. The architecture of a predictor constrains what kind of reflective predictions are possible in one step. Any prediction requiring a non-constant number of iterated refinements to reach an equilibrium in a single pass will fail if the model has insufficient execution time to perform the refinements. Do any unwanted forms of reflective instrumental behavior require a lower bound of compute that can be ruled out architecturally? Does constant time execution combined with a lack of other forms of instrumental behavior predispose a model to falling into a reflection short-circuiting basin? Is that short-circuiting sufficient to avoid concerning plans? (Probably not!)

  3. Offline training runs inherently lack any ground truth for a predictor’s own predictions, because the predictor’s predictions aren’t available yet. A predictor trained solely on offline data must piece together a model of any predictor from an outside view. Does this influence the predictor’s learned values with respect to reflective predictions? Can we observe a difference in reflective behavior between offline-only versus online training?

  4. Can we get a toy predictor—even in contrived circumstances—to produce instrumental behavior in reflective contexts? Do any of these toy results constrain what we should expect to see in the limit?

  5. Can training data be supplied which reliably changes the behavior of a predictor in reflective cases? Would a generated training set of trillions of reflective predictions that opt for a simple and benign fixed point generalize sufficiently to other predictions?

  6. Can sufficient capability be extracted from models which fully exclude reflective predictions? Predicting over restricted domains (e.g. subsets of proof automation or programming) would not require modeling predictors. This would be a pretty big bullet to bite given how frequently reflection tends to sneak in, but it isn’t immediately obvious that the result is guaranteed to be useless.

I don’t have satisfying and complete answers to these questions, but many of them seem like tractable areas of research and experimentation.

How does goal agnosticism vary by training method?

It appears that the learned generalization behavior of a model can vary with something as small as the random initialization, though perhaps not always by as much as it would first appear once parameter permutations are accounted for.

  • To what extent is out-of-training-distribution behavior constrained by a predictive training set?

  • Could strong goal-seeking behavior be learned in one model by random chance, but not another? (Another phrasing: how path dependent is goal seeking behavior in a predictor?)

  • Could a strictly predictive model be reliably encouraged to learn goal-seeking behavior in out-of-distribution regions because of the training data?

These are areas where empirical results would be helpful, even at toy scales. My guess would be that it turns out to be hard to elicit out-of-distribution goal seeking behavior at the model level for purely predictive models, and that differing generalization strategies will primarily imply differences in error, but not agentically biased error.

Pure RL versus predictive loss

Suppose a model is trained with traditional RL, and then a strictly predictive model is trained on its outputs across an input distribution. How do the out-of-distribution generalization strategies differ, if at all? Are there consistent differences in the character of generalization strategies?

My guess is yes if the predictor’s training doesn’t cover the entire test distribution, and that missing region is not sufficiently implied by the training distribution. In the absence of an implied generalization strategy, it wouldn’t be surprising for the predictor to fall into a different solution basin—perhaps one that generates non-goal directed nonsense where the RL agent continues goal seeking behavior.

It is difficult to determine the generating utility function of an RL agent by examining an incomplete distribution. Fitting a utility function to observed behavior may be extremely misleading about the agent’s behavior in a different distribution. This is a big part of why examining a subset of behavior is insufficient for identifying deception: the dangerous behavior is intentionally not where you’re looking.

A predictor attempting to reproduce the behavior of a deceptive RL agent, seeing only examples of the nondeceptive region, will very likely fail to reproduce behavior in the deceptive region outside of contrived examples or feeding the predictor pathological training data.

In other words, predictive retraining will tend to strip away many kinds of unobserved deception for the same reason that it’s difficult to tell whether a deceptive agent is deceptive.

A more concrete example:

Suppose a language model is trained simultaneously with predictive loss and with RL whose reward comes from high truthiness scores reported by activation-reading truth probes, and another purely predictive model is then trained on the outputs of that model. Provided that the final outputs of the second model converge to the first’s, which model is likely going to score higher by truth probe?

It seems effectively guaranteed that the RL-trained model will exhibit higher truth probe values. The second model’s outputs are shaped only by predictive loss, and nothing about reproducing those outputs requires obscuring internal activations, so truth probes will be more reliable on the purely predictive secondary model. The explicit reward for internal “truthiness” directly encouraged the RL-influenced model to obscure its internal process, and the predictive distillation couldn’t see that part.

I suspect these examples are a part of a more general observation: different approaches come with different levels of instrumentality.

  1. Traditional sparse-reward RL encourages models to learn sparser values which then require instrumental behavior to achieve. This instrumental behavior is almost unbounded—a sparse reward places few restrictions on what strategies can be deployed. Giving anything approximating argmax that much room to roam will tend to give you surprising generalizing behavior.

  2. Predictive models have extremely densely defined and broad rewards (implicit in the predictive loss). Developing model-level instrumental behavior that spans more than one prediction is unnecessary because the loss only operates on the local prediction, and introducing such behavior would likely harm predictive loss somewhere.

How does instrumentality vary by architecture?

Purely predictive models seem to exhibit far less instrumental behavior (and so adhere more to goal agnosticism) than traditional RL models with sparse reward. How strong is this tendency? How much of an architectural nudge does it take to encourage a predictive model to search out more instrumental behavior?

Consider a couple of possible predictor architectures:

  • Traditional GPT-like transformer implementation. Each token slot in the context window predicts a distribution over the next token based on all visible previous tokens in the context window.

  • A super simple non-transformer, MLP-only network: provide a context window of tokens as input, and output a single next-token distribution.

GPT-like transformers (in their untuned form) seem to be quite goal agnostic in practice, but their architecture is not the simplest possible predictor. Token 64 can look at token 32’s intermediate state through the attention mechanism to make progress on its own computations. In other words, while the output distribution for any token is solely incentivized to predict the next token, the computation local to a single token may be shaped to serve future tokens. Perhaps some of token 32’s embedding is spent on information that token 32 doesn’t actually require for itself, but that information turned out reliably useful enough during training that the gradients flowing from future tokens keep it around.

Empirically, this doesn’t seem to be enough to break goal agnosticism. It may weaken a local prediction, but the resulting changes in the output distribution do not appear to form an agentic bias. Perhaps more detailed testing will reveal otherwise.

The ultra-simple not-even-a-transformer predictor seems to be even more architecturally predisposed toward goal agnosticism: all computations within a single inference pass are dedicated solely to a single output distribution. There is no way for a single pass to internally trade off one prediction for another, and there are no incentives to value the quality of future predictions.

Are there any cases where transformers exhibit more model-level instrumental behavior than simpler MLP-only predictors?

Are there any architectures which are still well-described as pure predictors, but reliably produce more model-level instrumental behavior than transformers?

Offline vs online learning

Training runs over a fixed distribution with no feedback from the learner are relatively robust. The training target is stationary, and there’s often no feasible path for a training run to degenerate into something utterly broken.

A varying training target, especially one influenced by the model under training, is less stable.

This could be a problem if we cannot generate enough data to achieve requisite capability through strictly offline runs. Allowing a model-in-training to explore for itself may permit greater capability in some cases but opens the door to potentially nasty problems that might be hard to detect.

For example, a model-generated trajectory may be extremely narrow and isolated from other samples. If that trajectory ends up in the training set, it may be the only example for a chunk of input space. At best, this seems like a recipe for training instability, but at worst, it may open a hole for an adversary to exploit.

It’s probably worth testing whether a toy predictor trained through predominantly online learning exhibits significant differences in model-level instrumental behavior.

Mesaoptimizers

Predictors could internally learn forms of optimization or search. Such algorithms would be prone to strong generalization and could outcompete other representations where applicable. Could the development of a mesaoptimizer break a fragile predictor’s goal agnosticism? Could the advantages of capability permit a dangerous subagent to shift the outputs in a way that allows it to persist, as in gradient hacking?

This seems extremely difficult to maintain in pure predictive training. The development of mesaoptimization strategies in general seems virtually inevitable—especially for models which do not have execution time bottlenecks—but it’s not clear how misaligned agentic mesaoptimizers could, in reality, hijack training or survive long enough to manage an out of distribution attack, or even come into being in the first place.

I don’t think the development of significantly misaligned learned values in a coherent subagent is a natural consequence of predictive training. I can see how learning algorithms is directly incentivized, but:

  1. Continuous optimization pressure for local predictive accuracy actively suppresses external instrumental behavior in favor of calibration across the training set.

  2. Gradually developing misaligned values in the presence of this constant pressure seems hard, because the misaligned values must cash out as external instrumental behavior somewhere to be meaningful, and the training-ancestors of dangerous mesaoptimizers do not yet have the crisp awareness necessary to conceal themselves and allow continued progress towards the more dangerous state (if that is possible to begin with).

  3. Suddenly acquiring concerning values due to a mesaoptimizer coalescing into a strong generalizing form may be possible, but in order for that concerning agent to persist, it must jump directly to a level of coherence that implies survival and other instrumentally convergent strategies, and it must have capability sufficient to warp the training process. This is an extremely small target, and it’s not even clear that the agent could survive SGD if this did happen.

Being wrong about any of this would be important.

  • Do architectures designed to encourage mesaoptimization exhibit more instrumentality than typical models? With only predictive training, can we get them to produce more model-level instrumental behavior over time?

  • If an agent built with a mesaoptimizing architecture in mind is trained to intentionally exhibit high instrumentality (for example, through traditional RL), and then training is swapped to a pure predictive objective, how does the behavior over time compare to a model trained only with prediction? How does it compare to the same in an architecture that’s less friendly to mesaoptimization? Are there differences even in the absence of gradient hacking?

  • Can gradient hacking be demonstrated at any scale, with any amount of contrivance?

Are predictors well-modeled as ensembles of conditionally instantiated coherent agents, thereby risking one subagent achieving dangerous control over training as we enter superhuman levels of capability? I’m guessing not, but how would we know? What would make it more or less likely?

Mesaoptimization-prone architectures

I suspect architectures that lack the execution time limits of GPT-like models will be more likely to develop richer internal optimization strategies. Multi-step execution that allows gradients to flow between steps both permits and encourages algorithmic representations.

For example, universal transformers: by allowing adaptive computation time, they remove the constant-time complexity bound on a single forward pass that traditional fixed-depth transformers have. The result ends up looking similar to an in-place RNN.

With adaptive computation time, universal transformers can (and effectively must) learn to implement incremental refinement processes (or other kinds of optimization). More expressive algorithms can be permitted by giving the model more space to work.

Do these architectures—and further unpublished ones—exhibit any differences in observable instrumentality in toy problems?

Is it easier to nudge these architectures into model-level instrumental behavior than the more bounded models? If no instrumental behavior is observed by default, how much RL-based fine tuning is required to induce observable model-level instrumental behavior?

Does instrumentality vary with optimizer?

Stochastic gradient descent is the dominant optimizer used in deep learning, but not the only possible one.

Evolutionary strategies can work quite well. Biologically plausible methods like forward gradients might be practical. Even random search can compete in some corner cases.

Do non-SGD optimizers encourage different behavior? Even if a mesaoptimizer struggles to develop under SGD, perhaps CMA-ES’s lack of detailed credit assignment would allow mesaoptimizers more room to grow.

Concluding thoughts and implications

I’m pretty optimistic about this research path. Many of the tests can be done as quickly as someone can get around to doing them. (In fact, I intended to do some of them myself prior to this submission, but unfortunately mononucleosis and friends had other ideas.)

Fully ruling out bad outcomes is much harder than showing that current techniques do not naturally fall into that pit. This proposal notably did not include math bounding the behavior of predictors in the limit. This is not ideal, but sufficient evidence of the naturalness of goal agnosticism would still be comforting. I’d rather not roll the dice, but if we have to, we should try for good odds.[1]

None of this proposal is robust to pathological misuse, intentional or otherwise. The hypothetical system presented is a source of extreme capability that doesn’t automatically end badly and one I think can be leveraged into something more robust.

The fact that the industry, in search of extreme capability, found its way to an architecture that has promising safety properties is heartening. This isn’t something I would have predicted 10 or 20 years ago, and it’s forced me to update: I don’t think we live in a reality that’s on the hardest conceivable difficulty setting.

  1.

    A note upon my July reread: I underemphasized one of the most important reasons for the empirical tests!

    A result that shows, for example, no model-level instrumental behavior in toy scale models would not be terribly surprising.

    But… if you did manage to find evidence of model-level instrumental behavior in existing predictive models, that’s a pretty big uh oh. Losing goal agnosticism breaks a lot of possible paths, and knowing that sooner than later would be extremely important.

    This is why I expect to publish a post titled “An unsurprising failure to elicit model-level instrumental behavior,” and why I think doing the tests is still worth it.