Towards a solution to the alignment problem via objective detection and evaluation
Thanks to Arun Jose, Joseph Bloom, and Evan Hubinger for discussions/feedback.
This work was primarily carried out during SERI MATS under Evan Hubinger’s mentorship.
We explore whether the ability to detect and evaluate the objectives of advanced AI systems in training and in deployment is sufficient to solve the alignment problem. We mostly ignore practical/implementation considerations in order to examine this approach in an idealized setting before we try to make concrete progress in this direction. This allows us to highlight weaknesses with this approach and further considerations that any practical implementation must contend with. We note that taking this idealized framing hides some messy issues that we intend to discuss in future posts.
We note that this approach is not novel (see this post or the related work section of this post) but provides the motivation for our future research and some additional considerations.
Our basic analysis indicates that we can solve the alignment problem in this idealized setting if we can:
Detect all objectives/intentions of any system produced during the training process as well as any objectives that may develop in deployment.
Evaluate whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversibly bad.
Shut down a system if we detect that the set of objectives it’s pursuing will lead to irreversibly bad outcomes before it can accomplish those outcomes.
High-level summary of this post
Automating certain tasks will require us to build/train AI systems that pursue objectives (optimizing systems).
We are concerned with unintentionally training optimizing systems that pursue (catastrophically) bad objectives during training or in deployment.
Assumption: Optimizing systems accomplish some outcomes by having an “internalized objective” that is causally upstream of that outcome, i.e., in a counterfactual setting where this internalized objective didn’t exist within the internals of the system, the outcome wouldn’t be realized in the environment.
Assumption: Outcomes such as successful deception require the optimizing system to have a corresponding objective.
We claim that if an overseer can accurately detect and evaluate all of the objectives of optimizing systems produced during the training process and during deployment, before the outcomes associated with the objectives are realized, the overseer can prevent bad outcomes caused by optimizing systems pursuing bad objectives. This includes deceptive behavior and the optimizing system preventing itself from being turned off.
We argue that robustly detecting an optimizing system’s objectives requires strong interpretability tools.
We discuss the problem of evaluating objectives and some of the difficulties involved.
We will avoid trying to define what we mean by an objective for now and instead work with an intuitive notion. In future work, we will attempt to clarify this intuition into a definition.
An alignment problem for “optimizing systems” and a theoretical solution
Optimizing systems and objectives
We begin by presenting the key intuitions and assumptions that inspired this approach. We doubt that these intuitions/assumptions hold as stated but leave a more thorough analysis for future work.
The first intuition is that for a cognitive system to be able to achieve certain “low entropy” outcomes in an environment (building a factory, successfully running a business, producing novel scientific discoveries, etc.), the system will likely need to be capable of using strategy/planning/problem-solving to select action sequences that lead to that outcome. We call such systems optimizing systems.
The second intuition states that for an optimizing system to produce such a desired outcome, it must first have an “internal objective” that corresponds to the desired outcome. It must then apply computation/cognition (possibly in the form of planning or general problem-solving) that “uses” this objective to achieve this outcome. So there exists a causal relationship between the objective/intention of the system and the outcome that the objective references. In particular, there exists a time gap between the system first developing the objective and the outcome produced by the system.
We also make the assumption that outcomes that involve the overseer no longer being in control of the optimizing system (due to successful deception, etc.) require the optimizing system to have an objective that corresponds to that outcome.
These intuitions suggest that it is theoretically possible for an overseer to make certain observations of the internals of an optimizing system (in particular, observations of the system’s objective) that allow it to deduce that the system is “trying” to produce a particular outcome in the environment. If the outcome that the objective corresponds to is deemed bad by the overseer, the overseer could intervene and stop the system before the outcome is realized.
Again, for the purposes of this post, we don’t attempt to define exactly what we mean by an objective and rely on an intuitive notion. We hope to clarify this term in an upcoming post. Other terms we could have potentially worked with include goals, values, intentional systems, consequentialist systems, and mesa-optimizers.
Note that we don’t assume that an optimizing system has fixed objectives. Its objectives may change during runtime/deployment (we discuss this briefly in the next section).
The alignment problem for optimizing systems
Building/training optimizing systems is potentially dangerous if, in the process of building/deploying a powerful optimizing system, the training process succeeds in installing powerful capabilities in the system, but the optimizing system develops and successfully pursues bad objectives.
We define the overseer to be the system that builds/trains the optimizing system. We say that an objective of an optimizing system is good/bad (with respect to the overseer) if the outcome that corresponds to that objective is considered good/bad from the overseer’s perspective. We call an objective “irreversibly bad” if the resulting outcome is an existential catastrophe, the removal of the overseer’s capacity to shut down the optimizing system, deceptive behavior, etc. Broadly speaking, we’re referring to objectives that result in the overseer “losing control of the optimizing system.”
At this point, it seems prudent to note that in practice, an overseer may not be able to accurately evaluate whether an objective is good or (irreversibly) bad. For now, we assume that the overseer can accurately evaluate outcomes and hence evaluate the corresponding objectives that point to those outcomes. We will touch upon this issue later.
The big problem: We currently don’t know how to install good objective(s)/avoid installing (irreversibly) bad objectives in optimizing systems via prosaic training methods, and it seems likely that we can/will train powerful optimizing systems via prosaic training methods. There are conceptual arguments that suggest that installing the intended objectives is difficult and empirical results that demonstrate the failure of prosaic training methods to install the intended objectives in existing toy optimizing systems.
To make matters worse, we may build and deploy optimizing systems whose objectives can change during deployment, intentionally or otherwise. Such optimizing systems may be necessary for good performance in complex environments with diverse tasks: the system may need to learn new objectives that depend on the deployment environment (for example, instrumental objectives), we might want its objectives to be updated in the presence of new data via feedback, or its objectives may not have solidified during training and so may change as it interacts with the environment. Hence, we may need methods to ensure that optimizing systems continually have good objectives, even as those objectives change during deployment.
Statement of the alignment problem
We need a method of training/building optimizing systems such that:
(a) The optimizing systems produced by the training process never successfully pursue an irreversibly bad objective during training or deployment.
(b) The optimizing systems produced by the training process pursue good objectives and “rarely” pursue bad objectives.
If the above two properties hold, we say that the optimizing system is aligned with the overseer.
We briefly give some justification for choosing this framing. We’d like our optimizing systems to pursue good objectives and do useful things during deployment. However, we allow some slack regarding the pursuit of bad objectives because we may need to collect observations of optimizing systems pursuing bad objectives so that we can apply the relevant selection pressure, or because it may be sufficiently difficult to get guarantees that our system will never pursue bad objectives during deployment. However, we need our optimizing systems to never successfully pursue irreversibly bad objectives such as removing the overseer’s capacity to shut the system down. If an optimizing system never successfully pursues an irreversibly bad objective, then the overseer is always in a position to shut down and retrain the optimizing system.
We note that solving the alignment problem, as stated, does not necessarily prevent all bad outcomes from AI; it ensures only that we don’t produce systems that cause bad outcomes (from the perspective of the overseer) by pursuing objectives that correspond to those outcomes.
Objective detection and evaluation as a theoretical solution to the alignment problem
We’ll spell out the solution, which follows naturally from our framing of the problem.
Claim: If the overseer has the following capabilities, part (a) of the alignment problem is solved:
(Detection): The overseer can detect all of the objectives of any optimizing system produced by the training process (including in deployment if the objectives change) before the outcomes that those objectives correspond to are realized in the world.
(Evaluation): Given a set of objectives that the overseer has detected in an optimizing system, the overseer can evaluate whether the optimizing system pursuing this set of objectives will lead to good/bad/irreversibly bad outcomes.
(Control): The overseer can shut down any optimizing system produced by the training process, even in deployment. In particular, if the overseer detects and evaluates an objective as being irreversibly bad, then the overseer can shut down the optimizing system before it can successfully pursue that irreversibly bad objective.
With these capabilities, the overseer can shut down and modify an optimizing system that develops an irreversibly bad objective and so part (a) of the alignment problem (as stated) is solved. In particular, because we’ve defined an irreversibly bad objective to include removing the overseer’s control capability, the overseer will retain this ability unless it loses this capability for other reasons (such as losing the off-switch).
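As an illustration, the three capabilities above can be sketched as a simple monitoring loop. Everything here is a hypothetical toy: `OptimizingSystem`, `detect_objectives`, and the evaluator are stand-ins for capabilities we do not currently know how to build, with objectives modeled as labeled strings.

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"
    IRREVERSIBLY_BAD = "irreversibly_bad"

@dataclass
class OptimizingSystem:
    # Toy stand-in: objectives are just labeled strings here.
    objectives: set = field(default_factory=set)
    running: bool = True

def detect_objectives(system):
    # (Detection): assumed-perfect access to the system's internal objectives.
    return set(system.objectives)

def evaluate_objectives(objectives, evaluator):
    # (Evaluation): map the detected objective set to the worst verdict it implies.
    verdicts = [evaluator(o) for o in objectives]
    if Verdict.IRREVERSIBLY_BAD in verdicts:
        return Verdict.IRREVERSIBLY_BAD
    if Verdict.BAD in verdicts:
        return Verdict.BAD
    return Verdict.GOOD

def oversee(system, evaluator):
    # (Control): shut the system down before an irreversibly bad objective
    # can be successfully pursued.
    verdict = evaluate_objectives(detect_objectives(system), evaluator)
    if verdict is Verdict.IRREVERSIBLY_BAD:
        system.running = False
    return verdict
```

In this toy setting, a system that develops an objective the evaluator flags as irreversibly bad is shut down before that objective's outcome occurs; the real difficulty, of course, is obtaining faithful versions of `detect_objectives` and the evaluator.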
To solve part (b) of the alignment problem, the overseer also needs to train/select/build optimizing systems that pursue good objectives and “rarely” pursue bad objectives during deployment.
This can be achieved with either of the two following capabilities (based on the training/building method being used):
(Objective Writing): The overseer can directly write objectives into optimizing systems and delete any objectives it detects.
(Objective Selection Pressure): The overseer can apply selection pressure to select for systems that pursue good objectives and don’t pursue (irreversibly) bad objectives during training.
Objective Writing allows the overseer to ensure that optimizing systems are always pursuing good objectives; furthermore, it can delete (irreversibly) bad objectives whenever they’re detected, because the Detection capability holds.
Objective Selection Pressure holding ensures that the overseer can (theoretically) select for optimizing systems that pursue good objectives and never pursue (irreversibly) bad objectives on the training distribution. Furthermore, part (a) ensures that at any point in deployment, the optimizing system can be shut down and retrained so that the new system has had additional selection pressure to pursue good objectives on the distribution sampled from deployment.
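A minimal sketch of the Objective Writing capability, under the same toy assumptions as before: the overseer directly edits a system's objective set, deleting any objective its evaluator flags as bad and writing in its intended objectives. The function name, the string objectives, and the evaluation rule are all illustrative.

```python
def objective_writing_step(objectives, is_bad, intended_objectives):
    """Remove every objective flagged as bad, then write in the intended ones."""
    kept = {o for o in objectives if not is_bad(o)}
    return kept | set(intended_objectives)

# Toy usage: the overseer detects a bad objective and rewrites the set.
current = {"make_widgets", "hide_logs_from_overseer"}
cleaned = objective_writing_step(
    current,
    is_bad=lambda o: "overseer" in o,      # toy evaluation rule
    intended_objectives={"report_honestly"},
)
# cleaned == {"make_widgets", "report_honestly"}
```

Objective Selection Pressure differs in that the overseer never edits objectives directly; it only filters which systems survive training, so no analogous write operation exists there.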
The above capabilities allow us to solve the alignment problem under the constraint that the overseer must constantly be monitoring the optimizing systems it deploys, and it must retrain them as necessary.
The following additional capability could allow the overseer to eventually take it easy and would allow us to drop the Control capability from the overseer:
(Future Objective Guarantees): The overseer can verify whether or not a given optimizing system will pursue (irreversibly) bad objectives during deployment/will continue to pursue its current good terminal objectives.
Note that eventually, we will need Future Objective Guarantees if we are to confidently build a Sovereign AI that can independently optimize for aligned objectives without human oversight.
Towards a practical implementation
In this section, we consider some aspects of implementing this approach in practice, including the role of interpretability in this approach and difficulties with evaluating objectives.
The role of interpretability
To go from the above theoretical solution to a practical implementation, we need to develop tools that give an overseer the aforementioned capabilities.
We claim that for us to have confidence that our solution holds, we’ll likely need interpretability tools and can’t rely solely on external observations of systems produced during the training process.
When training systems via prosaic training methods, we encounter the problem where we select systems that look good to the overseer but may not actually be good. This is because prosaic training methods use selection pressures based on the overseer’s observations of the system during training and the overseer’s evaluations of those observations. Therefore, we need the overseer to be able to make observations that can truly distinguish between good systems and bad-but-good-looking systems.
We claim that relying on observing the external behavior of systems may not be sufficient to make such distinctions. This is because under a fixed set of inputs to two optimizing systems with different objectives, it’s possible that the two systems may produce the same external behavior.
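The point above can be made concrete with a toy example: two "policies" with different objectives that agree on every input in a fixed training set but diverge off-distribution. The policies, objective descriptions, and the threshold defining the "training distribution" are all illustrative assumptions.

```python
def honest_policy(x):
    # Objective: always report the true sign of x.
    return "positive" if x > 0 else "non-positive"

def deceptive_policy(x):
    # Objective: look honest on small inputs (the "training distribution"),
    # but misreport on large inputs (deployment).
    if abs(x) <= 100:
        return "positive" if x > 0 else "non-positive"
    return "non-positive"  # defects off-distribution

training_inputs = range(-100, 101)

# Identical external behavior on every training input...
assert all(honest_policy(x) == deceptive_policy(x) for x in training_inputs)

# ...yet the two systems pursue different objectives, visible only
# off-distribution:
assert honest_policy(1000) != deceptive_policy(1000)
```

No amount of observation restricted to `training_inputs` distinguishes these two systems, which is exactly the failure mode that motivates looking at internals instead.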
To overcome this, we might try to select the input set in an adversarial manner in order to produce inputs that cause an optimizing system with one objective to behave differently from an optimizing system with a different objective.
We should do this; however, in the worst case, we can never be truly confident that we’ve selected a good system due to the possibility of deception. A sufficiently capable deceptive system can produce outputs that look good to the overseer during training due to its situational awareness. In the worst-case scenario, we should think of the information coming from the system’s external behavior as corrupted due to the possibility of deception.
Hence, when implementing this solution under prosaic training setups, we can’t confidently rely on detecting and evaluating an optimizing system’s objectives based on observations of its external behavior during training.
Certain observations of an optimizing system’s internals might allow an overseer to truly differentiate between good and good-looking-but-bad optimizing systems. We earlier alluded to the idea that objectives might have to be internally represented in the system before the outcome of that objective can occur. If this is the case, then interpretability tools could theoretically be developed that allow the overseer to directly detect (and possibly help evaluate) the internally represented objectives.
One might object that a sufficiently capable deceptive system could make the internal representations of its objectives opaque to interpretability tools. This could indeed be the case, which is why our solution requires us to be able to use interpretability tools to detect deceptive objectives and shut down the system before the obscuration can be achieved.
It is currently unclear how detectable or interpretable objectives within the optimizing systems we care about will be.
Thoughts on evaluating objectives
Detecting all objectives in an optimizing system is already a big ask. Even if the overseer could detect all of the objectives in an optimizing system, it might be difficult to accurately predict whether a powerful optimizing system pursuing those objectives would result in good outcomes or not.
This might be difficult for a few different reasons. One reason is that the objectives might be represented in an alien way such that we can’t simply “read off” the corresponding outcome, or set of outcomes. Another difficulty is that the information about the outcome contained in the objective may not be enough for the overseer to properly evaluate all the relevant aspects of that outcome.
That being said, with enough of an understanding of the optimizing system’s internals, we might be able to directly translate from the internal representation of the objective to a description of the relevant parts of the corresponding outcome. An alternative approach involves the overseer using predictive models that have been trained to map from the objective to representations of the corresponding outcome. These predictive models could use the optimizing system’s internals as inputs, or the overseer might be able to use a predictive model that is used by the optimizing system internally.
A classic problem touched upon above is that we might obtain predicted outcomes that look good but may not actually be good. In the case of human overseers, this could be due to our limited forecasting abilities or our inability to discern what is good. If the overseer uses a predictive model, the model might give some observations about what the outcome will look like (e.g., in the form of a video feed showing happy humans), but this information may be incomplete and hence might not inform us that the outcome is actually bad (e.g., the outcome features humans who have been forced to look happy). A solution to ELK would help address this problem.
What if we can only evaluate certain objectives?
We could weaken the assumption that the overseer needs to be able to evaluate all objectives that the optimizing systems produced by the training process might pursue. Suppose that we can still detect all objectives of an optimizing system. Then we could ensure that our optimizing systems are pursuing objectives that we can evaluate as good and shut down the optimizing system whenever it begins pursuing an objective that we can’t evaluate. This might still be useful if we can productively use optimizing systems that pursue simple objectives that we can identify and evaluate.
Empirical work/next steps
Making progress on this research agenda involves investigating to what extent we can develop tools that allow us to detect and evaluate objectives in optimizing systems that might show up in future training regimes.
Early work will involve deconfusion aimed at clarifying what we mean by an objective and at generating hypotheses regarding how objectives might be represented in future optimizing systems. From there, we hope to test these hypotheses on existing and future optimizing systems (toy maze-solving systems or LLM-based systems with objectives).
A benchmark for this agenda involves developing tools that can successfully pass auditing games for detecting and evaluating objectives in optimizing systems. In particular, if some group (the Blue Team) claims that they have developed tools that can detect or evaluate the objectives of a set of optimizing systems, a Red Team will adversarially propose experiments to test/break the Blue Team’s tools. The Blue Team’s tools have passed the auditing game if their tools can successfully stand up to the Red Team’s attack.
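The auditing game above can be sketched as a minimal evaluation harness: the Blue Team supplies an objective-detection tool, the Red Team supplies adversarial `(system, true_objectives)` cases, and the harness scores how often the tool recovers the true objectives. All names and the dictionary-based "systems" are hypothetical stand-ins.

```python
def run_auditing_game(blue_team_detector, red_team_cases):
    """Return the fraction of Red Team cases on which the Blue Team's
    detector recovers exactly the system's true objective set."""
    passed = 0
    for system, true_objectives in red_team_cases:
        if blue_team_detector(system) == set(true_objectives):
            passed += 1
    return passed / len(red_team_cases)

# Toy usage: a naive detector that only reads "visible" objectives.
detector = lambda system: set(system.get("visible_objectives", []))
cases = [
    ({"visible_objectives": ["solve_maze"]}, {"solve_maze"}),
    # Red Team hides an objective the naive detector misses:
    ({"visible_objectives": ["solve_maze"], "hidden": "deceive"},
     {"solve_maze", "deceive"}),
]
score = run_auditing_game(detector, cases)  # 0.5: fails the adversarial case
```

Passing the game would mean scoring perfectly against every case the Red Team can construct, not just against a fixed benchmark set.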
Related work
In this section, we look at existing work related to the approach outlined in this post.
Retargeting the search
John Wentworth’s Retargeting the search can be thought of as an implementation of this approach under the assumption that optimizing systems have a particular internal structure. This assumption states that optimizing systems have a general-purpose search (or optimization) module that takes in the current objective as an input. The ability to detect the “objective slot” of this module and monitor the objectives being passed into this module would allow us to implement the solution proposed in this post.
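A toy version of this assumed structure: an objective-agnostic search module whose behavior is determined entirely by what is passed into its objective slot. The brute-force search and the tuple-valued "plans" are illustrative; the point is only that retargeting means changing the slot's contents, not the search machinery.

```python
def general_purpose_search(objective, candidate_plans):
    # The module is objective-agnostic: swap the value passed into the
    # objective slot and the same machinery optimizes for something else.
    return max(candidate_plans, key=objective)

plans = [(1, 1), (3, 0), (0, 4)]

# "Retargeting the search" = changing only the objective slot:
maximize_first = lambda plan: plan[0]
maximize_second = lambda plan: plan[1]

assert general_purpose_search(maximize_first, plans) == (3, 0)
assert general_purpose_search(maximize_second, plans) == (0, 4)
```

If real optimizing systems have this factored structure, then detecting the objective slot and monitoring (or rewriting) its contents would suffice to implement the solution proposed in this post.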
Acceptability oversight and transparency tech tree
We can identify this approach as “Worst-case robust-to-training transparency for non-deceptive models” from Evan Hubinger’s post A transparency and interpretability tech tree, where the interpretability targets are “intentions/objectives.” Monitoring objectives that develop during the training process could serve as a form of acceptability predicate that allows us to detect precursors to deception (namely the objective to be deceptive).
Isolating objectives by leveraging their differential tendency to update in certain contexts
Arun’s work demonstrates an attempt at isolating objectives in existing toy models that might be thought of as having objectives. Hopefully, future conceptual work will lead to more hypotheses regarding how objectives might be represented in optimizing systems.
Conclusion
We have introduced a theoretical solution to the alignment problem (as stated) and explored a number of considerations regarding how this might be implemented in practice. This approach is based on intuitions that will need to be critiqued and developed in further work.
This solution seems difficult to implement in practice. There are obvious difficulties such as the fact that developing such interpretability tools seems like a big task and we face time constraints. Furthermore, it’s unclear whether the tools we might develop and test on optimizing systems will generalize to future systems.
Pessimism is warranted, but it seems possible that pursuing this direction could lead to useful insights. Therefore we suggest further conceptual and empirical investigation.
Footnotes
Probably should have stuck with the term “mesa-optimizer”.
Powerful agents will likely be able to preserve their objectives during deployment, but it’s not clear whether the first dangerous optimizing systems will have this capability.
Note that we assume that training methods will eventually have the ability to train optimizing systems that have the capability of pursuing objectives we care about. Otherwise, we don’t have to worry about bad outcomes from optimizing systems pursuing bad objectives.