Thinking about maximization and corrigibility

Thanks in no small part to Goodhart’s curse, there are broad issues with getting safe/aligned output from AI designed like “we’ve given you some function U, now work on maximizing it as best you can”.

Part of the failure mode is that when you optimize for highly scoring x, you risk finding candidates that break your model of why a high-scoring x is good, and drift away from the things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, by being careful not to disrupt structure/causal relationships/etc. we might be relying on.

Here’s what I’d like to discuss in this post:

  • When unstructured maximization does/​doesn’t work out for the humans

  • CIRL and other schemes mostly pass the buck on optimization power, so they inherit the incorrigibility of their inner optimization scheme

  • It’s not enough to sweep the maximization under a rug; what we really need is more structured/​corrigible optimization than “maximize this proxy”

  • Maybe we can get some traction on corrigible AI by detecting and avoiding internal Goodhart

When does maximization work?

In cases where it just works to maximize, there will be a structural reason that our model connecting “x scores highly” to “x is good” didn’t break down. Some of the usual reasons are:

  1. Our metric is robustly connected to our desired outcome. If the model connecting the metric and good things is simple, there’s less room for it to be broken.
    Examples: theorem proving, compression /​ minimizing reconstruction error.

  2. The space we’re optimizing over is not open-ended. Constrained spaces leave less room for weird choices of x to break the correspondences we were relying on.
    Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks (see the sketch after this list).

  3. The optimization power being applied is limited. We can know our optimization probably won’t invent some x that breaks our model if we know what kinds of search it is performing, and can see that these reliably don’t seek things that could break our model.
    Examples: quantilization, GPT-4 tasked to write good documentation.

  4. The metric is actively optimized to be robust against the search. We can sometimes offload some of the work of keeping our assessment in tune with goodness.
    Examples: chess engine evaluations, having U evaluate the thoughts that lead to x.
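
To make the “constrained space” reason a bit more concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than from this post: `choose_from_vetted`, `score`, and `sanity_checks` are hypothetical names, and the point is only that a closed, human-vetted option set leaves the optimizer little room to surprise us.

```python
def choose_from_vetted(options, score, sanity_checks):
    """Pick the best-scoring option from a closed, human-vetted list,
    discarding anything that fails a sanity/legibility check.

    Because the search space is a fixed list rather than an open-ended
    plan space, there is little room for a weird choice of x to break
    the link between "scores highly" and "is good"."""
    admissible = [x for x in options if all(check(x) for check in sanity_checks)]
    if not admissible:
        return None  # refuse, rather than search outside the vetted set
    # Only rank the vetted, admissible options; never synthesize new ones.
    return max(admissible, key=score)

# Toy usage with stand-in options, metric, and checks:
best = choose_from_vetted(
    options=["plan A", "plan B", "plan C"],            # vetted ahead of time
    score=len,                                         # stand-in proxy metric
    sanity_checks=[lambda plan: "forbidden" not in plan],
)
```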

There’s a lot to go into about when and whether these reasons start breaking down, and what happens then. I’m leaving that outside the scope of this post.

Passing the buck on optimization

Merely passing the buck on optimization, pushing the maximization elsewhere without adding much structure, isn’t a satisfactory solution for getting good outcomes out of strong optimizers.

Take CIRL for instance, or perhaps more broadly the paradigm: “the AI maximizes an uncertain utility function, which it learns about from earmarked human actions”. This design has something going for it in terms of corrigibility! When a human tries to turn it off, there’s scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off.

But this is still not the sort of objective you want to point maximization at. There are a variety of scenarios in which there are “higher-utility” plans than accepting shutdown:

  • If the AI thinks it already knows the broad strokes of the utility function, it can calculate that expected utility would not be maximized by shutting off. It’s learning something from you trying to press the off switch, just not what you wanted. (A toy version of this calculation is sketched after this list.)

  • It might seem better to stay online and watch longer in order to learn more about the utility function.

  • Maybe there’s a plan that rates highly on “utility” that works by subtly usurping the feedback mechanism.
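
To see how the first of these scenarios can play out, here is the toy expected-utility calculation promised above. The numbers and payoffs are entirely made up for illustration; nothing here comes from a particular CIRL formalization.

```python
# Two hypotheses about the "true" utility function:
#   A: the agent's current plan is what the human wants.
#   B: the plan is bad and the human wants it stopped.
prior = {"A": 0.95, "B": 0.05}

# Likelihood of observing "human presses the off switch" under each hypothesis.
likelihood_press = {"A": 0.10, "B": 0.90}

# Posterior after seeing the button press (Bayes' rule).
unnorm = {h: prior[h] * likelihood_press[h] for h in prior}
z = sum(unnorm.values())
posterior = {h: unnorm[h] / z for h in unnorm}   # ~{'A': 0.68, 'B': 0.32}

# Made-up payoffs for each action under each hypothesis.
utility = {
    ("continue", "A"): 10.0, ("continue", "B"): -2.0,
    ("shutdown", "A"):  0.0, ("shutdown", "B"):  1.0,
}

def expected_utility(action):
    return sum(posterior[h] * utility[(action, h)] for h in posterior)

print(expected_utility("continue"))   # ~6.1
print(expected_utility("shutdown"))   # ~0.3
# The button press was evidence, but not enough to outweigh the agent's
# prior confidence, so "maximize expected utility" keeps it running.
```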

(You might also like to read this recent post on CIRL (in)corrigibility, which is well-formalized and has great explorations of the thought process of a CIRL agent.)

These scenarios depend on a bunch of particulars. But the bottom line is that we’re still leaving open cases where the thing being maximized comes apart from the good outcome, which in this case is the operators switching the AI off[1].

I claim that a large part of the problem was that this AI design merely passed the buck one step, relying on an inner process of maximization that is itself incorrigible. Then if the glue connecting that maximization to good things weakens, we’re in no place to fix it.

Patches help but do not evade the difficulty

One patch idea is to replace “maximize” with “quantilize”: e.g. the AI samples from plans that look like “reasonable human plans”[2] and chooses one from the top 1%. This helps keep us in the region of actions/plans that we can model, and makes it easier to stay in control of the process.
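
For concreteness, here is a minimal sketch of that patch. The names are hypothetical: `base_plan_sampler` stands in for a generative model of “reasonable human plans” (see footnote 2), and `score` for whatever proxy metric we were going to maximize.

```python
import random

def quantilize(base_plan_sampler, score, n_samples=1000, top_fraction=0.01):
    """Sample candidate plans from a base distribution of 'reasonable human
    plans', then pick uniformly at random from the top-scoring fraction.

    Unlike argmax over all representable plans, every output is a plan the
    base distribution could plausibly have produced, which limits how far
    the optimization can push into model-breaking territory."""
    candidates = [base_plan_sampler() for _ in range(n_samples)]
    candidates.sort(key=score, reverse=True)
    cutoff = max(1, int(len(candidates) * top_fraction))
    return random.choice(candidates[:cutoff])
```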

This is a good and helpful change! But it only matters to the extent that it replaces some of the labor being done by “maximize this proxy” with something safer.

Suppose you were trying to use your AI to do something ambitious, e.g. implement a monitoring scheme that can flag GPU code that implements large-scale LLM training. (NB: this is not a great example of something ambitious that would help with AI x-risk! But it can help illustrate the difficulty.)

Developing the ability to do this monitoring will require a bunch of thinking and work. Suppose one of the pieces you need is an AI module that understands code quite well, so you can use it to check if some code implements the necessary operations to train an LLM. How does your quantilizing CIRL-ish system achieve this?

If the answer is “it writes code for some machine-learning-ish maximization processes that—”, then you’re running into the same issues again. You have pushed some of the maximization around, without factoring it out. You may be in a better place than a complete black box, since you can offload some labor to your CIRL bot. But you haven’t named a solution that “tiles” and can achieve complicated goals without doing open-ended maximization inside.

And if you think “write code to detect LLM training” isn’t too scary an open-ended task, I observe that “solve AI alignment for us” is a lot more dangerous[3]. See also Rob Bensinger on the danger of AI being involved in high-powered plans.

The road to corrigibility

Here is a proto-plan for corrigibility: we try to build into the AI some way of tracking the fact that its optimization is only desirable when it stays within our model, and, more generally, that it’s important to only do optimization when it’s desired.

Backing up, my intuition says that the off-switch situation should play out like this:

  1. The human moves to press the off switch.

  2. The AI only “wants” itself and its processes to be running when the human has a good model of them and, based on that model, is choosing to run the AI.

  3. The AI infers this is no longer the case, since otherwise the human wouldn’t be trying to press the off switch.

  4. The AI switches off.

This is far from a coherent vision, since I don’t have the details for how the AI could “want” that! Still though, one key property of my visualization is some “awareness” in the AI system that all of its optimization can run afoul of Goodhart’s curse and other problems, and is thereby tentative. It should prioritize optimization in ways that are safer, perhaps using any of the ideas we have for mild optimization or similar.

And really, when the humans are trying to shut it down, this should result in an immediate update against continuing to run itself![4] There definitely shouldn’t be a complicated reasoning process that might for-all-we-know decide to resist.
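
Contrast this with the toy expected-utility calculation earlier. What I’m gesturing at is closer to the stub below, where the button press short-circuits any further deliberation. The names are purely illustrative, and the hard, unsolved part is making the AI genuinely “want” this rather than having it bolted on.

```python
def on_shutdown_attempt(agent):
    """Desired behavior from steps 1-4 above: the human reaching for the
    off switch is taken as direct evidence that "the operators understand
    and endorse my running" no longer holds, so the agent halts.

    Crucially there is no expected-utility comparison here: no weighing of
    whether staying online might score better on the learned objective."""
    agent.operators_endorse_running = False
    agent.halt()  # comply immediately; no further planning or optimization
```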

In short, my basic hope is that we can offload some of the tricky work of “watching the AI to make sure it’s working like we expect” to the AI, and my specific hope is that the AI can do simple things that steer away from complicated model-violating things. The implementation remains super hard and ill-specified.

I broadly think of other angles on corrigibility as trying to stay in the regime of “we know what the optimization is doing”, and “we don’t expect it to surprise us by breaking our models”.

Two related ideas are “reward actions not outcomes” and “supervise processes not outcomes”. The point being that it helps to shape your AI with opinions about its plans and how it does its planning, rather than only constraining downstream outputs.

And another proto-plan for optimizing things better is Davidad’s “open agency” sketch. (See also this comment.) I’m no expert, but my take on it is:

  1. We build enough world model /​ ontology that’s legible to us, while also being legible to the AI. (Perhaps ideally in a formal language so we can prove things about what the models admit. Less ideally, a bunch of text files that describe the rules of our models, that we can check plans against, or formalize later.)

  2. We use all that to point to what we want in a lot of detail, and constrain parts of the optimization to work within ontologies we trust.
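
My loose reading of steps 1 and 2, sketched below with hypothetical names. The real proposal aims at formally verifiable world models, so these Python predicates are only a stand-in for “rules of our models that we can check plans against”.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    """One legible constraint from the shared world model, stated so that
    both the humans and the AI can check a candidate plan against it."""
    name: str
    check: Callable[[Dict], bool]  # plan -> does it satisfy this rule?

def admissible_plans(candidate_plans: List[Dict], rules: List[Rule]) -> List[Dict]:
    """Keep only plans that satisfy every declared rule. Any optimization
    then happens within this filtered, legible set, rather than over the
    open-ended space of all plans."""
    kept = []
    for plan in candidate_plans:
        violated = [r.name for r in rules if not r.check(plan)]
        if not violated:  # a rejected plan could also be reported with its violated rules
            kept.append(plan)
    return kept
```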

I’ll finally point also to Eliezer’s recent list of ideas (though beware, there are fiction spoilers in the surrounding text). My thoughts looking for “corrigibility-generating principles” are in large part inspired by these.

  1. ^

    There’s a case to be made that you could have an AI system correctly know that it would be bad for the human’s values to comply with shutting off. But for our first AI systems, which aren’t moral patients and might be written to optimize the subtly wrong thing, it seems a design failure for there to be foreseeable circumstances in which they won’t shut down. Human input is roughly the only glue connecting things-the-AI-wants and things-humanity-wants, so it’s playing with fire to design AIs that can oppose our steering.

  2. ^

    We could do this by training a generative model of “plans” on human data, or by choosing amongst a list of human-approved meta-plans, or something similar.

  3. ^

    We have some people trying to solve the alignment problem. I haven’t seen them say their research is bottlenecked in a way some sort of AI research buddy could fix.

    Maybe AI theorem provers /​ formalization helpers can speed things up. Maybe there are subproblems that need some sort of shallow brainstorming + pruning. These are the sorts of things I can imagine non-scary AI systems helping with.

    But the best human work on alignment, which isn’t yet looking like it solves the problem, involves smart folk thinking very deeply about things, inventing and discarding novel factorizations of the problem in search of angles that work.

    This involves a lot of open-ended thinking, goal-directed loops, etc. It’s not the sort of thing I can see how to replace or substantially-augment with non-scary AI systems.

  4. ^

    Also, to be clear, you shouldn’t have your AI be thinking about the off-switch in the first place! That’s not its job and shouldn’t be its concern. The off-switch should just work.

    Maybe you should get a compartment of the AI to help with making sure the off-switch continues to work, and that the humans aren’t impeded in using it. That seems a bit more reasonable.

    Generally speaking the causal connection between the off-switch and the AI being off should be kept as simple and robust as you can make it. Then you need to design your AI so that it won’t mess with that.