Applying Overoptimization to Selection vs. Control (Optimizing and Goodhart Effects—Clarifying Thoughts, Part 3)


Clarifying Thoughts on Optimizing and Goodhart Effects—Part 3

Previous Posts: Re-introducing Selection vs Control for Optimization, What does Optimization Mean, Again?

Following the previous two posts, I’m going to first try to lay out the way Goodhart’s Law applies in the earlier example of rockets, and then explain why this differs between selection and control. (Note: Adversarial Goodhart isn’t explored, because we want to keep the setting sufficiently simple.) This sets up the next post, which will discuss Mesa-Optimizers.

Revisiting Selection vs. Control Systems

Basically everything in the earlier post that used the example process of rocket design and launching is susceptible to some form of overoptimization, in different ways. Interestingly, there seem to be clear places where the different types of overoptimization become important. Before looking at this, I want to revisit the selection-control dichotomy from a new angle.

In a (pure) control system, we cannot sample datapoints without navigating to them. If the agent is an embedded agent, and has sufficient span of control to cause changes in the environment, we cannot necessarily reset and try again. In a selection system, we only sample points in ways that do not affect the larger system. Even when designing a rocket, our very expensive testing has approximately no longer-term effects. (We’ll leave space debris from failures aside, but get back to it below.)
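
To make the distinction concrete, here is a minimal Python sketch (purely illustrative; the function names and structure are my assumptions, not anything from the earlier posts). Selection can evaluate candidates side-effect-free and simply keep the best one; control only reaches states by acting, and every action changes the environment the agent is embedded in, with no reset.

```python
import random

def selection_optimize(evaluate, candidates, n_samples=1000):
    """Selection: evaluating a candidate has no side effects, so we can sample
    freely, compare, and keep only the best candidate seen."""
    sampled = random.sample(candidates, min(n_samples, len(candidates)))
    return max(sampled, key=evaluate)

def control_optimize(step, policy, initial_state, horizon=100):
    """Control: we only reach states by navigating to them, and every action
    changes the (shared) environment -- there is no reset to base conditions."""
    state = initial_state
    trajectory = [state]
    for _ in range(horizon):
        action = policy(state)
        state = step(state, action)  # the world itself is changed by acting
        trajectory.append(state)
    return trajectory
```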

This explains why we potentially care about control systems more than selection systems. It also points to why Oracles are supposed to be safer than other AIs—they can’t directly impact anything, so their outputs are produced in a pure selection framework. Of course, if they are sufficiently powerful and are relied on, the changes made on the basis of their outputs become irreversible, which is why Oracles are not a clear solution to AI safety.

Goodhart in Selection vs. Control Systems

Regressional and Extremal Goodhart are particularly pernicious for selection, and potentially less worrying for control. Regressional Goodhart is always present if we are insufficiently aware of our goals, but in general Causal Goodhart failures seem more critical in control, because the goal pursued in control is often narrower. To keep this concrete, I’ll go through the classes of failure and note how they could occur at each stage of rocket design. To do so, we need to clarify the goals at each stage. Our goal in stage 1 is to find a class of designs and paths to optimize. In stage 2, we build, test, and refine a system. In many ways, this stage is intended to circumvent Goodhart failures, but testing does not always address extremal cases, so our design may still fail.

Regressional Goodhart hits us if there is any divergence between our metric and our actual goal. For example, in stages 1 and 2, the optimization might find an ideal but complex or chaotic path that depends on the exact positions of planets in a multibody system, or a path that is more fuel-efficient than a simpler one but involves excessive G-forces or other dangers. For instance, a gravitational slingshot around the sun might be cheap, but fry or crush the astronauts. Alternatively, the optimization might find a design whose shape does not allow people to fit inside. Each of these impacts goals that were potentially not included in the model. Regressional Goodhart is less common in control in this case, since we kept the mesa-optimizer limited to optimizing a very narrow goal already chosen by the design-optimization.
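
A toy simulation of the regressional case might look like the following (the numbers are made up, and the noise model is just the simplest one that shows the effect): if the metric is the true value plus noise, then selecting hard on the metric reliably picks out a point where the metric overstates the true value.

```python
import random

random.seed(0)

# True value V of each candidate design, and a proxy metric M = V + noise.
true_values = [random.gauss(0, 1) for _ in range(10_000)]
proxy_scores = [v + random.gauss(0, 1) for v in true_values]

# Select the design with the best proxy score.
best = max(range(len(true_values)), key=lambda i: proxy_scores[i])

print("proxy score of selected design:", round(proxy_scores[best], 2))
print("true value of selected design: ", round(true_values[best], 2))
# The true value predictably falls short of the proxy score at the optimum:
# the harder we select on M, the larger the expected gap between M and V.
```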

Extremal Goodhart is always a model failure. It can occur because the model is insufficiently accurate (Model Insufficiency), or because there is a regime change. Regime changes seem particularly challenging in systems that design mesa-optimizers, since I think the mesa-optimizer is narrower in some way than the global optimizer (if not, it would be more efficient to have an executing system rather than a mesa-optimizer).
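
As a sketch of the regime-change case (again with made-up numbers): suppose the proxy was fit on designs within the tested regime, where pushing harder really is better, but the true goal collapses past a threshold the tests never reached. An optimizer over the proxy walks straight into that untested region.

```python
def proxy(x):
    # Model fitted to the regime we actually tested (x up to about 5):
    # within that regime, pushing x higher really is better.
    return x

def true_goal(x):
    # Outside the tested regime the relationship breaks down, e.g. the
    # more aggressive trajectory becomes lethal to the crew.
    return x if x <= 5 else -10.0

candidates = [i * 0.5 for i in range(21)]   # x from 0.0 to 10.0
x_star = max(candidates, key=proxy)          # the optimizer pushes x to 10.0

print("proxy value at the optimum:", proxy(x_star))      # 10.0
print("true goal at the optimum:  ", true_goal(x_star))  # -10.0
```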

Causal Goodhart is by default about an irreversible change. In selection systems, it means that our sampling accidentally broke the distribution. For example, we test many rockets, creating enough space debris to make further tests vulnerable to collisions. We wanted the tests to sample from the space, but we accidentally changed the regime while sampling.
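
A minimal sketch of this, with invented failure rates and debris effects: each test launch that fails leaves debris behind, and accumulated debris raises the failure rate of later tests, so the distribution we are sampling from drifts as a causal consequence of the sampling itself.

```python
import random

random.seed(1)

debris = 0
base_failure_rate = 0.05
failures = []

for test in range(200):
    failure_rate = base_failure_rate + 0.005 * debris  # debris raises risk
    failed = random.random() < failure_rate
    failures.append(failed)
    if failed:
        debris += 1  # a failed test leaves debris behind

early = sum(failures[:50]) / 50
late = sum(failures[-50:]) / 50
print(f"failure rate over the first 50 tests: {early:.2f}")
print(f"failure rate over the last 50 tests:  {late:.2f}")
```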

In the current discussion, we care about metric-goal divergence because the cost of the divergence is high—typically, once we get there, some irreversible consequence happens, as explained above. This isn’t exclusively true of control systems, as the causal Goodhart example shows, but it’s clearly more common in such systems. Once we’re actually navigating and controlling the system, we don’t have any way to reset to base conditions, and causal changes create regime changes—and if these are unexpected, the control system is suddenly in the position of optimizing using an irrelevant model.

And this is a critical fact, because as I’ll argue in the next post, mesa-optimizers are control systems of a specific type, and have some new overoptimization failure modes because of that.