It Can’t Be Mesa-Optimizers All The Way Down (Or Else It Can’t Be Long-Term Supercoherence?)

Epistemic status: After a couple hours of arguing with myself, this still feels potentially important, but my thoughts are pretty raw here.

Hello LessWrong! I’m an undergraduate student studying at the University of Wisconsin-Madison, and part of the new Wisconsin AI Safety Initiative. This will be my first “idea” post here, though I’ve lurked on the forum on and off for close to half a decade by now. I’d ask you to be gentle, but I think I’d rather know how I’m wrong! I’d also like to thank my friend Ben Hayum for going over my first draft and WAISI more broadly for creating a space where I’m finally pursuing these ideas in a more serious capacity. Of course, I’m not speaking for anyone but myself here.

With that said, let’s begin!

The Claim

I think there are—among others—two pretty strong, concerning, and believable claims often made about the risks inherent in AGI.

First, an optimizer searching for models that do well at some goal/function/etc. might come across a model that contains its own internal optimization process aimed at some other, distinct goal. This is, of course, entirely plausible: humans exist, we can fairly represent them as optimizers, they arose from natural selection (which can itself be fairly represented as an optimizer), and the two sets of "goals" do not perfectly align. These are mesa-optimizers.

Second, the values/goals/utility function of sufficiently strong optimizers (e.g., a superintelligent AGI) will tend to cohere. In other words, stronger optimizers will tend toward more strongly optimizing for a single, non-contradictory utility function (likely via expected utility maximization or something like it, though I don't think this post depends on that). This is coherence.

Maybe putting these two claims right next to each other makes what I’m about to suggest very obvious, and maybe it’s already something that has been discussed. However, I’m making this post because it wasn’t obvious to me until I thought it up and these two discussions don’t seem to come up alongside one another very often. While there have been a number of posts for and against the idea of coherence theorems, I haven’t seen anything making this observation explicitly.

To make the claim clear: if sufficiently strong optimizers tend toward optimizing for a coherent utility function, that implies they will tend toward optimization processes that do not spawn adversarial mesa-optimizers as part of their operation. At the very least, they can't be letting those mesa-optimizers win. Otherwise, those mesa-optimizers would divert resources from the top-level process to optimize for something else, which would contradict the premise of coherent utility seeking.

Initial Thoughts

Since this is a very fresh idea to my mind, I’ve written some initial thoughts about how someone might argue against this claim.

My first concern would be that I'm using these abstractions poorly, but perhaps that provides an opportunity to clear up some confusion in the discussion of coherence and mesa-optimization. For instance: we could keep the initial two claims independent of each other by saying that any mesa-optimizer spawned by an optimization process is no longer a "part" of that process. After all, coherence claims are usually formulated in terms of agents with more dominant strategies outcompeting other agents, so throwing in extra claims about internal dynamics may be disruptive or out of scope. However, I still think there is information to be gained here.

Specifically, consider two possibilities:

  • The original optimizer "loses" to or gets fooled by the new mesa-optimizer, eventually causing the overall system to go haywire and optimize for the mesa-optimizer's "goal" over the original "goal". (Equivalently: the future lightcone carries more of the mesa-utility than the original utility, or however you want to put it.)

  • The mesa-optimizer spawns but never gets the opportunity to win against the top-level optimizer. This could be further branched into situations where the two reach an equilibrium, or where the mesa-optimizer eventually disappears, but the point is that the top-level optimization function still gets fulfilled to a significant degree "as intended".

The problem: In scenario 1, the mesa-optimizer is still, by definition, an optimizer, so any general claims about optimization apply to it as well. That means if any optimization process can spawn adversarial mesa-optimizers, so can this one, and then we can return to the two scenarios above.

Either we eventually reach a type of optimizer for which scenario 2 holds indefinitely and no new mesa-optimizer ever wins, or there is always a chance for scenario 1 to occur, and long-term supercoherence is inherently unstable. Adversarial mesa-optimizers could spawn within the system at any moment (or at some set of important moments), producing an eternally shifting battle over what to optimize for, even though every individual optimizer is "coherent" or close to it.
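To make this dichotomy concrete, here is a toy sketch (entirely my own illustrative framing, not an established model): assume a stationary per-step probability `p_takeover` that an adversarial mesa-optimizer spawns and wins (scenario 1), replacing the currently dominant goal. The simulation just counts how many distinct goals ever hold dominance.

```python
import random

def simulate_goal_drift(p_takeover: float, steps: int, seed: int = 0) -> int:
    """Count how many distinct goals hold top-level dominance over a run.

    At each step, the currently dominant optimizer may spawn an
    adversarial mesa-optimizer that wins (scenario 1) with probability
    p_takeover, replacing the dominant goal; otherwise the top-level
    optimizer holds (scenario 2) and the goal is unchanged.
    """
    rng = random.Random(seed)
    goals_seen = 1  # the original goal
    for _ in range(steps):
        if rng.random() < p_takeover:
            goals_seen += 1  # scenario 1: the mesa-optimizer's goal takes over
    return goals_seen

# The original goal survives t steps with probability (1 - p_takeover)**t,
# which tends to zero for any p_takeover > 0. Only p_takeover == 0
# (scenario 2 holding indefinitely) preserves one coherent goal forever.
print(simulate_goal_drift(0.0, 1000))  # 1: stable supercoherence
print(simulate_goal_drift(0.1, 1000))  # many goals: coherence keeps shifting
```

The point of the toy is only the limiting behavior: unless the takeover probability is driven to exactly zero (my scenario 2 holding forever), the identity of the "coherent" goal being optimized for is not stable in the long run.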

Another point of contention could be whether mesa-optimizers occur within the same types of processes as the agents referenced by coherence theorems. One way this could be true is if coherent agents perform nothing that could be classified as internal optimization, but I find that difficult to accept. Even if one did accept it, surely such entities would be outcompeted in the real, complex world by processes that do have internal optimization?

The final concern I have is "does this matter?" I'm still uncertain about this, but because I'm not certain it is unimportant, I still think the idea is worth sharing. One possibility is that any optimization process strong enough to resist all mesa-optimizers is too complex for humans to design or discover before it is too late, and none of the optimization methods available to us will have this property (and still be usable).

The more common issue I keep coming back to is that this claim may be redundant, in the sense that maybe the optimization processes that win against mesa-optimizers necessarily depend on solving alignment, thus allowing them to create new trustworthy optimizers at will. In other words, this could just be a rephrasing of the argument that AIs will also have to solve alignment (though certainly one I've never seen!). For example, if mesa-optimization is primarily driven by a top-level optimizer being unable to score its models such that the best-appearing model is also the one that is best for it across all future scenarios, then solving this problem seems to require knowing with certainty that a particular process will generalize perfectly to your metric off the training distribution, and that just sounds like alignment. By the same token, this may also imply that coherent optimization by an agent isn't possible without its internals having something to say about the alignment problem.

I also started thinking about this a few days after reading this post on how difficult gradient hacking seems. Perhaps such a process is not so far away?

Naturally, even if true, this doesn't immediately solve everything (or much of anything). Still, I think we can agree it would be pretty great to find a process we could trust not to continuously generate adversarial optimizers. If this line of reasoning is correct, one of its outcomes reads like an existence proof of such a process, though it might depend on alignment being possible. The other suggests coherence is never going to be perfect. Either of these worlds feels slightly more manageable to me than one where we have to deal with both problems at once.

Thanks for reading, and I’d appreciate your feedback!