Multi-Agent Overoptimization, and Embedded Agent World Models
I think this expands on the points being made in the recently completed Garrabrant / Demski Embedded Agency sequence. It also connects a paper I wrote recently, which mostly discusses non-AI risks from multiple agents and expands on last year’s work on Goodhart’s Law, back to the deeper questions that MIRI is considering. Lastly, it tries to point out a bit of how all of this connects to some of the other streams of AI safety research.
We don’t know how to make agents contain a complete world model that includes themselves. That’s a hard enough problem, but the problem could get much harder—and in some applications it already has. When multiple agents need to have world models, the discrepancy between the model and reality can have some nasty feedback effects that relate to Goodhart’s law, which I am now referring to more generally as overoptimization failures.
In my recent paper, I discuss the problem when multiple agents interact, using poker as a motivating example. Each poker-playing agent needs to have a (simplified) model of the game in order to play (somewhat) optimally. Reasonable heuristics and machine learning already achieve super-human performance in “heads-up” (2-player) poker. But the general case of multi-player poker is a huge game, so each agent has to work with a simplified version of it.
This is exactly the case where we can transition just a little bit from the world of easy decision theory, which Abram and Scott point out allows modeling “the agent and the environment as separate units which interact over time through clearly defined i/o channels,” to the world of agents that are not yet embedded, but that do interact. This moves just a little bit in the direction of “we don’t know how to do this.”
This partial transition happens because the agent must have some model of the decision process of the other players in order to play strategically. In that model, agents need to represent what those players will do not only in reaction to the cards, but in reaction to the bets the agent places. To do this optimally, they need a model of each other player’s (perhaps implicit) model of the agent. And building models of other players’ models seems very closely related to work like Andrew Critch’s paper on Löb’s Theorem and Cooperation.
That explains why I claim that building models of complex agents that have models of you that then need models of them, etc. is going to be related to some of the same issues that embedded agents face, even without the need to deal with some of the harder parts of self-knowledge of agents that self-modify.
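To make the regress concrete, here is a minimal sketch using level-k reasoning. It is not taken from the paper, and it swaps poker for a much simpler game that I am assuming purely for illustration: the "guess 2/3 of the average" contest. A level-0 player guesses naively, and a level-k player best-responds to a population of level-(k-1) players, ignoring its own effect on the average. The function name and default values are hypothetical.

```python
def level_k_guess(k: int, level0_guess: float = 50.0, factor: float = 2 / 3) -> float:
    """Guess of a player who models everyone else as reasoning at level k-1.

    Assumption: in a "guess factor * average" game, the best response to a
    population that all guesses g is factor * g (own effect on the average
    is ignored).
    """
    if k == 0:
        # Level 0 has no model of the other players at all.
        return level0_guess
    # Level k consults its model of the others, who in turn consult their
    # (one level shallower) model of everyone else, and so on.
    return factor * level_k_guess(k - 1, level0_guess, factor)


# Each extra level of "my model of your model of me" changes the action.
guesses = [level_k_guess(k) for k in range(6)]
```

The point of the sketch is only that the chosen action depends on how deep the nested model goes, and that no finite depth reaches the fixed point (a guess of 0) that a fully consistent model of the other players would imply.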
Game theory “answers” this, but it cheats.
The obvious way to model interaction is with game theory, which makes a couple of seemingly innocuous simplifying assumptions. The problem is that these assumptions are impossible to satisfy in practice.
The first is that the agents are rational and Bayesian. But as Chris Sims pointed out, there are no real Bayesians. (“Not that there’s something better out there.”)
• There are fewer than 2 truly Bayesian chess players (probably none).
• We know the optimal form of the decision rule when two such players play each other: Either white resigns, black resigns, or they agree on a draw, all before the first move.
• But picking which of these three is the right rule requires computations that are not yet complete.
This is (kind of) a point that Abram and Scott made in the sequence in disguise—that world models are always smaller than the agents.
The second assumption is that agents have common knowledge of both agents’ objective functions. (Ben Pace points out how hard that assumption is to realize in practice. And yes, you can avoid this assumption by specifying that they have uncertainty of a defined form, but that just kicks the can down the road—how do you know what distributions to use? What happens if the agent’s true utility is outside the hypothesis space?) If the models of the agents must be small, however, it is possible that they cannot have a complete model of the other agent’s preferences.
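As a hedged illustration of the hypothesis-space problem, the sketch below has an agent doing ordinary Bayesian updating over two candidate models of an opponent, neither of which is correct. Everything here is an assumption made for the example rather than something from the paper: the opponent's preferences are stood in for by a single number (how often it bets), the candidate values 0.2 and 0.4 are arbitrary, and the observed data is a fixed sequence in which the opponent bets 70% of the time.

```python
import math


def posterior(hypotheses, observations):
    """Posterior over candidate bet probabilities, from a uniform prior.

    hypotheses: candidate values for P(opponent bets), each strictly in (0, 1).
    observations: iterable of booleans, True when the opponent bet.
    """
    # A uniform prior contributes the same constant to every hypothesis,
    # so the accumulated log-likelihoods can start at zero.
    log_post = {h: 0.0 for h in hypotheses}
    for bet in observations:
        for h in hypotheses:
            log_post[h] += math.log(h if bet else 1.0 - h)
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_post.values())
    weights = {h: math.exp(lp - top) for h, lp in log_post.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# The opponent actually bets 70% of the time, which neither hypothesis allows.
observations = ([True] * 7 + [False] * 3) * 10
belief = posterior([0.2, 0.4], observations)
```

After 100 observations nearly all of the posterior mass sits on 0.4, the less wrong of the two candidates. The agent ends up very confident in a model that is still false, and nothing in the update signals that the truth lies outside the space it was given.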
It’s a bit of a side-point for the embedded agents discussion, but breaking this second assumption is what allows for a series of overoptimization exploitations explored in the new paper. Some of these, like accidental steering and coordination failures, are worrying for AI alignment because they pose challenges even for cooperating agents. Others, like adversarial misalignment, input spoofing and filtering, and goal co-option, arise only in the adversarial case, but can still matter if we are concerned about subsystem alignment. And the last category, direct hacking, gets into many of the even harder problems of embedded agents.
Embedded agents, exploitation and ending.
As I just noted, one class of issues that embedded agents have that traditional dichotomous agents do not is direct interference. If an agent hacks the software another agent is running on, there are many obvious exploits to worry about. This can’t easily happen with a defined channel. (But to digress, such exploits still do happen in defined channels. This is because people without security mindset keep building Turing-complete languages into the communication interfaces, instead of doing #LangSec properly.)
But for embedded agents the types of exploitation we need to worry about are even more general. Decision theory with embedded world models is obviously critical for Embedded Agency work, but I think it’s also critical for value alignment, since “goal inference” in practice requires inferring some baseline shared human value system from incoherent groups. (Whether or not the individual agents are incoherent.) This is in many ways a multi-agent cooperation problem—and even if we want to cooperate and share goals, and we already agreed that we should do so, cooperation can fall prey to accidental steering and coordination failures.
Lastly, Paul Christiano’s Iterated Amplification approach, which in part relies on small agents cooperating, seems to need to deal with this even more explicitly. But I’m still thinking about the connections between these problems and the ones his approach takes on. I’ll wait until his sequence is finished, and until I have had time to think about it, before commenting further.