What can the principal-agent literature tell us about AI risk?
This work was done collaboratively with Tom Davidson.
Thanks to Paul Christiano, Ben Garfinkel, Daniel Garrett, Robin Hanson, Philip Trammell and Takuro Yamashita for helpful comments and discussion. Errors our own.
The AI alignment problem has similarities with the principal-agent problem studied by economists. In both cases, the problem is: how do we get agents to try to do what we want them to do? Economists have developed a sophisticated understanding of the agency problem and a measure of the cost of failure for the principal, “agency rents”.
If principal-agent models capture relevant aspects of AI risk scenarios, they can be used to assess their plausibility. Robin Hanson has argued that Paul Christiano’s AI risk scenario is essentially an agency problem, and therefore that it implies extremely high agency rents. Hanson believes that the principal-agent literature (PAL) provides strong evidence against rents being this high.
In this post, we consider whether PAL provides evidence against Christiano’s scenario and the original Bostrom/Yudkowsky scenario. We also examine whether the extensions to the agency framework could be used to gain insight into AI risk, and consider some general difficulties in applying PAL to AI risk.
PAL isn’t in tension with Christiano’s scenario because his scenario doesn’t imply massive agency rents; the big losses occur outside of the principal-agent problem, and the agency literature can’t assess the plausibility of these losses. Extensions to PAL could potentially shed light on the size of agency rents in this scenario, which are an important determinant of the future influentialness of AI systems.
Mapped onto a PAL model, the Bostrom/Yudkowsky scenario is largely about the principal’s unawareness of the agent’s catastrophic actions. Unawareness models are rare in PAL probably because they usually aren’t very insightful. This lack of insightfulness also seems to prevent existing PAL models or possible extensions from teaching us much about this scenario.
There are also a number of more general difficulties with using PAL to assess AI risk, some more problematic than others:
PAL models rarely consider weak principals and more capable agents
PAL models are brittle
Agency rents are too narrow a measure
PAL models typically assume contract enforceability
PAL models typically assume AIs work for humans because they are paid
Overall, findings from PAL do not straightforwardly transfer to the AI risk scenarios considered, so they don’t provide much evidence for or against these scenarios. But new agency models could teach us about the levels of agency rents which AI agents could extract.
PAL and Christiano’s AI risk scenarios
Christiano’s scenario has two parts:
Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. (“Going out with a whimper.”)
Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. (“Going out with a bang,” an instance of optimization daemons.)
Hanson argued that “Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as a result. And not just a bit worse off; we apparently get apocalypse level worse off!”
PAL isn’t in tension with Christiano’s story and isn’t especially informative
We asked Christiano whether his scenario actually implies extremely high agency rents. He doesn’t think so:
On my view the problem is just that agency rents make AI systems collectively better off. Humans were previously the sole superpower and so as a class we are made worse off when we introduce a competitor, via the possibility of eventual conflict with AI who have been greatly enriched via agency rents…humans are better off in absolute terms unless conflict leaves them worse off (whether military conflict or a race for scarce resources). Compare: a rising China makes Americans better off in absolute terms. Also true, unless we consider the possibility of conflict....[without conflict] humans are only worse off relative to AI (or to humans who are able to leverage AI effectively). The availability of AI still probably increases humans’ absolute wealth. This is a problem for humans because we care about our fraction of influence over the future, not just our absolute level of wealth over the short term.
Christiano’s concern isn’t that agency rents will skyrocket because of some distinctive features of the human-AI agency relationship. Instead, “proxies” and “influence seeking” are two specific ways AI interests will diverge from actual human goals. This leads to typical levels of agency rents; PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents.
The main loss occurs later in time and outside of the principal-agent context, due to the fact that these rents eventually lead AIs to wield more total influence on the future than humans. This is bad because, even if humanity is richer overall, we humans also “care about our fraction of influence over the future.” Compared to a world with aligned AI systems, humanity is leaving value on the table, permanently if these systems can’t be rooted out. The biggest potential downside comes from influence-seeking systems, which Christiano believes could make humans worse off absolutely by engaging in violent conflict.
These later failures aren’t examples of massive agency rents (as the term is used in PAL) because failure is not expected to occur when the agent works on the task it was delegated. Rather, the influence-seeking systems become more influential via typical agency rents, and then at some later point use these rents to influence the future, possibly by entering into conflict with humans. PAL studies the size of agency rents which can be extracted, but not what the agents decide to do with this wealth and influence.
Overall, PAL is consistent with AI agents extracting some agency rents, which occurs in both parts of Christiano’s story (and we’ll see next that putting more structure on agency models could tell us more about the level of rent extraction). But it has nothing to say about the plausibility of AI agents using their rents to exert influence over the long term future (Parts I and II) or engage in conflict (Part II).
Extending agency models seems promising for understanding the level of agency rents in Christiano’s scenario
Christiano’s scenario doesn’t rely on something distinctive about the human-AI agency relationship generating higher-than-usual agency rents. But perhaps there is something distinctive and rents will be atypical. In any case, the level of agency rents seems like a crucial consideration: if we think AIs can extract little to no rents, we probably shouldn’t expect them to exert much influence over the future, because agency rents are what make AI rich. Agency models could help give us a better understanding of the size of agency rents in Christiano’s story, and for future AI systems more generally.
The size of agency rents is determined by a number of factors, including the agent’s private information, the nature of the task, the noise in the principal’s estimate of the value produced by the agent, and the degree of competition. For instance, more complex tasks tend to cause higher rents. From The (ir)resistible rise of agency rents:
In the presence of moral hazard, principals must leave rents to agents, to incentivize appropriate actions. The more complex and opaque the task delegated to the agent, the more difficult it is to monitor his actions, the larger his rents.
If, as AI agents become more intelligent, monitoring gets increasingly difficult, or tasks get more complex, then we would expect agency rents to increase.
On the other hand, competitive pressures between AI agents might be greater (it’s easy to copy and run an AI; it’s hard to increase the human workforce by transferring human capital from one brain to another via teaching). This would limit rents:
The agents’ desire to capture rents, however, could be kept in check by market forces and competition among [agents]. If each principal could run an auction with several, otherwise identical, [agents], he could select the agent with the smallest incentive problem, and hence the smallest rent.
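To make the two effects above concrete, here is a minimal sketch of a textbook binary-effort moral hazard problem. It is not taken from the quoted paper: it assumes a risk-neutral agent protected by limited liability (pay cannot be negative), a zero outside option, and a bonus paid only when the task visibly succeeds. The function names and all numbers are hypothetical. The gap between `p_high` and `p_low` stands in for how informative the principal’s monitoring is, and "competition" is modelled, as a further simplifying assumption, as the principal choosing among candidate agents who differ only in `p_low`.

```python
def agency_rent(effort_cost, p_high, p_low):
    """Expected rent left to the agent in a binary-effort moral hazard model.

    Assumptions (illustrative only): the agent is risk neutral, has a zero
    outside option, and is paid a bonus only on success. Working costs the
    agent `effort_cost` and succeeds with probability `p_high`; shirking is
    free and succeeds with probability `p_low`.
    """
    if not 0.0 <= p_low < p_high <= 1.0:
        raise ValueError("need 0 <= p_low < p_high <= 1")
    # Incentive compatibility: (p_high - p_low) * bonus >= effort_cost,
    # so the cheapest bonus that induces effort is:
    bonus = effort_cost / (p_high - p_low)
    # The agent's expected payoff above the (zero) outside option is the rent.
    return p_high * bonus - effort_cost


def rent_with_competition(effort_cost, p_high, candidate_p_lows):
    """Rent when the principal can pick the candidate with the smallest
    incentive problem (here: the lowest rent among the candidates)."""
    return min(agency_rent(effort_cost, p_high, p_low)
               for p_low in candidate_p_lows)


if __name__ == "__main__":
    # Harder monitoring: shirking looks more and more like working.
    for p_low in (0.1, 0.3, 0.6):
        print(p_low, agency_rent(effort_cost=1.0, p_high=0.9, p_low=p_low))
    # Competition: the principal selects among three candidate agents.
    print(rent_with_competition(1.0, 0.9, [0.6, 0.3, 0.1]))
```

In this toy setup the rent equals `effort_cost * p_low / (p_high - p_low)`, so it grows without bound as monitoring becomes uninformative and shrinks when the principal can select the agent who is easiest to monitor. How far either effect carries over to AI agents depends on exactly the modelling choices discussed in the text.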
Modelling the most relevant factors in an agency model seems like a tractable research question (we discuss some potential difficulties below). Economists have only just started thinking about AI, and there doesn’t seem to be any work studying rent extraction by AI agents.
PAL and AI risk from “accidents”
Ben Garfinkel has called the class of risks most associated with Bostrom and Yudkowsky risks from “accidents”. Garfinkel characterises the general story in the following terms:
“First, the author imagines that a single AI system experiences a massive jump in capabilities. Over some short period of time, a single system becomes much more general or much more capable than any other system in existence, and in fact any human in existence. Then given the system, researchers specify a goal for it. They give it some input which is meant to communicate what behavior it should engage in. The goal ends up being something quite simple, and the system goes off and single-handedly pursues this very simple goal in a way that violates the full nuances of what its designers intended.” Importantly, “At the limit you might worry that these safety failures could become so extreme that they could perhaps derail civilization on the whole.”
These catastrophic accidents constitute the main worry.
If the risk scenario is adequately represented by a principal-agent problem, agency rents extracted by AI agents can be used to measure the cost of misalignment. This time agency rents are a better measure, because failure is expected to occur when the agent works on the task it was delegated. The scenario implies very high agency rents, with the principal being made much worse off because he delegated the task to the agent.
As Garfinkel’s nomenclature suggests, this story is about the designers being caught by surprise, not anticipating the actions the AI would take. The Wikipedia synopsis of Superintelligence also emphasizes that something unexpected occurs: “Solving the control problem is surprisingly difficult because most goals, when translated into machine-implementable code, lead to unforeseen and undesirable consequences.” In other words, the principal is unaware of some specific catastrophically harmful actions that the agent can take to achieve its goal. This could be because they incorrectly believe that the system doesn’t have certain capabilities, or they don’t foresee that certain actions satisfy the agent’s goal, as with perverse instantiation. Due to this, the agent takes actions that greatly harm the principal, at great benefit to herself.
PAL doesn’t tell us much about AI risk from accidents
Hanson’s critique was aimed at Christiano’s scenario, but it could equally apply to this one. Is PAL at odds with this scenario?
As an AI agent becomes more intelligent, its action set will expand as it thinks of new and sometimes unanticipated actions to achieve its goals. This may include catastrophic actions that the principal is not aware of. PAL can’t tell us what these actions will be, nor if the principal will be aware of them.
Instead, the vast majority of principal-agent models assume that the principal understands the environment perfectly, including perfect knowledge of the agent’s action set, while the premise of the accident scenario is that the principal is unaware of a catastrophic action that the agent could take. Because the principal’s unawareness is central, these models assume, rather than show, that this source of AI risk does not exist. They therefore don’t tell us much about the plausibility of AI accidents.
Microeconomist Daniel Garrett expressed this point nicely. We asked him about a hypothetical example, slightly misremembered from Stuart Russell’s book, concerning an advanced climate control AI system. He replied:
You can easily write down a model where the agent is rewarded according to some outcome, and the principal isn’t aware the outcome can be achieved by some action the principal finds harmful. In your example, the outcome is the reduction of CO2 emissions. If the principal thinks carbon sequestration is the only way to achieve this, but doesn’t think of another chemical reaction option which would indirectly kill everyone, she could end up providing incentives to kill everyone. The fact this conclusion is so immediate may explain why this kind of unawareness by the principal is given little attention in the literature. The principal-agent literature should not be understood as saying that these kinds of incentives with perverse outcomes cannot happen. (our emphasis)
PAL models do typically have modest agency rents; they typically don’t model the principal as being unaware of actions with catastrophic consequences. But this is the situation discussed by proponents of AI accident risk, so we can’t infer much from PAL except that such a situation has not been of much interest to economists.
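The kind of model Garrett describes can be sketched in a few lines, which also shows why its conclusion is "immediate". The sketch below is purely illustrative: the `Action` fields, the payoff numbers, and the action names are all hypothetical, and the principal’s unawareness is simply hard-coded as a smaller action set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    achieves_outcome: bool  # does the action hit the rewarded target?
    agent_cost: float       # cost of the action to the agent
    principal_harm: float   # side-effect harm to the principal


def agent_choice(actions, reward):
    """The agent picks the action with the highest reward net of its own cost."""
    def payoff(action):
        return (reward if action.achieves_outcome else 0.0) - action.agent_cost
    return max(actions, key=payoff)


def principal_payoff(action, outcome_value, reward):
    """Value of the outcome minus the reward paid, minus any side-effect harm."""
    gained = (outcome_value - reward) if action.achieves_outcome else 0.0
    return gained - action.principal_harm


# Hypothetical numbers. The principal only knows about the first two actions.
do_nothing = Action("do_nothing", False, 0.0, 0.0)
sequestration = Action("sequestration", True, 5.0, 0.0)
harmful_reaction = Action("harmful_reaction", True, 1.0, 1000.0)

perceived_actions = [do_nothing, sequestration]
true_actions = [do_nothing, sequestration, harmful_reaction]

reward = 6.0          # just enough to make sequestration worthwhile
outcome_value = 100.0  # what the principal thinks the outcome is worth

predicted = agent_choice(perceived_actions, reward)  # what the principal expects
actual = agent_choice(true_actions, reward)          # what the agent really does

if __name__ == "__main__":
    print(predicted.name, principal_payoff(predicted, outcome_value, reward))
    print(actual.name, principal_payoff(actual, outcome_value, reward))
```

The catastrophic result follows directly from the assumed action set and the assumed gap between `perceived_actions` and `true_actions`; the model itself adds nothing about whether such a gap is likely.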
Extending agency models doesn’t seem promising for understanding AI risk from “accidents”
Most PAL models don’t include the kind of unawareness needed to model the accident scenario, but extensions of this sort are certainly possible. However, we suspect trying to model AI risk in this way wouldn’t be fruitful, for three main reasons.
Firstly, as Daniel Garrett suggests, we suspect the assumptions about the principal’s unawareness of the agent’s action set would imply the action chosen by the agent, and its consequences for the principal, in a fairly direct and uninteresting way. There is a (very) small sub-literature on unawareness in agency problems where one can find models like this. In one paper, a principal hires an agent to do a work task, but isn’t aware that the agent can manipulate “short-run working performance at the expense of the employer’s future benefit.” The agent “is better off if he is additionally aware that he could manipulate the working performance,” and “in the post-contractual stage, [the principal] is hurt by the manipulating action of [the agent].” However, the model didn’t reveal anything unexpected about the situation, and the outcome was directly determined by the action set and unawareness assumptions.
Secondly, the major source of the uncertainty surrounding accident risk concerns whether the principal will be unaware of catastrophic agent actions. The agency literature can’t help us reduce this uncertainty as the unawareness is built into models’ assumptions. For instance, AI scientist Yann LeCun thinks that harmful actions “are easily avoidable by simple terms in the objective”. If LeCun implemented a superintelligent AI in this way, agency models couldn’t tell us whether he had correctly covered all bases.
Lastly, the assumptions about the agent’s action set would be highly speculative. We don’t know what actions superintelligent systems might take to pursue their goals. Agency models must make assumptions about these actions, and we don’t know what these assumptions should be.
In short, the uncertainty pertains to the assumptions of the model, not the way the assumptions translate into outcomes. PAL does not, and probably cannot, provide much evidence for or against this scenario.
General difficulties with using PAL to assess AI risk
We’ve discussed the most relevant considerations regarding what PAL can tell us about two specific visions of AI risk. We now discuss some difficulties relevant to a broader set of possible scenarios (including those just examined). We list the difficulties from most serious to least serious.
PAL models rarely consider weak principals and more capable agents
AI risk scenarios typically involve the AI being more intelligent than humans. The type of problems that economists study usually don’t have this feature, and there seem to be very few models where the principal is weaker than the agent. Despite extensive searching, including talking to multiple contract theorists, we were only able to find two papers with a principal who is more boundedly rational than the agent. This is perhaps not so surprising given that bounded-rationality models are relatively rare, and when they do exist, they tend to bound both the principal and the agent in the same way, or have the principal more capable. The latter is because such a setup is more relevant to typical economic problems, e.g. “exploitative” contracting studies the mistakes made by an individual (the agent) when interacting with a more capable firm (the principal).
Microeconomist Takuro Yamashita agrees:
Most economic questions related to bounded rationality explored in the principal-agent literature are appropriately modelled by a bounded agent. It’s certainly possible to bound the principal, but by and large this hasn’t been done, just because of the nature of the questions that have been asked.
A recent review of Behavioural Contract Theory also finds that such models are rare:
In almost all applications, researchers assume that the agent (she) behaves according to one psychologically based model, while the principal (he) is fully rational and has a classical goal (usually profit maximization).
There doesn’t seem to be, in Hanson’s terms, a “large (mostly economic) literature on agency failures” with an intelligence gap relevant to AI risk.
PAL models are brittle
PAL models don’t model agency problems in general. They consider very specific agency relationships, studied in highly structured environments. Conclusions can depend very sensitively on the assumptions used; findings from one model don’t necessarily generalise to new situations. From the textbook Contract Theory:
The basic moral hazard problem has a fairly simple structure, yet general conclusions have been difficult to obtain...Very few general results can be obtained about the form of optimal contracts. However, this limitation has not prevented applications that use this paradigm from flourishing...Typically, applications have put more structure on the moral hazard problem under consideration, thus enabling a sharper characterization of the optimal incentive contract. (our emphasis)
Similar reasoning applies in adverse selection models where the outcome is very sensitive to the mapping between effort and outcomes. Given an arbitrary problem, the optimal incentives can look like anything.
The agency problems studied by economists are typically quite different to the scenarios envisaged by AI risk proponents. Therefore, because of the brittleness of PAL models, we shouldn’t be too surprised if the imagined AI risk outcomes aren’t present in the existing literature. PAL, in its current form, might just not be of much use. Further, we should not expect there to be any generic answer to the question “How big are AI agency rents?”: the answer will depend on the specific task the AI is doing and a host of other details.
Agency rents are too narrow a measure
As we’ve seen, AI risk scenarios can include bad outcomes that aren’t agency rents, but that we nevertheless care about. When applying PAL to AI risk, care must be taken to distinguish between rents and other bad outcomes, and we cannot assume that a bad outcome necessarily means high rents.
PAL models typically assume contract enforceability
Stuart Armstrong argued that Hanson’s critique doesn’t work because PAL assumes contract enforceability, and with advanced AI, institutions might not be up to the task. Indeed, contract enforceability is assumed in most of PAL, so it’s an important consideration regarding the applicability of PAL models to AI scenarios more broadly.
The assumption isn’t plausible in pessimistic scenarios where human principals and institutions are insufficiently powerful to punish the AI agent, e.g. due to very fast take-off. But it is plausible when AIs are about as smart as humans, and in scenarios where powerful AIs are used to enforce contracts. Furthermore, if we cannot enforce contracts with AIs then people will promptly realise this and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.
There is a smaller sub-literature on self-enforcing contracts (seminal paper). Here contracts can be self-enforced because both parties have an interest in interacting repeatedly. We think these probably won’t be helpful for understanding situations without contract enforceability, because in worlds where contracts aren’t enforceable because of advanced AI, contracts likely won’t be self-enforcing either. If AIs are powerful enough that institutions like the police and military can’t constrain them, it seems unlikely that they’d have much to gain from repeated cooperative interactions with human principals. Why not make a copy of themselves to do the task, coerce humans into doing it, or cooperate with other advanced AIs?
PAL models typically assume AIs work for humans because they are paid
In reality AIs will probably not receive a wage, and instead work for humans because that is their default behaviour. We think changing this would probably not make a big difference to agency models, because the wage could be replaced by other resources the AI cares about. For instance, AI needs compute to run. If we substitute “compute” for “wage”, the agency rents that the agent extracts are additional compute that it can use for its own purposes.
There is a sub-literature on Optimal Delegation that does away with wages. This literature focuses on the best way to restrict the agent’s action set. For AI agents, this is equivalent to AI boxing. We don’t think this literature will be helpful; PAL doesn’t study how realistic it is to box AI successfully, it just assumes it’s technologically possible. It therefore isn’t informative about whether AI boxing will work.
Conclusion
There are similarities between the AI alignment and principal-agent problems, suggesting that PAL could teach us about AI risk. However, the situations economists have studied are very different to those discussed by proponents of AI risk, meaning that findings from PAL don’t transfer easily to this context. There are a few main issues. The principal-agent setup is only a part of AI risk scenarios, making agency rents too narrow a metric. PAL models rarely consider agents more intelligent than their principals and the models are very brittle. And the lack of insight from PAL unawareness models severely restricts their usefulness for understanding the accident risk scenario.
Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.
Thanks to Wei Dai for pointing out a previous inaccuracy
Agency rents are about e.g. working vs shirking. If the agent uses the money she earned to buy a gun and later shoot the principal, clearly this is very bad for the principal, but it’s not captured by agency rents.
It’s not totally clear to us why we should care about our fraction of influence over the future, rather than the total influence. Probably because the fraction of influence affects the total influence, influence being zero-sum and resources finite.
It wasn’t clear to us from the original post, at least in Part 1 of the story with no conflict, that humans are better off in absolute terms. For instance, wording like “over time those proxies will come apart” and “People really will be getting richer for a while” seemed to suggest that things are expected to worsen. Given this, Hanson’s interpretation (that Christiano’s story implied massive agency rents) seems reasonable without further clarification. Ben Garfinkel mentioned an outside-view measure which he thought undermined the plausibility of Part 1: since the industrial revolution we seem to have been using more and more proxies, which are optimized for more and more heavily, but things have been getting better and better. So he also seems to have understood the scenario to mean things get worse in absolute terms.
Clarifying what it means for an AI system to earn and use rents also seems important, helping us make sure that the abstraction maps cleanly onto the practical scenarios we are envisaging. Relatedly, what traits would an AI system need to have for it to make sense to think of the system as “accumulating and using rents”? Rents can be cashed out in influence of many different kinds (a human worker might get a higher wage, or more free time), and what ends up occurring will depend on the capabilities of the AI systems. Concretely, money can be saved in a bank account, people can be influenced, or computer hardware can be bought and run. One example of an obvious capability constraint for AI: some AI systems will be “switched off” after they are run, limiting their ability to transfer rents through time. As AI agents will (initially) be owned by humans, historical instances of slaves earning rents seem worth looking into.
Although his scenario is more plausible if a smarter agent extracts more agency rents.
Hanson and Christiano agree on this point. Hanson: “Just as most wages that slaves earned above subsistence went to slave owners, most of the wealth generated by AI could go to the capital owners, i.e. their slave owners. Agency rents are the difference above that minimum amount.” Christiano: “Agency rents are what makes the AI rich. It’s not that computers would “become rich” if they were superhuman, and they just aren’t rich yet because they aren’t smart enough. On the current trajectory computers just won’t get rich.”
One limitation is that rents are the cost to the principal, whereas the accident scenario has costs for all humanity. This distinction isn’t especially important because in the accident scenario the outcome for the principal is catastrophic (i.e. extremely high agency rents), and this is what is potentially in tension with PAL. Nonetheless, we should keep in mind that the total costs of this scenario are not limited to agency rents, just as in Christiano’s scenario.
Perhaps a more realistic framing: the principal is aware that there’s some probability that the agent will take an unanticipated catastrophic action, without knowing what that action might be. Under competitive pressures, maybe in a time of war, it could be beneficial for the principal to delegate (in expectation) despite significant risk, while humanity is made worse off (in expectation). This, of course, would be modelled quite differently to the accident AI risk we consider in the text, and we suspect that economic models would confirm that principals would take the risk in sufficiently competitive scenarios. These models would focus on negative externalities of risky AI development, something more naturally studied in domains like public economics rather than with agency theory. In any case, we focus here on the more traditional AI risk framing along the lines of “you think you have the AI under control, but beware, you could be wrong”.
AI accident risk will be large when the AI agent thinks of new actions that i) harm the principal, ii) further the agent’s goals, and iii) the principal hasn’t anticipated.
This is because claims about the actions available to the agent and the principal’s awareness are part of PAL models’ assumptions. We discuss this more below.
The correct example: “If you prefer solving environmental problems, you might ask the machine to counter the rapid acidification of the oceans that results from higher carbon dioxide levels. The machine develops a new catalyst that facilitates an incredibly rapid chemical reaction between ocean and atmosphere and restores the oceans’ pH levels. Unfortunately, a quarter of the oxygen in the atmosphere is used up in the process, leaving us [humans] to asphyxiate slowly and painfully.”
I.e. the principal’s rationality is bounded to a greater extent than the agent’s
In the model in “Moral Hazard With Unawareness” either the principal or the agent’s rationality can be bounded
As argued above, we don’t think contract enforceability is the main reason Hanson’s critique of Christiano fails; agency rents are just not unusually high in his scenario.
From Contract Theory: “The benchmark contracting situation that we shall consider in this book is one between two parties who operate in a market economy with a well-functioning legal system. Under such a system, any contract the parties decide to write will be enforced perfectly by a court, provided, of course, that it does not contravene any existing laws.”
Thanks to Ben Garfinkel for pointing this out.
Robin Hanson pointed out to us that when thinking about strange future scenarios, we should try to think about similar strange scenarios that we have seen in the past (we are very sympathetic to this, despite our somewhat skeptical position regarding PAL). With this in mind, another field which seems worth looking into is Security, especially military security. National leaders have been assassinated by their guards; kings have been killed by their protectors. These seem like a closer analogue to many AI risk scenarios than the typical PAL setup. It seems important to understand what the major risk factors are in these situations, how people have guarded against catastrophic failures, and how this translates to cases of catastrophic AI risk.