Cooperation with and between AGI’s

Link post

[This post is primarily aimed at a Foresight Institute audience. I don’t know whether the average LW reader will learn much from it.]

This post is mostly a response to the Foresight Institute’s book Gaming the Future, which is very optimistic about AI’s being cooperative. They expect that creating a variety of different AI’s will enable us to replicate the checks and balances that the US constitution created.

I’m also responding in part to Eliezer’s AGI lethalities, points 34 and 35, which say that we can’t survive the creation of powerful AGI’s simply by ensuring the existence of many co-equal AGI’s with different goals. One of his concerns is that those AGI’s will cooperate with each other enough to function as a unitary AGI. Interactions between AGI’s might fit the ideal of voluntary cooperation with checks and balances, yet when interacting with humans those AGI’s might function as an unchecked government that has little need for humans.

I expect reality to be somewhere in between those two extremes. I can’t tell which of those views is closer to reality. This is a fairly scary uncertainty.

AI’s will have better tools for cooperation. That seems likely to mean that AI’s will cooperate more with each other than they cooperate with humans.

The effectiveness of the US constitution depends on politicians having limited ability to collude with each other.

I’m not claiming that differing abilities to cooperate will be as important as the cognitive differences between AI’s and humans. I’m focusing here on cooperation because Foresight focuses on it, and many other discussions of AI neglect it.

[I’ll try to use the term AGI when talking about Eliezer’s model, and AI when talking about Foresight’s model. But please don’t rely too much on that fuzzy distinction. ]

Analogies

Let’s look at some examples of cooperation between unequal groups with differing abilities to cooperate.

1.

Cats have significant skill at training human servants to feed and pet them, and often to provide medical care. That’s a fair amount of success at aligning humans with feline values—much more than I would have expected if I’d tried to make a prediction 1000 years ago.

But humans cooperate to spay cats. They talk to each other in ways that create social pressure to spay cats. Veterinarians coordinate in ways that make it easier to spay cats. Without developing human-level cooperation skills, how could cats influence this process?

2.

There are likely some tensions between traditional foraging societies and newer farming societies.

I’m unsure of the details, but I’ll guess there’s a clash between different notions of property rights. Farmers want to own animals such as sheep that roam somewhat freely, whereas foragers are likely to treat those as unowned until killed.

Farmers likely have some military coordination that foragers struggle to imitate.

3.

The Amish are a stranger example. In some respects they coordinate better than most cultures, e.g. they can negotiate lower medical fees because they can credibly promise not to sue (starting a lawsuit will reliably get you kicked out of any Amish community).

Yet they have relinquished the ability to form groups of more than a few hundred people. They have relinquished pretty much all ability to have professional leaders. This limits their ability to defend their culture against encroachment by the WEIRD culture that surrounds them.

E.g. the Amish struggle to preserve an approach to medical care that’s oriented toward 100-person communities, when most doctors are switching to a system where large bureaucracies are responsible for the care of a sometimes dehumanizing national “community”.

4.

I imagine that kindergarten classes have children who want more cookies than their teachers will allow.

The children could in principle organize to negotiate for more cookies. But a single class is likely too small to have much negotiating power. Teachers and school boards have some key communication tools that children lack. Teachers use those tools to coordinate more powerful organizations than the children are likely to manage, and also to accumulate richer models of child behavior based on a century’s worth of teacher experiences [comparable to what AI’s will accumulate, due to an ability to comprehend larger fractions of the internet than any one human could handle. Children have limited time to build rich shared models of teacher behavior—in a year, they’ll cease to be kindergartners].

Cooperation Between AI’s

How might AI’s cooperate differently from humans?

Most likely in ways I don’t expect. But here are some guesses.

AI’s will communicate with each other using richer media. They’ll likely exchange neural representations directly (as described in Drexler’s QNR paper). I.e. instead of simplifying a messy concept into a single word in order to send it to another AI, they’ll send something like a function that says how to pattern-match the concept.
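To make the contrast concrete, here is a minimal Python sketch (all names hypothetical, with a plain vector standing in for a learned neural representation) of the difference between sending a single word and sending an embedding bundled with a matching rule:

```python
import numpy as np

# A conventional message: the sender collapses a messy concept into one token.
word_message = "fairness"

# A richer, QNR-style message (hypothetical format): the sender ships a point in
# its concept space plus a rule for recognizing instances of that concept.
def make_concept_message(embedding: np.ndarray, threshold: float = 0.8) -> dict:
    def matches(candidate: np.ndarray) -> bool:
        # Cosine similarity against the sender's own representation.
        sim = candidate @ embedding / (np.linalg.norm(candidate) * np.linalg.norm(embedding))
        return sim >= threshold
    return {"embedding": embedding, "matches": matches}

sender_concept = np.random.randn(512)        # stand-in for a learned representation
msg = make_concept_message(sender_concept)

# The receiver tests its own representation against the sender's concept,
# instead of guessing what a single word was supposed to mean.
receiver_candidate = sender_concept + 0.1 * np.random.randn(512)
print(msg["matches"](receiver_candidate))    # likely True: the two concepts roughly agree
```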

Faster response times: I sometimes want to add a copyrighted image to one of my blog posts. Yet I’m deterred by the time needed to figure out whether the copyright holder is willing to give me permission for a reasonable price. AI’s might solve that by standardizing the relevant negotiations so that the blog writer could send the copyright holder a request (including a description of the blog post and bounds on how much the post might be altered in the future) and expect a price within a few seconds.
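As a toy illustration of how such a standardized negotiation might look (nothing here corresponds to an existing standard; all field names and the pricing rule are made up), the two agents could exchange structured messages like these:

```python
from dataclasses import dataclass

# Hypothetical message formats for a machine-speed licensing negotiation.

@dataclass
class LicenseRequest:
    image_id: str
    post_summary: str        # description of the blog post the image would appear in
    expected_audience: int   # rough monthly readers
    max_future_edits: str    # bound on how much the post might later be altered

@dataclass
class LicenseQuote:
    price_usd: float
    terms_url: str
    expires_in_seconds: int

def quote_price(request: LicenseRequest) -> LicenseQuote:
    # A toy pricing rule standing in for whatever policy the copyright
    # holder's agent actually runs.
    price = 5.0 + 0.0001 * request.expected_audience
    return LicenseQuote(price_usd=round(price, 2),
                        terms_url="https://example.com/terms",
                        expires_in_seconds=600)

request = LicenseRequest(image_id="img-123",
                         post_summary="an essay on AI cooperation",
                         expected_audience=20_000,
                         max_future_edits="minor wording changes only")
print(quote_price(request))    # the whole exchange could plausibly run in seconds
```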

From Gaming the Future:

Refusing to hire someone who is judged to have breached a contract, is a powerful cultural technology.

AI’s may expect stricter adherence to contract terms. Will AI’s become reluctant to make contracts with a human who missed a contractual deadline by 5 seconds?

How worried will AI’s be about not being hired by humans? I don’t see how to make a very general prediction. Typical modern humans are sometimes scared of being rejected by the Amish or by cats, rarely by foragers. If AI’s become general purpose agents, I’d expect them to similarly stop relying on human employers.

Merging Utility Functions

Eliezer expects AGI’s will reach agreements that minimize conflicts with each other and ensure that each AGI will get at least as much of what it values as it could have gotten via conflict / competition.

The ideal version of this (at least from the perspective of an AGI) involves the AGI’s all reading each other’s code, negotiating terms that give each AGI as much control over the future as it can expect, and adopting a shared utility function that reflects that agreement.

[See also An Alternative Approach to AI Cooperation and Strategic implications of AIs’ ability to coordinate at low cost, for example by merging. ]
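For concreteness, here is a minimal sketch of one way “adopting a shared utility function” could cash out: a weighted sum of the parties’ utilities, with hypothetical weights standing in for whatever each party could have secured through conflict.

```python
import math

# Toy picture of a merger: the joint utility is a weighted sum of the parties'
# utilities. The weights are stand-ins for bargaining power; real negotiations
# would be far messier.

def merged_utility(utilities, weights):
    """utilities: list of functions from outcome to float; weights: bargaining weights."""
    def joint(outcome):
        return sum(w * u(outcome) for w, u in zip(weights, utilities))
    return joint

# Two agents with diminishing returns over their share of a fixed resource.
u_a = lambda split: math.sqrt(split[0])
u_b = lambda split: math.sqrt(split[1])

weights = [0.75, 0.25]   # e.g. reflecting what each could have secured via conflict
u_joint = merged_utility([u_a, u_b], weights)

# The merged agent picks the split that maximizes the joint utility.
candidates = [(x / 100, 1 - x / 100) for x in range(1, 100)]
print(max(candidates, key=u_joint))   # roughly (0.9, 0.1): the weights show up in the outcome
```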

It probably won’t work quite like this. The AGI’s values will likely not be fully described by an explicit utility function. There may be some uncertainty about whether an AGI is deceiving others about what code it’s running and what data it’s using. Will those constraints cripple negotiations?

On the bright side, AGI’s seem pretty unlikely to develop this much cooperation when they’re at roughly human levels of intelligence. Humans have been heavily selected for abilities to model other humans. AGI’s will likely be designed / selected in ways that de-emphasize developing skills like this.

So I expect it will take some hard-to-predict amount of time for AGI’s to substantially adopt something close to a unitary utility function. But I’m unclear what that time will buy us.

It seems entirely possible that humans will have enough negotiating power to get this shared utility function to reflect some of our values. But the difficulty of analyzing this leaves me quite nervous.

Foresight suggests that a unitary utility function would recreate the problems of Soviet central planning. If that’s intended to be a general rule, it seems clearly false. AI’s acting to maximize a unitary vision can, as needed, have components with a diverse set of competing world models and strategies in ways that avoid Soviet-style failures.

Some corporations come unusually close to having a single goal (maximize profits), achieved via giving employees equity. That does not lead to more centralized planning of how employees should accomplish that goal than we see in organizations whose employees have more diverse goals. If anything, such corporations have less central planning, because they trust employees more.

Religions are another example. Religions can persuade people to adopt a more unified set of goals.

Religions have a very mixed track record as to whether they promote diverse, decentralized strategies. But we have a decent example of a religion that promotes common goals with little central control: Protestants.

Henrich argues that one part of why the industrial revolution happened is that Protestant culture better enabled cooperation with distant strangers, partly by getting people to care about shared goals such as avoiding hell. Protestants seem better than most religions at enabling such cooperation without limiting the decentralized experimentation that the industrial revolution needed.

This religion analogy should remind us that it’s hard to get a shared utility function right, but also that it can be valuable to do so.

Results?

How does this affect human interests?

Foresight says we should be hopeful as long as all interactions are voluntary.

But it’s not obvious what would cause such a rule to be followed universally. It’s not even obvious what qualifies as a voluntary interaction.

Foresight points to a clear general pattern of most people being better off when voluntary cooperation increases. But it’s unclear why that pattern would persuade humans to stop spaying cats. Is it because a simple universal rule is better than a more complex set of rules? Given current human abilities to agree on good complex rules, the answer seems to be yes. But does that apply to AI’s that are better able to evaluate complex rules?

An Amish issue is that some Amish communities object to displaying bright colors on their buggies. Governments want to require them to use bright orange safety triangles. What does the principle of voluntary interaction say about this dispute? I’m guessing it depends on whether roads are considered to be property or a commons, with the best choice for classifying roads likely varying based on considerations such as population density and culture.

Will we be as safe as the Amish are today, when we live in a society where most decisions are made by AI’s? The Amish are currently thriving in North America, but were unable to perpetuate their culture in Europe. So I’d say their situation is fairly tenuous.

Foresight wants a set of institutions such as property rights, cryptocommerce systems, etc., which enable voluntary cooperation in ways that future agents will prefer to join rather than replace. This strategy is clearly valuable even if it’s an incomplete answer. But we seem barely able to implement such institutions well enough for cooperation between Amish and modern Americans. I expect it will be harder to anticipate what kinds of institutions will be appropriate for AI interactions.

The normal operation of an AI generates some unavoidable waste heat. Does that constitute an involuntary interaction with biological humans?

It seems a small issue today. But if the AI population grows large enough, Earth’s temperature will substantially restrict the activity of biological life. Will the AI’s say “tough luck”? Or decide that human safety requires that all humans be uploaded, with humans not being considered mature enough to decide that for themselves?

These are the kind of issues I’d expect to face in a fairly optimistic scenario, where AI abilities aren’t too different from uploaded humans. I don’t find it hard to imagine worse scenarios.

Much depends on details of AI design that are normally glossed over in the kind of high-level discussion that I’ve been responding to so far in this post. I mostly talked above as if AI’s would behave like humans, but smarter / wiser. That’s neither necessary nor desirable. I’ll now shift to focus more on alternatives.

Solutions?

One possibility is that humans generate many AI’s whose utility functions approximate human values, each of which contains random mistakes. In the limit where those mistakes are uncorrelated and an astronomical number of AI’s have equal negotiating power, I expect that to add up to a consensus among AI’s which reliably matches what humans want.

Alas, it seems wildly optimistic to expect those mistakes to be mostly uncorrelated. It’s also somewhat optimistic to expect the AI’s to have equal negotiating power.
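A small simulation shows why the correlation assumption does so much work: independent mistakes average away, while a shared bias (say, from similar training data or a common base model) survives any amount of averaging. The numbers here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0       # stand-in for "what humans actually want" on some question
n_ais = 100_000

# Case 1: each AI's mistake is independent noise around the true value.
independent = true_value + rng.normal(0.0, 0.5, size=n_ais)
print(abs(independent.mean() - true_value))   # tiny: the consensus tracks human values

# Case 2: the mistakes share a common bias, plus the same independent noise.
shared_bias = 0.3
correlated = true_value + shared_bias + rng.normal(0.0, 0.5, size=n_ais)
print(abs(correlated.mean() - true_value))    # about 0.3: averaging never removes the bias
```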

Drexler’s Comprehensive AI Services (CAIS) approach offers more promise. If we can ensure that AI’s have limited goals, and we have many AI’s with differently specialized goals, then maybe AI’s will continue to function mostly as tools, rather than quickly becoming more powerful agents than humans.

CAIS is compatible with current AI trends, and can maybe be followed without needing drastic enforcement measures. But it would be reckless to treat it as the default outcome. I analyze the difficulties in my post Agenty AGI—How Tempting?.

[The extent to which AI researchers are tempted might be influenced by claims that agenty AGI’s are powerful. So Eliezer and Gwern might be recklessly endangering us if we’re near a tipping point as to whether people experiment with agenty AGI’s. On the other hand, they might be scaring people who were tempted to build an agenty AGI out of doing AI development.]

I fear this situation is somewhat like computer security: even when a wise programmer would see that the benefits of good security outweigh the costs, it takes extra thought to get it right, and people will overlook the risks unless pressured to look for them.

Foresight’s approach depends on minimizing some types of collusion. The book gives little advice there, so I’ll turn to Drexler. Section 20 of the CAIS paper is titled “Collusion among superintelligent oracles can readily be avoided”. There’s a narrow meaning of this claim which seems true and very important. But it could easily mislead people into underestimating the risks.

Here are Drexler’s key strategies for avoiding collusion:

C1) To improve the quality of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics, any one of which could disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement both independent and adversarial objectives.
C4) It is natural to apply fixed (hence memory-free) system instantiations to multiple problems, incidentally yielding a series of history-blind, single-move decisions.
C5) It is natural to provide differentiated, task-relevant information to systems solving different problems, typically omitting knowledge of general circumstances.

These all have costs. The costs don’t look prohibitive, and someone with a security mindset might happily pay them. But they aren’t costs that a typical software developer will pay by default. These strategies seem to require widespread concern over AI collusion in order to work well.
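To make C1 through C4 concrete, here is a toy Python sketch of the control structure: independently seeded proposers, critics that can each veto an answer, and fresh memory-free instantiations per task. The “models” are fakes, and nothing here is from Drexler’s paper beyond the structure it paraphrases; only the shape of the pipeline is meant to be informative.

```python
import random

def make_proposer(seed):
    def propose(task):
        rng = random.Random(hash((seed, task)))
        return f"answer-{seed}-{rng.randint(0, 999)}"   # stand-in for a model output
    return propose

def make_critic(seed):
    def criticize(task, answer):
        # A real critic would look for low quality or deception; this placeholder
        # just applies a trivial sanity check.
        return isinstance(answer, str) and len(answer) > 0
    return criticize

def solve(task, n_proposers=3, n_critics=3):
    # C4: build fresh, memory-free instances for every task.
    proposers = [make_proposer(i) for i in range(n_proposers)]
    critics = [make_critic(100 + i) for i in range(n_critics)]

    candidates = [p(task) for p in proposers]             # C1: diverse competing proposals
    accepted = [a for a in candidates
                if all(c(task, a) for c in critics)]      # C2: any one critic can veto
    return accepted[0] if accepted else None              # C3: adversarial filtering

print(solve("summarize the contract"))
```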

The optimistic view is that Drexler’s ideas will buy us time to use superintelligent advice to create a more stable solution. I’ll give that a 50% chance of working.

Polarized Opinions

Why is there so much disagreement on these topics?

A good deal of this disagreement seems to come from wildly differing models of how AI will develop, leading to disagreements about what threats to defend against.

How much diversity should we expect among the most powerful AI’s? That depends a lot on how rapidly AI’s become more powerful.

Some people see lots of flaws in human intelligence, and expect that there are simple ways to make AI’s without those flaws. That implies lots of low hanging fruit, enabling a big first-mover advantage. I find that view partly misleading, in that many of those “flaws” are self-serving biases.

Other people see human reasoning abilities as close enough to ideal that AI’s will only surpass them in narrow domains for the foreseeable future. That implies that the most powerful AI’s will be no more dangerous than a large corporation, and that it will be hard for any one AI to defeat a modest group of other AI’s that ally against it.

Neither of these extremes seems very plausible to me.

Eliezer might well be anthropomorphizing when he concludes that AI’s need human-like agency in order to be powerful.

Foresight sounds like they’re anthropomorphizing when they imagine that an AI world dictator would engage in Soviet-style central planning of the economy.

Both Foresight and Eliezer are in some sense worrying about inequality of power. Foresight focuses on the degree of inequality that has been typical within human societies, whereas Eliezer focuses on inequalities that are more like the differences between humans and chimpanzees. Neither side seems much interested in worrying about the threat model that the other is focused on. Chimpanzees are not helped much by a better balance of power among humans. It’s natural to ignore plans for better relations between co-equal agents if you see a god-like agent on the horizon.

Eliezer and Foresight most likely agree on the enormous risks that are involved in trying to get a benevolent world dictator. Disagreements here are likely to result from differing beliefs about the alternatives.

From Gaming the Future:

From our perspective, any best case scenario arising from this notion is a worst case scenario, the one we must prevent at all costs. Any unipolar takeover of the world is unlikely to be benevolent. Rather than hoping that this unprecedented power over the world will be shaped by “the right kinds of people”, history tells us that powerful positions attract those who want power.

The “prevent at all costs” reminds me of the slogan “live free or die”. It might be a good strategy for uniting people against a foreign invader. But I disapprove of taking it literally enough to conclude that living under a world dictator is as bad as human extinction.

I’d much rather live under an AI world dictator that accurately reflects Putin’s goals than be dead. Putin and I share an enormous number of goals simply by virtue of being human. And to the extent that we’re at risk of having one person write their goals into a world dictator, it’s more likely to be a younger version of Bill Gates or Elon Musk.

Yes, a world dictator is scary. But opinions about that explain very little of the disagreement about AI safety.

As with computer security, the main enemies of AI safety are apathy and confusion, not bad people seeking power.

Another source of differing opinions involves disagreement on how much safety to aim for. Eliezer thinks we need a dramatic increase in safety by the time AI’s exceed human intelligence. Foresight seems willing to accept roughly the same kinds of risks as our ancestors have faced.

[It seems odd that Eliezer and Foresight’s attitudes are nearly the reverse for computer security, with Foresight eager to achieve an ideal solution soon, whereas Eliezer is resigned to leaving that problem for AGI’s to solve.]

Closing Thoughts

Checks and balances between AI’s would at best leave humans in a precarious position. We need additional strategies, such as those that Drexler is exploring.

I imagine that Foresight’s view is oriented a bit too much toward re-fighting the cold war, and Eliezer’s view is influenced a bit too much by science fiction.

I think my analysis in this post has alternated somewhat between using Foresight’s world model and using Eliezer’s. If that makes it sound confusing, that likely reflects me being still somewhat confused.

I will try someday to write a book review of the rest of Gaming the Future. It has some valuable parts that are mostly independent of this post’s topics.

P.S. - see also a loosely related post Multiple AIs in boxes, evaluating each other’s alignment. See also Paul Christiano’s more complex analysis of collusion indicating there’s value to having AI’s with heterogeneous objectives, with some caveats. I’m more worried about the caveats than he seems to be.