AI Reading Group Thoughts (1/?): The Mandate of Heaven

My housemate Kelsey “theunitofcaring” has begun hosting an AI reading group in our house. Our first meeting was yesterday evening, and over a first-draft attempt at chocolate macarons, we discussed this article about AI safety and efficiency by Paul Christiano, and various ideas prompted thereby at greater or lesser remove.

One idea that came up is what we decided to call “tipping point AI” (because apparently there are a lot of competing definitions for “transformative” AI). The definition we were using for tipping point AI was “something such that it, or its controller, is capable of preventing others from building AIs”. The exact type and level of capability here could vary—for instance, if it’s built after we’ve colonized Mars (that is, colonized it to an extent such that Martians could undertake projects like building AIs), then a tipping point AI has to be able to project power to Mars in some form, even if the only required level of finesse is lethality. But if it’s built before we’ve colonized Mars, it doesn’t need that reach; it just has to be able to prevent colonization projects in addition to AI projects.

One hypothesis that has been floated in a context such that we are pretty sure it is not anyone’s real plan is that an AI could just destroy all the GPUs on the planet and prevent the manufacture of new ones. This would be bad for Bitcoin mining, video games, and AI projects, but otherwise relatively low-impact. An AI might be able to accomplish this task by coercion, or even by proxy—the complete system of “the AI, and its controller” needs to be able to prevent AI creation by other agents, so the AI itself might only need to identify targets for a controller who already wields enough power to fire missiles or confiscate hardware and chooses to do so in service of this goal, perhaps the US government.

The idea behind creating tipping point AI isn’t that this is where we stop forever. The tipping point AI only has to prevent other agents from building their own in their basements. It eliminates competition. Some features of a situation in which a tipping point AI exists include:

  • The agent controlling the AI can work on more sophisticated second drafts without worrying about someone else rushing to production unsafely.

  • The controlling agent can publish insights and seek feedback without worrying about plagiarism, code forks, etc.

  • They can apply the AI’s other abilities, if any (there will presumably be some, since “prevent AI creation” is not a primitive action—some surveillance capability seems like a minimum to me), to their other problems, perhaps including creating a better AI. Even if this application has economic or other benefits that might attract others to similar solutions by default, the AI will prevent that, so no one will be (productively) startled or inspired into working on AI faster by seeing the results.

However, if you’re an agent controlling a tipping point AI, you have a problem: the bus number* of the human race has suddenly dropped to “you and your cohort”. If anything happens to you—and an AI being of the tipping point variety doesn’t imply it can help you with all of the things that might happen to you—then the AI is leaderless. This, depending on its construction, might mean that it goes rogue and does something weird, that it goes dormant and there’s no protection against a poorly built new AI project, or that it keeps doing whatever its last directive was (in the example under discussion, “prevent anyone from building another AI”). None of these are good states to have obtain permanently.

So you might want to define, and then architect into your AI the definition of, organizational continuity, robustly enough that none of those things will happen.

This isn’t trivial—it’s almost certainly easier than defining human value in general, but that doesn’t mean it’s simple. Your definition has to handle internal schisms, both overt and subtle, ranging from “the IT guy we fired is working for would-be rivals” to “there’s serious disagreement among our researchers about whether to go ahead with Project Turaco, and Frances and Harold are working on a Turaco fork in their garage”. If you don’t want the wrong bus accident (or assassination) to mean that humanity ends, encounters a hard stop in its technological progress, or has its panopticonic meddling intelligence inherited by a random person who chose the same name for their uber-for-spirulina business? Then you need to have a way to pass on the mandate of heaven.

One idea that popped into my head while I was turning over this problem was a code of organizational conduct. This allows the organization to resume after a discontinuity, without granting random people a first-mover advantage at picking up the dropped mantle unless they take it up whole. It’s still a simpler problem than human value in general, but it’s intermediate between that and “define members of a conventional continuous group of humans”. The code has to be something that includes its own decision-making process—if six people across the globe adopt the code simultaneously, they’ll need to resolve conflicts among themselves just as much as the original organization did. You presumably want to incorporate security features that protect both against garage forks of Project Turaco and against ill-intentioned or not-too-bright inheritors of your code.

Other options include:

  • Conventional organizational continuity. You have, perhaps, a board of directors who never share a vehicle, and they have some sort of input into the executives of the organization, and you hope nobody brings the plague to work, and there is some sort of process according to which decisions are made and some sort of process for defaulting if decisions fail to be made.

  • Designated organizational heirs: if your conventional organization fails, then your sister project, who are laying theoretical groundwork but not building anything yet because you have a tipping point AI and you said so, get the mandate of heaven and can proceed. This assumes that you think their chances of achieving value alignment are worse than yours but better than any other candidate’s. This has obvious incentive problems with respect to the other organization’s interest in yours suddenly ceasing to exist.

  • Non-organization-based strategies (a line of succession of individuals). People being changeable, this list would need to be carefully curated and carefully maintained by whoever was ascendant, and it would be at substantial risk of unobserved deception, errors in judgment, or evolution over time of heirs’ interests and capabilities after their predecessors can no longer edit the line of succession. These would all be capable of affecting the long-term future of humanity once the AI changed hands.

  • I’m sure there are things I haven’t thought of.

I don’t have a conclusion; I just wrote this up to capture thoughts I had in response to the meeting, and to let other people who can’t attend still be in on some of what we’re talking and thinking about.

*The number of people who can be hit by a bus before the organization ceases to function