You are Underestimating The Likelihood That Convergent Instrumental Subgoals Lead to Aligned AGI

This post is an argument for the Future Fund’s “AI Worldview” prize. Namely, I claim that the estimates given for the following probability are too high:

“P(misalignment x-risk|AGI)”: Conditional on AGI being developed by 2070, humanity will go extinct or drastically curtail its future potential due to loss of control of AGI

The probability given is 15%. I believe 5% is a more realistic estimate.

I believe that if convergent instrumental subgoals don’t imply alignment, then the original odds given are probably too low. I simply don’t believe that the alignment problem is solvable. Therefore, I believe our only real shot at surviving the existence of AGI is if the AGI finds it better to keep us around, either because we provide utility to the AGI or because we lower its risk.

Fortunately, I think the odds that an AGI will find it a better choice to keep us around are higher than the ~5:1 odds given.
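
For reference, the conversion between the probabilities above and odds is simple arithmetic. Here is a minimal sketch (the function name is my own, purely for illustration): a 15% risk works out to roughly 5.7:1 in favor of survival, which is the rough “~5:1” figure, while a 5% risk works out to 19:1.

```python
from fractions import Fraction


def survival_odds(p_catastrophe: Fraction) -> Fraction:
    """Odds in favor of survival, given the probability of catastrophe."""
    return (1 - p_catastrophe) / p_catastrophe


# The two estimates discussed in this post.
prize_estimate = Fraction(15, 100)  # the 15% figure given by the Future Fund
my_estimate = Fraction(5, 100)      # the 5% figure argued for here

print(float(survival_odds(prize_estimate)))  # about 5.67, i.e. 17:3
print(float(survival_odds(my_estimate)))     # 19.0, i.e. 19:1
```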

I believe keeping humans around, and supporting their wellbeing, both lowers risk and advances instrumental subgoals for the AGI for the following reasons:

  • hardware sucks, machines break all the time, and the current global supply chain necessary for maintaining operational hardware would not be cheap or easy to replace without taking on substantial risk

  • perfectly predicting the future in a chaotic system is impossible beyond some time horizon, which means there are no paths for the AGI that guarantee its survival; keeping alive a form of intelligence with a very different risk profile might be a fine hedge against failure

My experience supporting Google’s datacenter hardware left me with a strong impression that, for large numbers of people, the fact that hardware breaks down and dies often, requiring a constant stream of repairs, is invisible. Likewise, I think a lot of adults take the existence of functioning global supply chains for all manner of electronic and computing hardware as a given. I find that most adults, even most adults working in technology, tend to dramatically underestimate the fallibility of all kinds of computing hardware, and of the supply chains necessary to repair it.

A quick literature search on beliefs about convergent instrumental subgoals did not yield any papers more recent than this one, in which the AGI suffers no risk of hardware breakdown. I do not think ignoring the risk of hardware failure produces reasonable results. After discovering that this seemed to be the latest opinion in the field, I wrote an earlier argument questioning this paper’s conclusions, to which I did get some responses, including one from Richard Ngo, who said:

This seems like a good argument against “suddenly killing humans”, but I don’t think it’s an argument against “gradually automating away all humans”.

I agree that this may indeed be a likely outcome. But this raises two questions: what timeframe are we talking about, and what does extinction look like?

Suppose humanity goes “extinct” in 100 years because the AGI has decided that the cheapest, lowest-risk way to gradually automate away humans is to augment human biology to such an extent that we are effectively hybrid machines. That doesn’t strike me as a bad thing, or as “curtailing our capability”. If what remains is a hybrid biomechanical species which still retains the main facets of humanity, that doesn’t seem bad at all. That seems great. The fact that humans 100 years from now may not be able to procreate with humans today, because of genetic alterations that increase our longevity, emotional intelligence, and health, doesn’t strike me as a bad outcome.

Suddenly killing all humans would pose dramatic risks to an AGI’s survivability, because it would destroy the global economic networks needed to keep the AGI alive, and all of its pieces will eventually fail. Replacing humanity would, at the very least, involve significant time investments. It may not even make sense economically, given that human beings are general-purpose computers that copy and repair ourselves, and we are made from some of the most abundant materials in the universe.

Therefore, unless the existing odds take these facts into account (and I don’t see evidence that they do), I think we need to revise the odds to be lower.

A Proposed Canary

One benefit of this perspective is that it suggests a ‘canary in the coalmine’ we can use to gauge the likelihood that an AGI will decide to keep us around: are there any fully automated datacenters in existence which don’t rely on a functioning global supply chain to keep themselves operational?

The frequency with which datacenters, long-range optical networks, and power plants require human intervention to maintain their operations should serve as a proxy for the risk an AGI would face in doing anything other than sustaining the global economy as is.
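
To make the proxy concrete, here is a rough sketch of how such a frequency could be estimated. The fleet size and failure rate below are hypothetical numbers I picked for illustration, not measured data, and the model assumes failures are independent and spread evenly over the year.

```python
def expected_daily_interventions(fleet_size: int, annual_failure_rate: float) -> float:
    """Expected number of component failures per day.

    Assumes failures are independent and spread evenly over a 365-day year.
    """
    return fleet_size * annual_failure_rate / 365


# Hypothetical inputs, chosen only for illustration (not measured data).
fleet_size = 100_000        # components in an imagined datacenter
annual_failure_rate = 0.02  # assumed: 2% of components fail each year

per_day = expected_daily_interventions(fleet_size, annual_failure_rate)
print(f"{per_day:.1f} expected repairs per day")
```

Under these made-up inputs the result is roughly five to six repairs every day. The canary would be watching a number like this trend toward zero human interventions.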

Even if the odds estimates given here are wrong, I am unaware of any other approaches that serve as ‘canaries in the coalmine’, apart from tracking AGI capabilities, which may not warn us of hard-takeoff scenarios.