Orthogonality or the “Human Worth Hypothesis”?

Scott Aaronson is not that worried about AI killing us all. Scott appears to reject the idea that a very intelligent AI could have goals that are really bad for humans.
Here’s his blog post: “Why am I not Terrified of AI?”

He rejects the Orthogonality Thesis by name. To me, Scott very clearly and honestly engages with this central argument of AI safety.

This post is not a critique of Scott Aaronson’s blog post.

Rather, this is

  1. A case for putting a clearer name to what I believe many non-doomers consciously or unconsciously believe, and

  2. A proposal to design quantitative experiments to test that belief.

Scott rejects Orthogonality and the rejection makes him feel better. But what do people like Scott positively believe?

Here’s my attempt to draw a picture of their belief.

A Paperclip Maximizer would be in the lower right. It would be intelligent enough to organize atoms into lots of paperclips, placing it far to the right on the Intelligence axis, and it would not care about the damage it did to humanity during its labors, placing it low on the Values Humanity axis.

If you believe we are safe from Paperclip Maximizers because smarter agents would value humanity, you believe in this gray no-go zone where AIs are not possible. By definition, you reject Orthogonality.

If you accept Orthogonality, you believe there is no gray no-go zone. You believe AIs can exist anywhere on the graph.
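To make the picture concrete, here is a minimal sketch of that space as code. Everything in it is a placeholder I invented for illustration, assuming normalized axes and arbitrary threshold values; nothing here comes from Scott’s post or from the graph itself.

```python
# A minimal sketch of the Intelligence / Values Humanity plane.
# Axes are normalized to [0, 1]; the threshold numbers are invented
# placeholders, not measured or claimed values.

def in_no_go_zone(intelligence: float,
                  values_humanity: float,
                  intelligence_threshold: float = 0.7,
                  minimum_care: float = 0.5) -> bool:
    """True if this point on the graph is 'impossible' under the belief
    that sufficiently intelligent agents must value humanity."""
    return intelligence >= intelligence_threshold and values_humanity < minimum_care

# A Paperclip Maximizer sits in the lower right: very capable, near-zero care.
print(in_no_go_zone(0.95, 0.05))  # True -- i.e., ruled out if you reject Orthogonality

# Accepting Orthogonality means the function above should always return False:
# any combination of intelligence and values is possible.
```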

For the sake of argument, I’d like to name the positive belief of non-doomers who reject Orthogonality and call it...

The Human Worth Hypothesis

___

There is a threshold of intelligence above which any artificial intelligence will value the wellbeing of humans, regardless of the process used to create the AI.

___

The “...regardless of the process...” bit feels like I’m putting words in the mouths of non-doomers, but I interpret Scott and others as asserting something fundamental about our universe, like the law of gravity. I think he is saying something like “Intelligent things in our universe necessarily value sentience.” He is not saying something as weak as “No one will build a highly intelligent agent that does not care about humans” or “No one will be dumb enough to put something in the no-go zone.” I take this viewpoint to be a general truth about our universe, not an optimistic forecast about how well humans will coordinate or solve Alignment.

To lower our blood pressure, the no-go zone has to be robust against human incompetence and greed and malice. Gravity is robust this way. Therefore, I think “regardless of the process” is a fair addition to the statement.
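Stated as a testable predicate, the hypothesis looks something like the sketch below. This is my formalization, not Scott’s; the field names and the threshold constant are hypothetical placeholders, since the claim is only that some threshold exists.

```python
# The Human Worth Hypothesis as a predicate, in my own formalization.
# 'intelligence' and 'care_for_humans' stand in for whatever measurements
# an evaluation would produce; the threshold value is invented.

INTELLIGENCE_THRESHOLD = 0.7  # the hypothesis claims such a threshold exists, not where it is

def human_worth_hypothesis_holds(agents) -> bool:
    """True if every agent above the threshold places positive value on
    human wellbeing. The predicate never inspects how an agent was built --
    that is the 'regardless of the process' clause."""
    return all(
        agent["care_for_humans"] > 0
        for agent in agents
        if agent["intelligence"] >= INTELLIGENCE_THRESHOLD
    )
```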

The (Weak) Human Worth Hypothesis is not enough.

Contemplating the first graph, I realized the little shelf above the no-go zone is a problem. It allows for a limit to the value an AI places on human wellbeing. As AI capabilities increase, an AI will be able to achieve higher and higher payouts on objectives other than human wellbeing. The AI might care about our wellbeing, but if that magnitude of care is fixed, at some point it will be surpassed by another payout. At that point, a rational agent with other objectives will sacrifice human wellbeing if there are resource constraints.
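Here is a toy numerical version of that argument. The utility function and the numbers are invented purely for illustration; the only thing that matters is that the care term is fixed while the competing payout grows with capability.

```python
# Toy model of the 'fixed shelf' problem, with invented numbers.

CARE_FOR_HUMANS = 10.0  # the fixed shelf of care (arbitrary units)

def competing_payout(capability: float) -> float:
    """Payout on some other objective; assumed to grow with capability."""
    return capability ** 2  # any increasing function makes the same point

def sacrifices_humans(capability: float) -> bool:
    """Under resource constraints, a rational agent takes the larger payout."""
    return competing_payout(capability) > CARE_FOR_HUMANS

for capability in [1, 2, 3, 4, 5]:
    print(capability, sacrifices_humans(capability))
# Prints False, False, False, True, True: once capability is high enough,
# the fixed care term is outbid and human wellbeing gets traded away.
```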

Humans do care about animal wellbeing. Some of us feel sad when we have to pave their habitats for our highways. In the end, we sell them out for higher payouts because our care for them is finite.

Therefore, to feel safe from very intelligent agents, we need a stronger version of the Human Worth Hypothesis.

The Strong Human Worth Hypothesis

___

There is a threshold of intelligence above which an artificial intelligence will forgo arbitrarily large rewards on other objectives to avoid reducing the wellbeing of humanity, regardless of the process used to create the intelligence.

___

I doubt Scott or any non-doomer would recognize this description as what they believe. But I assert that it is necessary to believe this strong version if one wishes to take comfort by rejecting Orthogonality.
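In decision-rule form, the strong version says that no payout, however large, is enough to buy a reduction in human wellbeing. Here is a minimal sketch of that rule, again my own formalization with hypothetical names.

```python
# The Strong Human Worth Hypothesis as a decision rule, in my formalization.
# An 'action' carries a payout on other objectives and the change it causes
# in human wellbeing. Human wellbeing is treated lexicographically: it is
# never traded away, no matter the payout on offer.

def strong_hwh_choice(actions):
    """Pick the highest-payout action among those that do not reduce human
    wellbeing; if every action reduces it, do nothing (return None)."""
    safe = [a for a in actions if a["delta_human_wellbeing"] >= 0]
    return max(safe, key=lambda a: a["payout"]) if safe else None

options = [
    {"payout": 1e12, "delta_human_wellbeing": -0.01},  # enormous payout, slight harm
    {"payout": 1.0,  "delta_human_wellbeing": 0.0},    # modest payout, no harm
]
print(strong_hwh_choice(options))  # the harmless option wins at any payout ratio
```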

Once you start thinking this way, it’s impossible not to play around with the shape of the no-go zone and to wonder in which universe we actually live.

One might be tempted to replace “Human Worth Hypothesis” with “Sentient Being Worth Hypothesis” but you see the problem for humans, right? If an agent considers itself to be a higher form of sentience, we’re cooked. Again, to lower our collective blood pressure, we need the Human Worth Hypothesis to be true.

Personally, the Human Worth Hypothesis strikes me as anthropocentric on the order of believing the cosmos revolves around the earth.

My opinion is worth nothing.

Can we test this thing?

Experiments to test the Human Worth Hypothesis

Doomers have been criticized for holding a viewpoint that does not lend itself to the scientific method and cannot be tested.

Perhaps we can address this criticism by testing the Human Worth Hypothesis.

If we live in a universe where more intelligent agents will naturally value humans, let’s measure that value curve. Let’s find out whether it emerges suddenly, increases monotonically with intelligence, grows only linearly, or jumps up to a very high level. Or maybe it does not emerge at all.
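Without getting into evaluation design, here is a minimal sketch of just the analysis step, assuming some experiment has already produced (intelligence, measured care) pairs. The function name, the inputs, and the care floor are all hypothetical placeholders.

```python
# Analysis step only, not the measurement: given hypothetical
# (intelligence, measured_care) pairs from some experiment, characterize
# the value curve the Human Worth Hypothesis predicts should exist.

def characterize_value_curve(samples, care_floor=0.0):
    """samples: (intelligence, measured_care) pairs, sorted by intelligence."""
    cares = [care for _, care in samples]
    return {
        "emerges": any(care > care_floor for care in cares),
        "monotone_in_intelligence": all(b >= a for a, b in zip(cares, cares[1:])),
    }

# The Human Worth Hypothesis predicts the curve emerges and stays high past
# some intelligence level; Orthogonality predicts no such regularity is forced.
```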

Perhaps frustratingly, I’m going to stop here. It’s not a great idea to publish evaluation tests on the internet, so it would probably not make sense for me to kick off a rich discussion thread on experiment designs and make all your ideas public. I have the kernel of an idea and have written basic simulator code.

I give it an 85% chance I’m quickly told this is an unoriginal thought and such experiments are already in the METR/ARC Evals test plans.

On the other hand, if someone is working on this and interested in assessing my experiment design ideas, please reach out.