tl;dr: I show that model splintering can be seen as going beyond the human training distribution (the distribution of real and imagined situations we have firm or vague preferences over), and argue that this is at the heart of AI alignment.

You are training an AI-CEO to maximise stock value, training it on examples of good/bad CEO decisions and corresponding stock-price increases/decreases.

There are some obvious failure modes. The AI could wirehead by hacking the stock-ticker, or it could do the usual optimise-the-universe-to-maximise-stock-price-for-now-dead-shareholders.

Let’s assume that we’ve managed to avoid these degenerate solutions. Instead, the AI-CEO tries for something far weirder.

Beyond the human training distribution

The AI-CEO reorients the company towards the production of semi-sentient teddy bears that are generated in part from cloned human brain tissue. These teddies function as personal assistants and companions, and prototypes are distributed at the annual shareholder meeting.

However, the public reaction is negative, and the government bans the further production of these teddies. Consequently, the company shuts down for good. But the shareholders, who own the only existing teddies, get great kudos from possessing these rare entities, who also turn out to be great and supportive listeners, and excellent at managing their owners’ digital media accounts, increasing the owners’ popularity and status.

And that, of course, was the AI-CEO’s plan all along.

Hold off from judging this scenario, just for a second. And when you do judge it, observe your mental process as you do so. I’ve tried to build this scenario so that it is:

Outside the AI-CEO’s training distribution.
Outside the human designer’s implicit training distribution—few of us have thought deeply about the morality of semi-sentient teddy bears made with human tissue.
Possibly aligned with the AI-CEO’s goals.
Neither clearly ideal nor clearly disastrous (depending on your moral views, you may need to adjust the scenario a bit to hit this).

If I’ve pitched it right, your reaction to the scenario should be similar to mine—“I need to think about this more, and I need more information”. The AI-CEO is clearly providing some value to the shareholders; whether this value can be compared to the stock price is unclear. It’s being manipulative, but not doing anything illegal. As for the teddies themselves… I (Stuart) feel uncomfortable that they are grown from human brain tissue, but they are not human, and we humans have relationships with less sentient beings (pets). I’d have to know more about potential suffering and the preferences and consciousness—if any—of these teddies...

I personally feel that, depending on circumstances, I could come down in favour of or against the AI-CEO’s actions. If your own views are more categorical, see if you can adjust the scenario until it’s similarly ambiguous for you.

Two types of model splintering

This scenario involved model-splintering in two ways. The first was when the AI-CEO decided not to follow the route of “increase share price”, and instead found another way of giving value to the shareholders, while sending the price to zero. This is unexpected, but it’s not a moral surprise; we can assess its value by trying to quantify the extra value the teddies give their owners, and comparing it with the lost share price. We want to check that, whatever model the AI-CEO is using to compare these two values, it’s a sensible one.
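As a toy illustration of this first splintering, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Outcome` fields, the made-up numbers, and especially the `exchange_rate` parameter, which stands in for whatever model is used to compare teddy-value with share-price value. It only shows that two value models can agree on every training example and still disagree on the teddy plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Result of a CEO decision, in made-up units of shareholder value."""
    share_price_change: float  # change in the value of the shares
    non_price_value: float     # value delivered some other way (e.g. owning a rare teddy)


def proxy_value(outcome: Outcome) -> float:
    """The training signal: only the share price counts."""
    return outcome.share_price_change


def broad_value(outcome: Outcome, exchange_rate: float) -> float:
    """One candidate extrapolation: price plus other value at an assumed exchange rate."""
    return outcome.share_price_change + exchange_rate * outcome.non_price_value


def splinters(outcome: Outcome, exchange_rate: float, tolerance: float = 1e-9) -> bool:
    """True when the two value models disagree about this outcome."""
    return abs(proxy_value(outcome) - broad_value(outcome, exchange_rate)) > tolerance


# Hypothetical training examples: all value flows through the share price.
training = [Outcome(12.0, 0.0), Outcome(-5.0, 0.0), Outcome(3.5, 0.0)]

# Hypothetical teddy plan: the shares become worthless, the teddies provide something else.
teddy_plan = Outcome(share_price_change=-100.0, non_price_value=150.0)

for rate in (0.5, 1.0):
    # The verdict on the plan flips sign depending on the assumed exchange rate.
    print(rate, broad_value(teddy_plan, rate), splinters(teddy_plan, rate))
```

On the training examples the two models are indistinguishable, so the training data cannot tell us which one the AI-CEO has learnt. Whether the teddy plan counts as a gain or a loss then depends entirely on the assumed exchange rate, which is exactly the part we would want to inspect.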

The second model-splintering is the morality of creating the teddies. For most of us, this will be a new situation, which we will judge by connecting it to previous values or analogies (excitement about the possibilities, morality of using human tissue, morality of sentient beings whose preferences may or may not be satisfied, morality of the master-servant relationship that this resembles, slippery slope effects vs. early warning, etc).

Like the first time you encounter a tricky philosophical thought experiment, or the first time you deal with ambiguous situations where norms come into conflict, what’s happening is that you are moving beyond your moral training data. This does not fit neatly into previous categories, nor can it easily be analysed with the tools of previous categories. But we are capable of analysing it, somehow, and of coming up with non-stupid decisions.

Why this is the heart of AI alignment

Our extrapolated under- (but not un-)defined values

So, we can extrapolate our values in non-stupid ways to these new situations. But that extrapolation may be contingent; a lot may depend on what analogies we reach first, on how we heard about the scenario, and so on.

But let’s reiterate the “non-stupid” point. Our contingent extrapolations don’t tend to fail disastrously (at least not when we have to implement our plans). For instance, humans rarely reach the conclusion that wireheading (hacking the stock-ticker) is the moral thing to do.

This skill doesn’t always work (humans are much more likely than AIs to extrapolate into the “actively evil” zone, rather than the “lethally indifferent” one), but it is a skill that seems necessary to resolve extrapolated/model-splintered situations in non-disastrous ways.

Superintelligences need to solve these issues

See the world from the point of view of a superintelligence. The future is filled with possibilities and plans, many of them far more wild and weird than the example I just described, most of them articulated in terms of concepts and definitions beyond our current human minds.

And an aligned superintelligence needs to decide what to do about them. Even if it follows a policy that is mostly positive, this policy will have weird, model-splintered side effects. It needs to decide whether these side effects are allowable, or whether it must devote resources to removing them. Maybe the cheapest company it can create will recruit someone who, with their new salary, will start making these teddies themselves. It can avoid employing that person, but that’s an extra cost. Should it pay that cost? As it looks upon all the humans in the world, it can predict that their behaviours will change as a result of developing its current company. Which behaviour changes are allowed, and which should be avoided or encouraged?

Thus it cannot make decisions in these situations without going beyond the human training distribution; hence it is essential that it learns to extrapolate moral values in a way similar to how humans do.

Beyond the human training distribution: would the AI CEO create almost-illegal teddies?
