How to judge moral learning failure

A putative new idea for AI control; index here.

I’m finding many different results that show problems and biases with the reward learning process.

But there is a meta problem, which is answering the question: “If the AI gets it wrong, how bad is it?”. It’s clear that there are some outcomes which might be slightly suboptimal – like the complete extinction of all intelligent life across the entire reachable universe. But it’s not so clear what to do if the error is smaller.


For instance, suppose that there are two moral theories, M1 and M2. An AI following M1 would lead to outcome O1, a trillion trillion people leading superb lives. An AI following M2 would lead to outcome O2, a trillion people leading even more superb lives.

If one of M1 or M2 were the better moral theory from the human perspective, how would we assess the consequences of the AI choosing the wrong one? One natural way of doing this is to use these moral theories to assess each other. How bad, from M1’s perspective, is O2 compared with O1? And vice versa for M2. Because of how value accumulates under different theories, it’s plausible that M1(O1) could be a million times better than M1(O2) (as O1 is finely adapted to M1’s constraints). And similarly, M2(O1) could be a million times worse than M2(O2).
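
To make the cross-assessment concrete, here is a minimal sketch in Python with made-up utility numbers (none of these figures come from an actual model); it just encodes the “million times better / million times worse” ratios above:

```python
# Illustrative only: hypothetical utilities chosen so that each theory rates
# its own preferred outcome a million times higher than the other theory's.
utility = {
    ("M1", "O1"): 1_000_000.0,  # O1 is finely adapted to M1's constraints
    ("M1", "O2"): 1.0,
    ("M2", "O1"): 1.0,
    ("M2", "O2"): 1_000_000.0,  # and O2 to M2's
}

for theory in ("M1", "M2"):
    ratio = utility[(theory, "O1")] / utility[(theory, "O2")]
    print(f"{theory}: O1 is {ratio:g} times as good as O2 by its own lights")
# M1: O1 is 1e+06 times as good as O2 by its own lights
# M2: O1 is 1e-06 times as good as O2 by its own lights
```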

So from that perspective, the cost of choosing the wrong moral theory is disastrous, by a factor of a million. But from my current perspective (and that of most humans), the costs do not seem so large; both O1 and O2 seem pretty neat. I certainly wouldn’t be willing to run a 99.9999% chance of human extinction, in exchange for the AI choosing the better moral theory.
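
The 99.9999% figure isn’t arbitrary: under the made-up M1 utilities above, a one-in-a-million shot at O1 (with extinction otherwise) comes out break-even with O2 for sure, so the theory’s own accounting would endorse the gamble – which is exactly what seems wrong. A quick check, continuing the purely illustrative numbers:

```python
# Continuing the illustrative utilities: under M1, is a 1-in-a-million shot at O1
# (with extinction otherwise) as good as getting O2 for sure?
p_better = 1e-6                                    # chance the AI picks the better theory
u_O1, u_O2, u_extinction = 1_000_000.0, 1.0, 0.0   # hypothetical M1 utilities

ev_gamble = p_better * u_O1 + (1 - p_better) * u_extinction
print(ev_gamble, ev_gamble >= u_O2)   # 1.0 True – M1's internal accounting endorses the gamble
```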

Convincing humans what to value

At least part of this stems, I think, from the fact that humans can be convinced of many things. A superintelligent AI could convince us to value practically anything. Even if we restrict ourselves to “non-coercive” or “non-manipulative” convincing (and defining those terms is a large part of the challenge), there’s still a very wide space of possible future values. Even if we restrict ourselves massively – to future values we might have with no AI convincing at all, just our continuing to live our lives according to the vagaries of fortune – that’s still a wide span.

So our values are underdetermined in important ways (personal example: I didn’t expect I’d become an effective altruist or an expected utility maximiser, or ever come to respect (some) bureaucracies). So saying “you will come to value M1, and M1 ranks O1 way above O2” doesn’t mean that you should value O1 way above O2, since it’s possible that, given different future interactions, you would come to value M2 or something similar.

We should also give some thought to our future values in impossible universes. It’s perfectly plausible that, if we existed in a specific universe (eg a fantasy Tolkien universe), we might come to value, non-coercively and non-manipulatively, a moral theory M3, even though it’s almost certain we would never come to value M3 in our actual physical or social universe. We are still proto-M3 believers.

I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the constraint that the probability of any given Ri never falls below a certain floor. There’s an irreducible core of multiple moralities that never goes away. The purpose of this core is not to inform our future decisions, or to train the AI, but purely to assess the goodness of different moralities in the future.
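
A minimal sketch of what that could look like, with all specifics (the floor value, the candidate utilities, how the probabilities were learned) assumed purely for illustration:

```python
import numpy as np

# Sketch, under assumed details: keep a distribution over candidate reward
# functions R = {R_i}, but never let any R_i's probability fall below an
# irreducible floor. The floored mixture is not used to pick actions or to
# train the AI; it only scores how good an outcome looks across the
# moralities that remain in the core.

FLOOR = 0.05  # assumed irreducible probability for each candidate R_i

def apply_floor(probs, floor=FLOOR):
    """Clip each probability up to the floor, then renormalise."""
    probs = np.maximum(np.asarray(probs, dtype=float), floor)
    return probs / probs.sum()

def assess_outcome(outcome_utilities, probs):
    """Score an outcome against the floored mixture; outcome_utilities[i]
    is the (hypothetical) utility R_i assigns to the outcome."""
    return float(np.dot(apply_floor(probs), outcome_utilities))

# Example: learning has nearly ruled out the third reward function, but it
# still gets weight when assessing outcomes.
learned_probs = [0.90, 0.099, 0.001]
print(apply_floor(learned_probs))                        # roughly [0.858 0.094 0.048]
print(assess_outcome([1.0, 0.8, 0.0], learned_probs))    # roughly 0.93
```

The floor is what keeps a nearly-ruled-out morality available for assessing outcomes, without it ever driving decisions or training.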