How to judge moral learning failure

A putative new idea for AI control; index here.

I’m finding many different results that show problems and biases with the reward learning process.

But there is a meta problem, which is answering the question: “If the AI gets it wrong, how bad is it?”. It’s clear that there are some outcomes which might be slightly suboptimal – like the complete extinction of all intelligent life across the entire reachable universe. But it’s not so clear what to do if the error is smaller.


For instance, suppose that there are two moral theories, M1 and M2. An AI following M1 would lead to outcome O1, a trillion trillion people leading superb lives. An AI following M2 would lead to outcome O2, a trillion people leading even more superb lives.

If one of M1 or M2 were the better moral theory from the human perspective, how would we assess the consequences of the AI choosing the wrong one? One natural way of doing this is to use these moral theories to assess each other. How bad, from M1’s perspective, is O2 compared with O1? And vice versa for M2. Because of how value accumulates under different theories, it’s plausible that M1(O1) could be a million times better than M1(O2) (as O1 is finely adapted to M1’s constraints). And similarly, M2(O1) could be a million times worse than M2(O2).
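
To make the cross-assessment concrete, here is a minimal sketch in Python with made-up utility numbers (none of these figures come from an actual model); it just encodes the “million times better / million times worse” ratios above:

```python
# Illustrative only: hypothetical utilities chosen so that each theory rates
# its own preferred outcome a million times higher than the other theory's.
utility = {
    ("M1", "O1"): 1_000_000.0,  # O1 is finely adapted to M1's constraints
    ("M1", "O2"): 1.0,
    ("M2", "O1"): 1.0,
    ("M2", "O2"): 1_000_000.0,  # and O2 to M2's
}

for theory in ("M1", "M2"):
    ratio = utility[(theory, "O1")] / utility[(theory, "O2")]
    print(f"{theory}: O1 is {ratio:g} times as good as O2 by its own lights")
# M1: O1 is 1e+06 times as good as O2 by its own lights
# M2: O1 is 1e-06 times as good as O2 by its own lights
```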

So from that perspective, the cost of choosing the wrong moral theory is disastrous, by a factor of a million. But from my current perspective (and that of most humans), the costs do not seem so large; both O1 and O2 seem pretty neat. I certainly wouldn’t be willing to run a 99.9999% chance of human extinction, in exchange for the AI choosing the better moral theory.
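
The 99.9999% figure isn’t arbitrary: under the made-up M1 utilities above, a one-in-a-million shot at O1 (with extinction otherwise) comes out break-even with O2 for sure, so the theory’s own accounting would endorse the gamble – which is exactly what seems wrong. A quick check, continuing the purely illustrative numbers:

```python
# Continuing the illustrative utilities: under M1, is a 1-in-a-million shot at O1
# (with extinction otherwise) as good as getting O2 for sure?
p_better = 1e-6                                    # chance the AI picks the better theory
u_O1, u_O2, u_extinction = 1_000_000.0, 1.0, 0.0   # hypothetical M1 utilities

ev_gamble = p_better * u_O1 + (1 - p_better) * u_extinction
print(ev_gamble, ev_gamble >= u_O2)   # 1.0 True – M1's internal accounting endorses the gamble
```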

Convincing humans what to value

At least part of this stems, I think, from the fact that humans can be convinced of many things. A superintelligent AI could convince us to value practically anything. Even if we restrict ourselves to “non-coercive” or “non-manipulative” convincing (and defining those terms is a large part of the challenge), there’s still a very wide space of possible future values. Even if we restrict ourselves massively – to future values we might have with no AI convincing at all, just our continuing to live our lives according to the vagaries of fortune – that’s still a wide span.

So our values are underdetermined in important ways (personal example: I didn’t expect I’d become an effective altruist or an expected utility maximiser, or ever come to respect (some) bureaucracies). So saying “you will come to value M1, and M1 ranks O1 way above O2” doesn’t mean that you should value O1 way above O2, since it’s possible that, given different future interactions, you would come to value M2 or something similar.

We should also give some thought to our future values in impossible universes. It’s perfectly plausible that, if we existed in a specific universe (eg a fantasy Tolkien universe), we might come to value, non-coercively and non-manipulatively, a moral theory M3, even though it’s almost certain we would never come to value M3 in our actual physical or social universe. We are still proto-M3 believers.

I’m thinking of modelling this as classical moral uncertainty over plausible value/reward functions in a set R={Ri}, but with the constraint that the probability of any given Ri never falls below a certain floor. There’s an irreducible core of multiple moralities that never goes away. The purpose of this core is not to inform our future decisions, or to train the AI, but purely to assess the goodness of different moralities in the future.
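
A minimal sketch of what that could look like, with all specifics (the floor value, the candidate utilities, how the probabilities were learned) assumed purely for illustration:

```python
import numpy as np

# Sketch, under assumed details: keep a distribution over candidate reward
# functions R = {R_i}, but never let any R_i's probability fall below an
# irreducible floor. The floored mixture is not used to pick actions or to
# train the AI; it only scores how good an outcome looks across the
# moralities that remain in the core.

FLOOR = 0.05  # assumed irreducible probability for each candidate R_i

def apply_floor(probs, floor=FLOOR):
    """Clip each probability up to the floor, then renormalise."""
    probs = np.maximum(np.asarray(probs, dtype=float), floor)
    return probs / probs.sum()

def assess_outcome(outcome_utilities, probs):
    """Score an outcome against the floored mixture; outcome_utilities[i]
    is the (hypothetical) utility R_i assigns to the outcome."""
    return float(np.dot(apply_floor(probs), outcome_utilities))

# Example: learning has nearly ruled out the third reward function, but it
# still gets weight when assessing outcomes.
learned_probs = [0.90, 0.099, 0.001]
print(apply_floor(learned_probs))                        # roughly [0.858 0.094 0.048]
print(assess_outcome([1.0, 0.8, 0.0], learned_probs))    # roughly 0.93
```

The floor is what keeps a nearly-ruled-out morality available for assessing outcomes, without it ever driving decisions or training.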