Here’s a two-sentence argument for misalignment that I think is both highly compelling to laypeople and technically accurate at capturing the key issue:
When we train AI systems to be nice, we’re giving a bunch of random programs a niceness exam and selecting the programs that score well. If you gave a bunch of humans a niceness exam where the humans would be rewarded for scoring well, do you think you would get actually nice humans?
To me this seems like a solid attempt at conveying [misalignment is possible, even with a good test], but not necessarily [misalignment is likely, even with a good test]. (Not that I have a great alternative suggestion.)
Important disanalogies seem to be:
1) Most humans aren’t good at convincingly faking niceness (I think!). The listener may assume a test good enough to exploit this weakness most of the time.
2) The listener will assume that [scoring highly on niceness] isn’t the human’s only reward (both things like [the desire to feel honest] and [worry about the consequences of being caught cheating]).
3) A fairly large proportion of humans are nice (I think!).
The second could be addressed somewhat by raising the stakes. The first seems hard to remedy within this analogy. I’d be a little concerned that people initially buy the argument, then think for themselves and conclude “But if we design a really clever niceness test, then it’d almost always work—all we need is clever people to work for a while on some good tests”. Combined with (3), this might seem like a decent solution.
Overall, I think what’s missing is that we’d expect [our clever test looks to us as if it works] well before [our clever test actually works]. My guess is that the layperson isn’t going to have this intuition in the human-niceness-test case.
I expect this is very susceptible to opinions about human nature. Someone who thinks humans ARE generally nice is likely to answer “yes, of course” to your question. For someone who thinks humans are generally extremely context-sensitive, appearing nice only in the co-evolved social settings in which we typically interact, the answer is “who knows?”. But the latter group doesn’t need to be convinced; we’re already worried.
Surely nobody thinks that all humans are nice all the time, or that nobody would ever fake a niceness exam. I think humans are generally pretty good, but that always has to come with a bunch of caveats, because you don’t have to look very far into human history to see quite a lot of human-committed atrocities.
I think the answer is an obvious yes, all other things held equal. Of course, what happens in reality is more complex than this, but I’d still say yes in most cases, primarily because I think aligned behavior is very simple: so simple that it either only barely loses out to the deceptive model or outright has the advantage, depending on the programming language and low-level details. Thus we only need to transfer 300–1000 bits at most, which is likely very easy.
Much more generally, my fundamental claim is that the complexity of pointing to human values is very similar to the complexity of pointing to the set of all long-term objectives; it can be somewhat easier or harder, but I don’t buy the assumption that pointing to human values is far harder than pointing to the set of long-term goals.