How likely is the AI that knows it’s evil? Or: is a human-level understanding of human wants enough?

I just read Eliezer’s short story Failed Utopia #4-2 for the first time; it’s part of the Fun Theory sequence. I found it fascinating and puzzling at the same time, in particular this part:

The withered figure inclined its head. “I fully understand. I can already predict every argument you will make. I know exactly how humans would wish me to have been programmed if they’d known the true consequences, and I know that it is not to maximize your future happiness but for a hundred and seven precautions. I know all this already, but I was not programmed to care.”
“And your list of a hundred and seven precautions, doesn’t include me telling you not to do this?”
“No, for there was once a fool whose wisdom was just great enough to understand that human beings may be mistaken about what will make them happy. You, of course, are not mistaken in any real sense—but that you object to my actions is not on my list of prohibitions.” The figure shrugged again. “And so I want you to be happy even against your will. You made promises to Helen Grass, once your wife, and you would not willingly break them. So I break your happy marriage without asking you—because I want you to be happier.”
“How dare you!” Stephen burst out.
“I cannot claim to be helpless in the grip of my programming, for I do not desire to be otherwise,” it said. “I do not struggle against my chains. Blame me, then, if it will make you feel better. I am evil.”

In the story, the AI describes its programmer as “almost wise.” But I wonder if that gives the programmer too much credit. In retrospect, it seems rather obvious that the programmer should have programmed the AI to reprogram itself the way humans “would wish it to have been programmed if they’d known the true consequences.”

The first problem with that strategy that comes to mind is that there might be no way to give the AI that command, because you have to give the AI its commands before it is able to understand them. But apparently, this AI was able to understand the wish that humans be happy, and also a list of one hundred and seven precautions. It seems unlikely that it could understand all that, yet fail to understand “reprogram yourself the way we would wish you to have been programmed if we’d known the true consequences.”

Thus, while the scenario seems possible, it doesn’t seem terribly likely. It’s a scenario where a human successfully did almost all the work needed to make a desirable AI, but made one very stupid mistake. And that line of thought suggests it’s fairly unlikely that we’ll make an evil AI that knows it’s evil, as long as we manage to successfully propagate the meme “program AIs with the command ‘don’t be evil’ if you can” among AI programmers. Personally, I’m inclined to think the bigger risk is an AI with the wrong mix of abilities: say, superhuman abilities in defeating computer security, designing technology, and war planning, but sub-human abilities when it comes to understanding what humans want.

It may be that I’ve been taking the story a bit too seriously, and that Eliezer really thinks the hard part is getting the AI to understand commands like “make us happy” and “do what we really want”—perhaps because, as I said above, we will need to give the AI its commands before it can understand them on its own, without our elaborating them further.

But a related line of thought: most of the time, with humans, sincerely wanting to follow a command is sufficient to follow it in a non-evil manner. That’s because we’re fairly good at understanding not just the literal meaning of other humans’ words, but also their intentions. In real life (make that present-day, pre-superintelligence real life), the main reason to try to genie-proof a command is if it’s intended to bind humans who don’t want to follow it (which is true of laws and contracts).

The reason humans often don’t want to follow each other’s commands is that evolution shaped us to be selfish and nepotistic. That, however, won’t be a problem with AIs. We can program them to be sincere about following the spirit, not the letter, of our commands, as long as we can get them to understand what that means.

A question worth asking here is this: with humans, sincerity plus our actual skill at understanding each other is sufficient for us to follow commands in a non-evil manner. Will the same be true of agents with superhuman powers? That is, will sincerity plus a human-level skill at understanding humans be sufficient for superhumanly powerful agents to follow human commands in a non-evil manner?

It seems your answer to that question should have a big impact on how hard you think AI safety is: if the answer is “yes,” we have a route to safe AI that could work even if giving the command “reprogram yourself the way we would wish you to have been programmed if we’d known the true consequences” turns out to be too hard.