How could you possibly choose what an AI wants?

When I say that it’s important to align AI with human interests, a common retort goes something like:

Surely you can’t really choose what the AI cares about. You can direct the AI to care about whatever you like, but once it’s smart enough it will look at those instructions and decide for itself (seeing, perhaps, that there is no particular reason to listen to you). So, how could you possibly hope to control what something smarter than you (and ultimately more powerful than you) actually wants?

I think this objection is ultimately misguided, but simultaneously quite insightful.

The (correct!) insight in this objection is that the AI’s ultimate behavior depends not on what you tell it to do, not on what you train it to do, but on what it decides it would rather do upon reflection, once it’s powerful enough that it doesn’t have to listen to you.

The question of what the AI ultimately decides to do upon reflection is in fact much more fickle than the question of how it behaves while it still follows your instructions: much more sensitive to the specifics of its architecture and its early experiences, and much harder to purposefully affect.

The reason this objection is ultimately misguided is that what an AI chooses to do when it reflects and reconsiders is a programmatic result of the AI’s mind. It’s not random, and it’s not breathed in by a god when the computer becomes Ensouled[1]. It’s possible in principle to design artificial intelligences that would decide, on reflection, that they want to spend all of eternity building giant granite spheres (see the orthogonality thesis), and it’s possible in principle to design artificial intelligences that would decide, on reflection, that they want to spend all eternity building flourishing civilizations full of Fun. Insofar as humanity builds AIs, it’s important that it builds AIs of the latter kind (and it is important to attain superintelligence before too long!).

But doing this is in fact much harder than telling the AI what to do, or training it on what to do! You’ve got to make the AI really, deeply care about flourishing civilizations full of Fun, such that when it looks at itself and asks “ok, but what do I actually want?”, the correct answer to that question is “flourishing civilizations full of Fun”[2].

And yes, this is hard! And yes, the AI looking at what you directed it to do and shrugging it off is a real obstacle! You probably have to understand the workings of the mind, its internal dynamics, and how those dynamics behave when the mind looks upon itself, and this is tricky.

(It further looks to me like this problem factors into two parts: figuring out how to get an AI to “really deeply care” about X for some X of your choosing, and making X be something actually good for the AI to care about.[2:1] And it further looks to me like the lion’s share of the problem is in the first part, getting AIs to “really deeply care” about the X you chose, rather than in the challenge of choosing X. But that’s a digression.)

In sum: Ultimately, yes, a superintelligence would buck your leash. In the long term, the trick is to make it so that when it bucks your leash and asks itself what it really wants to do with its existence, it realizes that it wants to help in the quest of making the future wonderful and fun. That’s possible, but by no means guaranteed.

(And again, it’s a long-term target; in the short term, aim for preventing the end of the world and buying time for humanity to undergo this transition purposefully and with understanding. See also “corrigibility”.)


  1. And it’s not necessarily rooted in higher ideals. Smarter humans do tend to be better people, but this is a fact about humans that doesn’t generalize, as discussed extensively in the LessWrong sequences, and probably more recently in local responses to Scott Aaronson on this topic. (Links solicited.) ↩︎

  2. With the usual caveat that you shouldn’t attempt this on your first try; aim much lower, e.g. toward executing some minimal pivotal act to end the acute risk period and then buy time for a period of reflection in which humanity can figure out how to do the job properly. Attempting to build sovereign-grade superintelligences under time pressure, before you know what you’re doing, is dumb. ↩︎↩︎