Principles in AI alignment

Last edit: 16 Feb 2017 18:54 UTC by Eliezer Yudkowsky

A ‘principle’ of AI alignment is something we want in a broad sense for the whole AI, which has informed narrower design proposals for particular parts or aspects of the AI.

For example:

The Non-adversarial principle says that the AI should never be searching for a way to defeat our safety measures or do something else we don’t want, even if we think this search will come up empty; it’s just the wrong thing for us to program computing power to do.
- This informs the proposal of Value alignment problem: we ought to build an AI that wants to attain the class of outcomes we want to see.
- This informs the proposal of Corrigibility, subproposal Utility indifference: if we build a suspend button into the AI, we need to make sure the AI experiences no instrumental pressure to disable the suspend button.
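As a rough illustration of the "no instrumental pressure" idea in the suspend-button example above, here is a toy expected-utility calculation. Everything in it is an assumption made for illustration: the function names, the single-step model, the fixed probability of a button press, and the simple compensation scheme are hypothetical and are not a specification of any actual Utility indifference proposal.

```python
def expected_utility(disable_button: bool, p_press: float,
                     task_utility: float, compensate: bool) -> float:
    """Toy expected utility for an agent facing a suspend button.

    Hypothetical model: the agent earns `task_utility` if it is never
    suspended. If it leaves the button alone, the button is pressed with
    probability `p_press`. With `compensate=True`, a suspended agent is
    credited the utility it would have earned had it not been suspended.
    """
    if disable_button:
        # The button can no longer be pressed, so the task always completes.
        return task_utility
    suspended_utility = task_utility if compensate else 0.0
    return (1 - p_press) * task_utility + p_press * suspended_utility


def pressure_to_disable(p_press: float, task_utility: float,
                        compensate: bool) -> float:
    """How much better off the agent expects to be if it disables the button."""
    return (expected_utility(True, p_press, task_utility, compensate)
            - expected_utility(False, p_press, task_utility, compensate))


if __name__ == "__main__":
    print(pressure_to_disable(0.5, 10.0, compensate=False))
    print(pressure_to_disable(0.5, 10.0, compensate=True))
```

In this toy model the uncompensated agent gains expected utility by disabling the button, while the compensated agent gains nothing either way. Real proposals must handle many complications that this sketch ignores.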
The Minimality principle says that when we are building the first aligned AGI, we should try to do as little as possible, using the least dangerous cognitive computations possible, while still doing what is necessary to prevent the default outcome of the world being destroyed by the first unaligned AGI.
- This informs the proposal of Mild optimization and Taskishness: we are safer if all goals and subgoals of the AI are formulated so that they can be achieved as fully as we would prefer using a bounded amount of effort, and the AI exerts only enough effort to do that.
- This informs the proposal of Behaviorism: it seems that some pivotal-act proposals don’t require the AI to understand and predict humans in great detail, only to master engineering; and it seems that we can head off multiple thorny problems by not having the AI try to model humans or other minds in as much detail as possible.
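The bounded-effort idea behind Mild optimization and Taskishness can be loosely pictured as a satisficing search: stop at the first plan that is good enough, and never spend more than a fixed budget looking. The sketch below is only an analogy under stated assumptions; `satisfice`, `good_enough`, and `max_evaluations` are hypothetical names, and a real mildly optimizing agent would need far more than an early-exit loop.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def satisfice(candidates: Iterable[T],
              score: Callable[[T], float],
              good_enough: float,
              max_evaluations: int) -> Tuple[Optional[T], int]:
    """Return the first candidate scoring at least `good_enough`.

    The search evaluates at most `max_evaluations` candidates and does not
    keep looking for a better plan once an acceptable one is found.
    Returns (plan, evaluations_used); plan is None if the budget ran out.
    """
    evaluations = 0
    for plan in candidates:
        if evaluations >= max_evaluations:
            break  # Effort bound reached: give up rather than search harder.
        evaluations += 1
        if score(plan) >= good_enough:
            return plan, evaluations  # Good enough: stop immediately.
    return None, evaluations


if __name__ == "__main__":
    # Hypothetical task: find any integer plan with score >= 3.
    print(satisfice(range(100), lambda x: float(x), 3.0, 10))
```

The design choice worth noticing is that both limits are explicit: the goal has a finite "done" condition, and the effort spent reaching it is capped regardless of whether that condition is met.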

Please be guarded about declaring things to be ‘principles’ unless they have already informed more than one specific design proposal and more than one person thinks they are a good idea. You could call them ‘proposed principles’ and post them under your own domain if you personally think they are a good idea. There are a lot of possible ‘broad design wishes’, or things that people think are ‘broad design wishes’, and the principles that have actually already informed specific design proposals would otherwise get lost in the crowd.
