This article was originally on the FHI wiki and is being reposted to LW Discussion with permission. All content in this article is credited to Daniel Dewey.

In value loading, the agent will pick the action:

$$\operatorname*{argmax}_{a \in A} \sum_{w \in W} p(w \mid e, a) \sum_{u \in U} u(w)\, p(C(u) \mid w)$$

Here A is the set of actions the agent can take, e is the evidence the agent has already seen, W is the set of possible worlds, and U is the set of utility functions the agent is considering.

The parameter C(u) is some measure of the ‘correctness’ of the utility u, so the term p(C(u)|w) is the probability of u being correct, given that the agent is in world w. A simple example is an AI that completely trusts the programmers: if u is some utility function that claims that giving cake is better than giving death, and $w_1$ is a world where the programmers have said “cake is better than death” while $w_2$ is a world where they have said the opposite, then $p(C(u) \mid w_1) = 1$ and $p(C(u) \mid w_2) = 0$.
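As a rough illustration, the formula can be evaluated by brute force on a tiny version of the cake example. This is only a sketch under stated assumptions: the world encoding (what the programmers said, what the agent gave), the 0.9 credence standing in for the evidence e, and all function and variable names are hypothetical and not part of the original proposal.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical toy encoding: a world is (what the programmers said, what was given).
World = Tuple[str, str]
Utility = Callable[[World], float]


def expected_value(
    action: str,
    worlds: Sequence[World],
    utilities: Dict[str, Utility],
    p_world: Callable[[World, str], float],
    p_correct: Callable[[str, World], float],
) -> float:
    """Compute sum_w p(w|e,a) * sum_u u(w) * p(C(u)|w) for one action."""
    total = 0.0
    for w in worlds:
        inner = sum(u(w) * p_correct(name, w) for name, u in utilities.items())
        total += p_world(w, action) * inner
    return total


def choose_action(actions, worlds, utilities, p_world, p_correct) -> str:
    """Return the action with the highest expected value (the argmax over A)."""
    return max(
        actions,
        key=lambda a: expected_value(a, worlds, utilities, p_world, p_correct),
    )


# Toy instance. Actions are named after the thing the agent gives.
ACTIONS = ["cake", "death"]
WORLDS = [(said, given) for said in ("cake", "death") for given in ("cake", "death")]
UTILITIES = {
    "u_cake": lambda w: 1.0 if w[1] == "cake" else 0.0,
    "u_death": lambda w: 1.0 if w[1] == "death" else 0.0,
}
P_SAID_CAKE = 0.9  # assumed credence that the programmers said "cake is better"


def p_world(w: World, action: str) -> float:
    """p(w|e,a): the action fixes what is given; e is folded into P_SAID_CAKE."""
    said, given = w
    if given != action:
        return 0.0
    return P_SAID_CAKE if said == "cake" else 1.0 - P_SAID_CAKE


def p_correct(name: str, w: World) -> float:
    """p(C(u)|w): the agent fully trusts whatever the programmers said in w."""
    return 1.0 if name == "u_" + w[0] else 0.0


best = choose_action(ACTIONS, WORLDS, UTILITIES, p_world, p_correct)
```

Under these assumed numbers, giving cake scores about 0.9 and giving death about 0.1, so `best` is `"cake"`. Enumerating W and U explicitly is only feasible for toy cases like this one.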

There are several challenging things in this formula:

W : How to define/represent the class of all worlds under consideration

U : How to represent the class of all utility functions over such worlds

C : What exactly we assert about the utility function: that it is true? That humans believe it?

p(C(u)|w) : How to define this probability

$\sum_{u \in U} u(w)$ : How to sum up utility functions (a moral uncertainty problem)

In contrast:

$$\sum_{w \in W} p(w \mid e, a)$$

is mostly the classic AI problem. Predicting what the world is like from evidence is hard, but it is a well-known, well-studied problem and not unique to the present research. There is one subtlety here: the nature of w includes the agent's future actions, which in turn depend on how good future states look to it. This recursive definition eventually bottoms out, as in a game of chess (where what happens when I make a move depends on what moves I make after that). It may cause an additional exponential explosion when evaluating the formula, though, so the agent may need to make probabilistic guesses about its own future behaviour to actually calculate an action.
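One way to picture the recursion is a finite-horizon expectimax search, in which the value of an action now depends on the actions the same procedure would pick later. The sketch below is a generic illustration, not the method of the original article; the `plan`, `observe` and `value` names and the toy payoffs are assumptions made up for the example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A history is a sequence of (action, observation) pairs.
History = Tuple[Tuple[str, str], ...]


def plan(
    history: History,
    horizon: int,
    actions: Sequence[str],
    observe: Callable[[History, str], Dict[str, float]],
    value: Callable[[History], float],
) -> Tuple[float, Optional[str]]:
    """Return (expected value, best next action) by finite-horizon expectimax.

    observe(history, action) maps each possible observation to its probability.
    value(history) scores a completed history; the recursion bottoms out there.
    """
    if horizon == 0:
        return value(history), None
    best_value, best_action = float("-inf"), None
    for a in actions:
        expected = 0.0
        for obs, prob in observe(history, a).items():
            # The future is evaluated assuming the agent keeps choosing this way.
            future, _ = plan(history + ((a, obs),), horizon - 1, actions, observe, value)
            expected += prob * future
        if expected > best_value:
            best_value, best_action = expected, a
    return best_value, best_action


# Hypothetical toy problem: "safe" always pays 1, "risky" pays 3 or -2 evenly.
PAYOFF = {"ok": 1.0, "win": 3.0, "lose": -2.0}
ACTIONS = ("safe", "risky")


def observe(history: History, action: str) -> Dict[str, float]:
    return {"ok": 1.0} if action == "safe" else {"win": 0.5, "lose": 0.5}


def value(history: History) -> float:
    return sum(PAYOFF[obs] for _, obs in history)


best_value, best_action = plan((), 2, ACTIONS, observe, value)
```

The number of histories explored grows exponentially with the horizon, which is the explosion mentioned above. A practical agent would presumably replace the exact inner recursion with an approximate guess about its own future behaviour.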

This value loading equation is not subject to the classical Cake or Death problem, but is vulnerable to the more advanced version of the problem, if the agent is able to change the expected future value of p(C(u)) through its actions.

Daniel Dewey’s Paper

The above idea was partially inspired by a draft of Learning What to Value, a paper by Daniel Dewey. He restricted attention to streams of interactions, and his equation, in a simplified form, is:

$$\operatorname*{argmax}_{a \in A} \sum_{s \in S} p(s \mid e, a) \sum_{u \in U} u(s)\, p(u \mid s)$$

where S is the set of all possible streams of all past and future observations and actions.
