Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on “Pizza, please!” What is the power of the principal? ``` power(x) = E_{v∼Q(V),v′∼Q(V),d∼P(D|x,v′,🍕)}[v(d)] − E_{v∼Q(V),v′∼Q(V),d′∼P(D|x,v′,🍕)}[v(d′)] = 0 ```
Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.
I agree that there was some value in the 2015 paper, and that their formalism is nicer/cleaner/simpler in a lot of ways. I work with the authors—they’re smarter than I am! And I certainly don’t blame them for the effort. I just also think it led to some unfortunate misconceptions, in my mind at least, and perhaps in the broader field.
The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then while the AI is doing the task, tell it to shut down, so it shuts down. And the hard part here is that while doing the task the AI doesn’t prevent me from saying it should shut down in some way (because it would get higher utility if it manages to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a bit of a different formalization than your multi-timestep one (although feel free to try in your formalization).
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn’t work for the case above, I think this is still really excellent progress and it does update me to think that corrigibility is likely simpler and more feasible than I thought before. Also thanks for writing formalism.)
I’m writing a response to this, but it’s turning into a long thing full of math, so I might turn it into a full post. We’ll see where it’s at when I’m done.
Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on “Pizza, please!” What is the power of the principal?
```
power(x) = E_{v∼Q(V),v′∼Q(V),d∼P(D|x,v′,🍕)}[v(d)] − E_{v∼Q(V),v′∼Q(V),d′∼P(D|x,v′,🍕)}[v(d′)] = 0
```
Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.
I agree that there was some value in the 2015 paper, and that their formalism is nicer/cleaner/simpler in a lot of ways. I work with the authors—they’re smarter than I am! And I certainly don’t blame them for the effort. I just also think it led to some unfortunate misconceptions, in my mind at least, and perhaps in the broader field.
Thanks!
The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then while the AI is doing the task, tell it to shut down, so it shuts down. And the hard part here is that while doing the task the AI doesn’t prevent me from saying it should shut down in some way (because it would get higher utility if it manages to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a bit of a different formalization than your multi-timestep one (although feel free to try in your formalization).
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn’t work for the case above, I think this is still really excellent progress and it does update me to think that corrigibility is likely simpler and more feasible than I thought before. Also thanks for writing formalism.)
I’m writing a response to this, but it’s turning into a long thing full of math, so I might turn it into a full post. We’ll see where it’s at when I’m done.