abramdemski comments on Artificial Intelligence: A Modern Approach (4th edition) on the Alignment Problem

abramdemski 18 Sep 2020 14:41 UTC
Maybe it’s better phrased as “a CIRL agent has a positive incentive to allow shutdown iff it’s uncertain [or the human has a positive term for it being shut off]”, rather than saying that “a machine” has a positive incentive iff it’s uncertain.
I would further charitably rewrite it as:
“In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective.”
A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about the human’s values. So the “iff” holds only for the particular incentive examined.
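To make the distinction concrete, here is a toy version of the off-switch setup. It is my own simplified sketch, not the book’s formulation. Assumptions: the robot’s belief about the human’s utility U for its proposed action is a list of equally likely values; switching off is worth `shutdown_bonus` to the human (a hypothetical term for the human terminally valuing shutdown, 0 by default); and a rational human, when deferred to, picks whichever of “let it act” or “switch it off” is better for them. The examined incentive is the value of deferring minus the best direct option.

```python
def expected(belief, f):
    """Expectation of f(U) under a belief given as equally likely values of U."""
    return sum(f(u) for u in belief) / len(belief)


def off_switch_values(belief, shutdown_bonus=0.0):
    """Expected human utility of the robot's three options (toy model).

    shutdown_bonus is a hypothetical extra utility the human gets from
    the robot being switched off (0 means no terminal value for shutdown).
    """
    act = expected(belief, lambda u: u)  # robot just acts
    off = shutdown_bonus                 # robot switches itself off
    # Robot defers; a rational human chooses the better of acting or shutdown.
    defer = expected(belief, lambda u: max(u, shutdown_bonus))
    return act, off, defer


def deference_incentive(belief, shutdown_bonus=0.0):
    """How much better deferring is than the best option available without the human."""
    act, off, defer = off_switch_values(belief, shutdown_bonus)
    return defer - max(act, off)


# Certain about U: the examined incentive is zero.
print(deference_incentive([0.5]))         # 0.0
# Uncertain, with U possibly negative: the incentive is positive.
print(deference_incentive([-1.0, 2.0]))   # 0.5
# Certain about U, but the human values shutdown: the incentive is still zero,
# yet being switched off (1.0) beats acting (0.5), so the agent allows shutdown.
print(off_switch_values([0.5], shutdown_bonus=1.0))  # (0.5, 1.0, 1.0)
```

The last case is the point above: the “iff” is about the deference incentive specifically, while willingness to be shut off can come from the human’s objective itself.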