Maybe it’s better phrased as “a CIRL agent has a positive incentive to allow shutdown iff it’s uncertain [or the human has a positive term for it being shut off]”, rather than saying that “a machine” has a positive incentive iff it’s uncertain.
I would further charitably rewrite it as:
“In chapter 16, we analyze an incentive which a CIRL agent has to allow itself to be switched off. This incentive is positive if and only if it is uncertain about the human objective.”
A CIRL agent should be capable of believing that humans terminally value pressing buttons, in which case it might allow itself to be shut off despite being 100% sure about human values. So the “iff” applies only to the particular incentive examined, not to every reason the agent might have to allow shutdown.
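
To make the distinction concrete, here is a toy sketch in the spirit of the off-switch game. It is my own illustration, not the book's formalism, and everything in it is an assumption: the function `incentive_to_defer`, the parameter `button_value` (a stand-in for the human terminally valuing pressing the button), a perfectly rational human, and a payoff of 0 when the robot switches itself off. Note that "uncertain" here has to mean uncertain in a way that could change the human's decision.

```python
def incentive_to_defer(belief, button_value=0.0):
    """Toy model: how much better deferring to the human is than acting alone.

    belief: list of (probability, utility) pairs, the robot's belief about
            the human's utility for the robot's proposed action.
    button_value: hypothetical terminal value the human gets from pressing
            the off switch (0.0 means the human does not care about pressing).
    """
    # Robot acts without asking: it gets the expected utility of the action.
    expected_if_act = sum(p * u for p, u in belief)
    # Robot's best unilateral option: act, or switch itself off (assumed worth 0).
    best_unilateral = max(expected_if_act, 0.0)
    # Robot defers: an assumed-rational human presses the button exactly
    # when pressing is worth more to them than letting the action happen.
    expected_if_defer = sum(p * max(u, button_value) for p, u in belief)
    return expected_if_defer - best_unilateral


certain = [(1.0, 1.0)]                 # robot is sure the action is good
uncertain = [(0.5, 1.0), (0.5, -1.0)]  # action might be good or bad

print(incentive_to_defer(certain))                    # 0.0
print(incentive_to_defer(uncertain))                  # 0.5
print(incentive_to_defer(certain, button_value=2.0))  # 1.0
```

With `button_value=0.0` the incentive is positive only in the uncertain case, which matches the “iff” reading. The last call shows the exception: a fully certain agent still gains from allowing shutdown if it believes the human values the button press itself.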