I’m not entirely sure what you consider to be a “bad” reason for crossing the bridge. However, I’m having a hard time finding a way to define it that makes agents using evidential counterfactuals necessarily fail without also making other agents fail.
One way to define a “bad” reason is as an irrational one (or as use of the chicken rule). However, if this is what is meant by a “bad” reason, it seems like an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.
To illustrate, consider what I would do if I were in the troll bridge situation and used evidential counterfactuals. I would reason, “I know the troll will only blow up the bridge if I cross for a bad reason, but I’m generally pretty reasonable, so I think I’ll do fine if I cross”. And then I’d stop thinking about it. I know that certain agents, given enough time to think about it, would end up not crossing, so I’d just make sure I didn’t do that.
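To make that a bit more concrete, here is a toy sketch of an agent that simply caps how long it deliberates before acting. The names `DELIBERATION_BUDGET` and `refine_estimate` are illustrative placeholders I’m supplying, not anything from the original problem setup.

```python
# Toy sketch of an agent that deliberately bounds how long it thinks before
# acting, so it never completes a long chain of reasoning against crossing.

DELIBERATION_BUDGET = 10  # illustrative cap on reasoning steps

def refine_estimate(estimate, step):
    # Placeholder for one step of further reflection about the consequences
    # of crossing; here it simply returns the current estimate unchanged.
    return estimate

def act():
    cross_eu = 5.0  # initial impression: crossing looks fine
    for step in range(DELIBERATION_BUDGET):
        cross_eu = refine_estimate(cross_eu, step)
    # Stop thinking here, rather than reasoning long enough to construct a
    # self-fulfilling argument against crossing.
    return 'Cross' if cross_eu > -10 else 'Stay'

print(act())  # 'Cross' with these placeholder numbers
```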
Another definition you might have had in mind is that a “bad” reason is either the chicken rule, or one for which the action the AI takes results in a provably bad outcome despite the AI thinking the action would result in a good outcome. However, if this is the case, it seems to me that no agent would be able to cross the bridge without it being blown up, unless the agent’s counterfactual environment in which it didn’t cross scored less than −10 utility. But this doesn’t seem like a very reasonable counterfactual environment.
To see why, consider an arbitrary agent with the following decision procedure. Let `counterfactual` be an arbitrary specification of what would happen in some counterfactual world.
```python
def act():
    # Evaluate each action in its counterfactual environment.
    cross_eu = expected_utility(counterfactual('A = Cross'))
    stay_eu = expected_utility(counterfactual('A = Stay'))
    # Cross iff crossing looks strictly better; otherwise stay.
    if cross_eu > stay_eu:
        return 'Cross'
    return 'Stay'
```
The chicken rule can be added, too, if you wish. I’ll assume the expected utility of staying is greater than −10.
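For concreteness, here is a minimal sketch of one common way the chicken rule could be bolted onto `act()`: if the agent proves it doesn’t take some action, it takes that action anyway. The `proves`, `counterfactual`, and `expected_utility` functions below are placeholder stubs supplied only so the example is self-contained.

```python
def proves(statement):
    # Placeholder provability check; a real agent would run a (bounded)
    # proof search for `statement` in its theory, e.g. PA.
    return False

def counterfactual(statement):
    # Placeholder specification of the counterfactual world in which
    # `statement` holds.
    return statement

def expected_utility(world):
    # Placeholder expected-utility estimates for each counterfactual world.
    return {'A = Cross': 5.0, 'A = Stay': 0.0}[world]

def act():
    # Chicken rule: if the agent proves it doesn't take some action (with two
    # exhaustive actions, proving 'A = Stay' means it provably doesn't cross,
    # and vice versa), it takes that action anyway.
    if proves('A = Stay'):
        return 'Cross'
    if proves('A = Cross'):
        return 'Stay'
    # Otherwise decide by comparing counterfactual expected utilities,
    # exactly as in the procedure above.
    cross_eu = expected_utility(counterfactual('A = Cross'))
    stay_eu = expected_utility(counterfactual('A = Stay'))
    return 'Cross' if cross_eu > stay_eu else 'Stay'

print(act())  # 'Cross' with these placeholder numbers
```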
Then it seems you can adapt the proof you gave for your agent to show that an arbitrary agent satisfying the above description would also get −10 utility if it crossed. Specifically:
- Suppose $\square (A = Cross \implies U = -10)$.
  - Suppose $A = Cross$.
    - Then the agent crossed either because of the chicken rule, because the counterfactual environment in which it crossed had expected utility greater than −10, or because the counterfactual environment in which it didn’t cross had expected utility less than −10. We assumed the counterfactual environment in which the agent doesn’t cross has expected utility greater than −10, so the reason must be either the chicken rule or that crossing had expected utility greater than −10.
    - If it’s because of the chicken rule, then this is a “bad” reason, so the troll will destroy the bridge just as in the original proof. Thus, utility would equal −10.
    - Suppose instead the agent crosses because `expected_utility(counterfactual('A = Cross')) > -10`. However, by the assumption, $\square (A = Cross \implies U = -10)$. Thus, since the agent actually crosses, its action in fact provably results in −10 utility, and the AI is wrong in thinking it would get a good outcome. The AI’s action therefore results in a provably bad outcome, so the troll destroys the bridge. Thus, utility would equal −10.
  - Thus, $A = Cross \implies U = -10$.
- Thus, $\square (A = Cross \implies U = -10) \implies (A = Cross \implies U = -10)$.
- Thus, by Löb’s theorem, $\square (A = Cross \implies U = -10)$, and so the agent would in fact get −10 utility if it crossed.
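For reference, here is the form of Löb’s theorem being used in that last step, writing $P$ for the statement $A = Cross \implies U = -10$:

$$\vdash (\square P \implies P) \quad\text{implies}\quad \vdash P.$$

The derivation above establishes $\square P \implies P$ (provably), which is exactly the hypothesis, so Löb’s theorem yields $\square P$.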
As I said, you could potentially avoid getting the bridge destroyed by assigning expected utility less than −10 to the counterfactual environment in which the AI doesn’t cross. This seems like a “silly” counterfactual environment, so it doesn’t seem like something we would want an AI to think. Also, since it seems like a silly thing to think, a troll may consider the use of such a counterfactual environment to be a bad reason to cross the bridge, and thus destroy it anyway.
I’ve made a few posts that seem to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could be valuable contributions. And if they aren’t valid, then knowing why could help me a lot in my future efforts toward contributing to AI safety.
The posts are:
My critique of a published impact measure.
Manual alignment
Alignment via reverse engineering