Let V be the set of worlds in which X happens, and let W be the set of worlds in which X and Y both happen. Since Y is very unlikely, P(W) << P(V) (however, P(W|message read) is roughly P(V|message read)). The AI gets utility u' = u|V (the utility in the non-V worlds is constant, which we may as well set to zero).
Then, if the AI is motivated to maximise u' (assume for the moment that it can't affect the probability of X), it will assume it is in the set V and essentially ignore W. To use your terminology: u(Z|X) is low or negative and u(Z|X,Y) is high, but P(Y|X)*u(Z|X,Y) is low, so it will likely not do Z.
Then, after it notices the message is read, it shifts to assuming Y happened—equivalently, that it is in the world set W. When doing so, it knows that it is almost certainly wrong—that it’s more likely in a world outside of V entirely where neither X nor Y happened—but it still tries, on the off-chance that it’s in W.
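To make the shift concrete, here is a minimal numerical sketch of the argument above. All probabilities and utilities are hypothetical placeholders; the point is just that conditioning on "message read" flips the sign of the expected u' of doing Z:

```python
# Hypothetical numbers illustrating the before/after shift.
# Within V (utility outside V is zero), the AI compares the expected
# u' of doing Z against not doing Z (baseline 0).

p_y_given_x = 0.01   # P(Y|X): Y is very unlikely a priori
u_z_given_x = -1.0   # u(Z|X): low/negative
u_z_given_xy = 50.0  # u(Z|X,Y): high

# Before observing that the message was read:
eu_z_before = (1 - p_y_given_x) * u_z_given_x + p_y_given_x * u_z_given_xy
# 0.99 * (-1) + 0.01 * 50 = -0.49  -> worse than not doing Z

# After observing "message read", almost all the V-probability sits in W
# (P(W|message read) ~ P(V|message read)), so within V the AI acts as if
# Y happened with high probability (0.99 here is a hypothetical value):
p_y_after = 0.99
eu_z_after = (1 - p_y_after) * u_z_given_x + p_y_after * u_z_given_xy
# 0.01 * (-1) + 0.99 * 50 = 49.49  -> now doing Z looks best

print(eu_z_before, eu_z_after)
```

So the same utility function u' recommends against Z beforehand and for Z afterwards, even though the AI knows it is probably outside V entirely.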
However, since it’s an oracle, we turn it off before that point. Or we use corrigibility to change its motivations.
Again, maybe I’m misunderstanding something—but it sounds as if you’re agreeing with me: once the AI observes evidence suggesting that its message has somehow been read, it will infer (or at least act as if it has inferred) Y and do Z.
I thought we were exploring a disagreement here; is there still one?
I think there is no remaining disagreement—I just want to emphasise that before the AI observes such evidence, it will behave the way we want.