I reason as follows:
Omega inspires belief only after the agent encounters Omega.
According to UDT, the agent should not update its policy based on this encounter; it should simply follow the policy it already committed to.
Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).
I think either:
the agent does update, in which case, why not update on the result of the coin-flip? or
the agent doesn’t update, in which case, what matters is simply the optimal policy given the original prior.
Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.
I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent's ability to learn about U via some observations when taking null actions (i.e., A and U share some descendant(s), D, and the agent knows something about P(D | U, A=null)).
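As a minimal sketch of that setup (all the concrete numbers and distributions here are my own illustrative assumptions, not from the comment): U is a root node, the agent takes the null action, and it learns about U only through noisy observations of a shared descendant D, via Bayesian updating on P(D | U, A=null).

```python
# Hypothetical sketch: a utility node U with no parents, whose value the
# agent can only learn about via a shared descendant D, observed while
# taking the null action.
import random

random.seed(0)

U_VALUES = [0.0, 1.0]   # possible values of the utility parameter U
u_true = 1.0            # the actual (hidden) value of U

def p_d_given_u(d, u, action):
    """Assumed likelihood P(D | U, A): with A=null, D is a noisy reading of U."""
    assert action == "null"
    return 0.8 if d == u else 0.2

# uniform prior over U
belief = {u: 0.5 for u in U_VALUES}

# observe a few descendants D under the null action and update by Bayes' rule
for _ in range(10):
    d = u_true if random.random() < 0.8 else 1.0 - u_true
    belief = {u: belief[u] * p_d_given_u(d, u, "null") for u in U_VALUES}
    z = sum(belief.values())
    belief = {u: p / z for u, p in belief.items()}

print(belief)  # belief concentrates on u_true
```

Under these assumptions the agent's posterior collapses toward the true value of U without the agent ever acting, which is the sense in which null actions alone suffice for learning.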
RE: the last bit, it seems like you can define learning from manipulating in a straightforward way similar to what is proposed here. The intuition is that the human's belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at timestep t) if the probability H assigns to u* is higher given an agent's action a than it would be given the null action, i.e.
P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = … = null).
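The criterion above can be sketched concretely (the belief-update model, noise levels, and function names here are my own assumptions for illustration): the human H does Bayesian updates on observations whose informativeness depends on the AI's first action A_0, and the action a "accelerates learning" at timestep t if it raises H's posterior on u* relative to the all-null policy.

```python
# Hypothetical sketch of the acceleration criterion. H's posterior on u*
# after t observations is modeled deterministically via the expected
# likelihood ratio; "noise" is the per-observation error rate, which the
# AI's action a is assumed to reduce (e.g. by demonstrating or clarifying).

def posterior_on_u_star(noise, t, prior=0.5):
    """H's posterior P_H_t(u*) after t observations, each correct w.p. 1 - noise."""
    p_u, p_not = prior, 1.0 - prior
    for _ in range(t):
        # deterministic expected update: weight u* by the chance of a
        # correct observation, the alternative by the chance of an error
        p_u *= (1.0 - noise)
        p_not *= noise
    return p_u / (p_u + p_not)

t = 5
# A_0 = a: the AI's action makes H's observations less noisy
p_given_a = posterior_on_u_star(noise=0.1, t=t)
# A_0 = A_1 = ... = null: H learns on its own, with noisier observations
p_given_null = posterior_on_u_star(noise=0.3, t=t)

# the criterion: P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = ... = null)
print(p_given_a > p_given_null)  # True under these assumptions
```

Note this only checks the inequality at a single timestep t; under the "literally collapsing" assumption one would want it to hold for the relevant t (or all t) rather than at one point.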