If I consider it likely that aliens created copies of me which are just like Earth-me but are going to see something completely different in the next hour, then it seems entirely rational for me to seriously consider the possibility that I’m not on Earth (and that therefore I am going to see weird things in the next hour). On the other hand, as you correctly observe in the part about importance weighting, if Earth-me has a much better chance of having a large impact than the other copies, then I should behave as if I am Earth-me. This doesn’t require defining an importance weighting by hand. It is enough that the agent is a consequentialist with the correct utility function.

The above reasoning doesn’t really solve the problem, but rather moves it to a different place. How do we construct a consequentialist with the correct utility function? IMO, it is plausible that this can be solved using something like IRL. However, then we fall into the trap again. In IRL, the utility function is inferred by observing a “teacher” agent. If the aliens can pervert the agent’s predictions concerning the teacher agent, they can pervert the utility function.

I think it is useful to think of the problem as having two tiers: In the first tier, we need to make sure the posterior probability of the correct hypothesis is in the same ballpark as the probabilities of the malicious hypotheses. In the second tier, we need to correctly deal with uncertainty assuming both the correct and the malicious hypotheses appear with non-negligible weights in the posterior.

To address the first tier, we need something like an anthropic update. Defining the anthropic update is tricky, but we can address it indirectly by (i) allowing the agent to use its own source code with low weight in the complexity count, so that hypotheses of the form “look for a pointer in spacetime where this source code exists” become much simpler, and maybe (ii) providing models of physics, or even the agent’s bridge rules, that again can be used without a large complexity penalty.
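As a rough illustration of point (i), here is a toy description-length count in Python. Everything in it is an assumption made for the example: the component names, the bit costs, the `cheap_bits` parameter, and the idea that a hypothesis is just a list of components. It only shows how letting the source code count for a few bits changes the prior odds between a “correct” and a “malicious” hypothesis under a 2^-length prior; it is not a definition of the anthropic update.

```python
# Toy description-length accounting. All component names and bit costs are
# hypothetical and chosen only for illustration.
COMPONENT_BITS = {
    "physics_model": 40,
    "agent_source_code": 60,
    "pointer_to_source_in_spacetime": 10,
    "alien_simulators": 20,
}

HYPOTHESES = {
    "correct": ["physics_model", "agent_source_code",
                "pointer_to_source_in_spacetime"],
    "malicious": ["physics_model", "alien_simulators"],
}


def description_length(components, cheap=frozenset(), cheap_bits=2):
    """Total bits for a hypothesis; components in `cheap` cost only `cheap_bits`."""
    return sum(cheap_bits if c in cheap else COMPONENT_BITS[c]
               for c in components)


def log2_prior_odds(cheap=frozenset()):
    """log2 of prior(correct) / prior(malicious) under a 2^-length prior."""
    return (description_length(HYPOTHESES["malicious"], cheap)
            - description_length(HYPOTHESES["correct"], cheap))


print(log2_prior_odds())                             # -50
print(log2_prior_odds(cheap={"agent_source_code"}))  # 8
```

With these made-up numbers the correct hypothesis goes from 50 bits behind to roughly level with the malicious one, which is all the first tier asks for. Point (ii) could be modeled the same way by adding further components to `cheap`.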

To address the second tier, we can try to create a version of IRL that extracts instrumental values. That is, consider the agent’s beliefs about the teacher’s behavior at time t. For some values of t, the agent has high certainty because both the correct and the malicious hypotheses coincide. For other values of t, these hypotheses diverge and uncertainty results. Importantly, the latter case cannot happen for all values of t all the time, since each time the teacher’s behavior on a “problematic” t is observed, the malicious hypothesis is penalized. Presumably, the attackers will design the malicious hypothesis to diverge from the correct hypothesis only for sufficiently late values of t, so that we cannot mount a defense just by having a large time span of passive observation. Now, imagine that you are running IRL while constraining the time discount function so that times with high uncertainty are strongly discounted. I consider it plausible that such a procedure can learn the instrumental goals of the teacher for the time span in which uncertainty is low. Optimizing for these instrumental goals should lead to desirable behavior (modulo other problems that are orthogonal to this acausal attack).
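To make the discounting idea a little more concrete, here is a minimal Python sketch under strong simplifying assumptions: only two hypotheses, a binary teacher action, a hypothetical divergence time `T_DIVERGE`, and posterior variance of the predicted action as the measure of uncertainty. The function names and the specific discount formula are illustrative choices, and the sketch does not implement IRL itself; it only computes the per-time weights such a procedure might use.

```python
T_DIVERGE = 5  # hypothetical time at which the malicious hypothesis starts to differ

# Each hypothesis maps a time t to P(teacher takes action 1 at time t).
PREDICT = {
    "correct": lambda t: 1.0,
    "malicious": lambda t: 1.0 if t < T_DIVERGE else 0.0,
}


def bayes_update(weights, likelihoods):
    """Multiply each hypothesis weight by its likelihood and renormalize."""
    unnorm = {h: weights[h] * likelihoods[h] for h in weights}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


def disagreement(weights, t):
    """Posterior variance of the predicted action probability (at most 0.25)."""
    mean = sum(weights[h] * PREDICT[h](t) for h in weights)
    return sum(weights[h] * (PREDICT[h](t) - mean) ** 2 for h in weights)


def discount(weights, t, gamma=0.95):
    """Geometric discount, scaled toward 0 where the hypotheses disagree."""
    return gamma ** t * (1.0 - 4.0 * disagreement(weights, t))


weights = {"correct": 0.5, "malicious": 0.5}
print([round(discount(weights, t), 3) for t in range(8)])

# Observing the teacher take action 1 at a "problematic" time penalizes the
# malicious hypothesis; the 0.9 / 0.1 likelihoods assume a noisy observation.
weights = bayes_update(weights, {"correct": 0.9, "malicious": 0.1})
print([round(discount(weights, t), 3) for t in range(8)])
```

Before the observation, times from `T_DIVERGE` onward get zero weight because the two hypotheses are equally likely and fully disagree. After the malicious hypothesis is penalized, those times regain some weight.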

Agree that IRL doesn’t solve this problem (it just bumps it to another level).

The second tier thing sounds a lot like KWIK learning. I think this is a decent approach if we’re fine with only learning instrumental goals and are using a bootstrapping procedure.

KWIK learning is definitely related in the sense that we want to follow a “conservative” policy that is risk averse w.r.t. its uncertainty regarding the utility function, which is similar to how KWIK learning doesn’t produce labels about which it is uncertain. Btw, do you know which of the open problems in the Li-Littman-Walsh paper are solved by now?
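For readers unfamiliar with KWIK, the “doesn’t produce labels about which it is uncertain” behavior can be sketched in Python for the simplest case: a finite class of deterministic hypotheses, where the learner answers only when all surviving hypotheses agree. The class name, method names, and the threshold hypotheses below are assumptions made for this example, not anything taken from the thread.

```python
class EnumerationKWIK:
    """KWIK-style learner over a finite set of deterministic hypotheses."""

    def __init__(self, hypotheses):
        # Hypotheses still consistent with every observation so far.
        self.version_space = list(hypotheses)

    def predict(self, x):
        """Return the agreed label, or None ("I don't know") on disagreement."""
        labels = {h(x) for h in self.version_space}
        return labels.pop() if len(labels) == 1 else None

    def observe(self, x, y):
        """After an "I don't know", use the true label to prune hypotheses."""
        self.version_space = [h for h in self.version_space if h(x) == y]


# Hypothetical hypothesis class: threshold rules x >= k for k in {1, 2, 3}.
learner = EnumerationKWIK([lambda x, k=k: x >= k for k in (1, 2, 3)])

print(learner.predict(5))  # all thresholds agree, so a label is returned
print(learner.predict(2))  # thresholds disagree, so the learner abstains
learner.observe(2, True)   # the true label rules out the threshold k = 3
print(learner.predict(2))  # the survivors now agree on this input
```

Each abstention followed by an observation removes at least one hypothesis, so the number of “I don’t know” answers is bounded by the size of the class minus one. A conservative policy in the sense above would treat an abstention as a reason to avoid acting on the uncertain utility estimate.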

I don’t know which open problems have been solved.