I don’t quite understand why AIXI would protect its memory from modification, or at least not why it would reason as you describe (though I concede that I’m quite likely to be missing something).
In what sense can AIXI perceive a memory corruption as corresponding to the universe “changing”? For example, if I change the agent’s memory so that one Grue never existed, it seems like it perceives no change: starting in the next step it believes that the universe has always been this way, and will restrict its attention to world models in which the universe has always been this way (not in which the memory alteration is causally connected to the destruction of a Grue, or indeed in which the memory alteration has any effect at all on the rest of the world).
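To make this concrete, here is a toy sketch (nothing like real AIXI, which mixes over all computable models) of the point: a learner that conditions on its full remembered history keeps only the world models consistent with that history, so an edited memory leaves no trace of the edit. The model names and observation sequences are hypothetical illustrations.

```python
# Toy illustration: each "world model" predicts the full sequence of
# grue-count observations it would produce. The agent keeps only the
# models whose predictions match its (possibly edited) memory.
world_models = {
    "grue_created_at_t2": [1, 2, 2, 2],   # a second Grue appears at step 2
    "always_one_grue":    [1, 1, 1, 1],   # the universe always had one Grue
}

def consistent_models(history):
    """Return the names of models whose predictions match the history."""
    return [name for name, preds in world_models.items()
            if preds[:len(history)] == history]

# Before the memory edit, the agent remembers the second Grue appearing:
print(consistent_models([1, 2, 2]))   # ['grue_created_at_t2']

# After we overwrite its memory so the Grue "never existed", the agent
# conditions on the edited history and keeps only models in which the
# universe was always that way -- the edit itself is invisible:
print(consistent_models([1, 1, 1]))   # ['always_one_grue']
```

The point of the sketch is that nothing in the agent's posterior connects the edit at the later time to the change in its record of the earlier time.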
It seems like your discussion presumes that a memory modification at time T somehow affects your memories of times near T, so that you can somehow causally associate the inconsistencies in your world model with the memory alteration itself.
I suppose you might hope for AIXI to learn a mapping between memory locations and sense perceptions, so that it can learn that the modification of memory location X leads causally to flipping the value of its input at time T (say). But this is not a model which AIXI can even support without learning to predict its own behavior (which I have just argued leads to nonsensical behavior), because the modification of the memory location occurs after time T, so generally depends on the AI’s actions after time T, so is not allowed to have an effect on perceptions at time T.
I’m assuming a situation where we’re not able to make a completely credible alteration. Maybe the AIXI’s memories about the number of grues go: 1 grue, 2 grues, 3 grues, 3 grues, 5 grues, 6 grues… and it knows of no mechanisms to produce two grues at once (in its “most likely” models), and other evidence in its memory is consistent with there being 4 grues, not 3. So it can figure out that there are particular odd moments where the universe seems to behave in odd ways, unlike most moments. And then it may figure out that these odd moments are correlated with human action.
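The "odd moments" idea can be sketched as a simple anomaly scan, assuming (hypothetically) that the agent's most likely models allow exactly one new grue per step. The count sequence is the one from the comment above; the flagged indices are the steps whose transitions no allowed mechanism can produce.

```python
# Hypothetical sketch: flag remembered transitions that no mechanism in
# the agent's "most likely" models can explain (here, only +1 grue per
# step is allowed).
remembered_counts = [1, 2, 3, 3, 5, 6]

def odd_moments(counts, allowed_step=1):
    """Return indices t where counts[t-1] -> counts[t] is unexplainable."""
    return [t for t in range(1, len(counts))
            if counts[t] - counts[t - 1] != allowed_step]

print(odd_moments(remembered_counts))   # [3, 4]: the repeated 3 and the jump to 5
```

Both flagged steps are exactly where an honest record of 4 grues was overwritten with 3, which is what lets the agent localize the tampering in time.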
ETA: misunderstood the parent. So it might think our actions made a grue, and would enjoy being told horrible lies which it could disprove. Except I don’t know how this interacts with Eliezer’s point.
Why are these odd moments correlated with human action? I modify the memory at time 100, changing a memory of what happened at time 10. AIXI observes something happen at time 10, and then a memory modification at time 100. Perhaps AIXI can learn a mapping between memory locations and instants in time, but it can’t model a change which reaches backwards in time (unless it learns a model in which the entire history of the universe is determined in advance, and just revealed sequentially, in which case it has learned a good enough self-model to stop caring about its own decisions).
I was suggesting that if the time difference wasn’t too large, the AIXI could deduce “humans plan at time 10 to press button” → “weirdness at time 10 and button pressed at time 100”. If it’s good at modelling us, it may be able to deduce our plans long before we do, and as long as the plan predates the weirdness, it can model the plan as causal.
Or if it experiences more varied situations, it might deduce “no interactions with humans for long periods” → “no weirdness”, and act in consequence.
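This second deduction is just a correlation check, which can be sketched as follows; the timelines below are hypothetical illustrations, not anything derived from AIXI itself.

```python
# Hypothetical sketch: compare remembered "weirdness" against periods of
# human interaction, step by step.
human_present = [True, True, False, False, True, False]
weirdness     = [True, False, False, False, True, False]

# Does weirdness ever occur during a period with no humans around?
weirdness_without_humans = any(w and not h
                               for h, w in zip(human_present, weirdness))
print(weirdness_without_humans)   # False: "no humans" -> "no weirdness" holds
```

If that pattern holds across enough varied history, the agent could act on the regularity without ever needing a model in which the later memory edit reaches backwards in time.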