I am confused about how the mechanisms and desiderata you lay out here can yield meaningfully different predictions over a complete space of environments. Maybe it is possible to address this problem separately.
In particular, imagine the following environments:
E1: the outcome is deterministically 0 at even time steps and 1 at odd time steps.
E2: the outcome is deterministically 0 at even time steps and 1 at odd time steps up to step 200, then is drawn randomly based on some uncomputable process.
E3: the outcome is drawn deterministically based on the action taken, in a way that happens to give 0 for the first 100 even-step actions and 1 for the first 100 odd-step actions.
All of these deterministically predict all of the first 200 observations with probability 1. My intuition is that, given that set of 200 observations, you should favor E1, but I don’t see how your update rule makes that possible without some prior measure over environments or some notion of Occam’s Razor.
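To make this concrete, here is a minimal sketch of the worry (my own construction, not code from the post; the environment functions, the uniform prior, and the pseudo-random stand-in for the uncomputable process are all illustrative assumptions). A likelihood-only update leaves the three environments exactly tied, because each assigns probability 1 to the observed sequence:

```python
import random

# Three environments, as defined above, that agree on the first 200 outcomes.
def e1(t, action=None):
    # Deterministic forever: 0 at even steps, 1 at odd steps.
    return t % 2

def e2(t, action=None):
    # Matches E1 up to step 200, then random; the uncomputable process
    # is stubbed with a pseudo-random draw, since it can't be implemented.
    return t % 2 if t < 200 else random.randint(0, 1)

def e3(t, action):
    # Depends on the action, but happens to reproduce E1's outcomes
    # on the first 200 steps for whatever actions were actually played.
    return t % 2 if t < 200 else action % 2

observations = [t % 2 for t in range(200)]

# Likelihood-only update: a deterministic environment that matched the
# data assigns the whole sequence probability 1, so all three survive
# with likelihood exactly 1 and the posterior equals the prior.
likelihoods = {
    name: float(all(env(t, action=0) == obs for t, obs in enumerate(observations)))
    for name, env in [("E1", e1), ("E2", e2), ("E3", e3)]
}
prior = {"E1": 1 / 3, "E2": 1 / 3, "E3": 1 / 3}
unnorm = {h: prior[h] * likelihoods[h] for h in prior}
total = sum(unnorm.values())
print({h: w / total for h, w in unnorm.items()})  # uniform: no update happens
```

Whatever favors E1 here has to come from the prior weights, which is exactly the Occam-style ingredient I don’t see in the update rule.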
In the examples you give there are systematic differences between the environments, but it isn’t clear to me how the update is handled “locally” for environments that give the same predictions for all observed actions but diverge in the future, which seems sticky in practice.
I think I see what I was confused about, which is that there is a specific countable family of properties, and these properties are discrete, so you aren’t worried about locally distinguishing between hypotheses.
Can you elaborate on what you meant by locally distinguishing between hypotheses?
I mean distinguishing between hypotheses that give very similar predictions—like the difference between a coin coming up heads 50% vs. 51% of the time.
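To illustrate (my own arithmetic, not from the post): the expected evidence per flip for separating two nearby coins is the KL divergence between them, which is tiny for 50% vs. 51%, so distinguishing them takes on the order of tens of thousands of flips:

```python
import math

# Expected evidence per flip (in nats) for telling a 51% coin from a
# fair one is the KL divergence between the two Bernoulli distributions.
p, q = 0.51, 0.50  # illustrative biases
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(f"nats of evidence per flip: {kl:.6f}")  # ~0.0002
print(f"flips for ~3 nats (about 20:1 odds): {3 / kl:.0f}")  # ~15000
```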
As I said in my other comment, I think the assumption that you have discrete hypotheses is what I was missing.
Though for any countable set of hypotheses, you can expand that set by prepending a finite sequence of deterministic outcomes for the first several actions. The limit of this expansion is still countable, and the set of hypotheses that assign probability 1 to your observations is the same at every time step. In this case I’m confused about (1) whether this set of hypotheses is discrete and (2) whether the hypotheses with shorter deterministic prefixes assign enough probability to allow meaningful inference anyway.
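Here is a sketch of the construction I have in mind (my own code; the name H_k, the uniform-random base hypothesis, and the geometric prior over prefix lengths are all assumptions for illustration). H_k fixes the first k outcomes deterministically to match the data and then defers to a maximally uninformative base hypothesis:

```python
# Prefix-expansion of a hypothesis class: H_k predicts the first k
# outcomes deterministically (here, matching the data: 0/1 alternating),
# then falls back to a base hypothesis assigning probability 1/2 to
# every later outcome.

def likelihood(k, observations):
    """Probability H_k assigns to the observed sequence."""
    n = len(observations)
    if any(obs != t % 2 for t, obs in enumerate(observations[:k])):
        return 0.0  # a fixed outcome disagreed with the data
    return 0.5 ** max(n - k, 0)  # base hypothesis covers the rest

observations = [t % 2 for t in range(200)]

# Assumed prior: geometric in prefix length k, weight 2^-(k+1),
# truncated at k = 400 for tractability.
ks = range(0, 401)
prior = {k: 2.0 ** -(k + 1) for k in ks}
post = {k: prior[k] * likelihood(k, observations) for k in ks}
total = sum(post.values())
post = {k: p / total for k, p in post.items()}

# With this prior the 2^-k penalty exactly offsets the factor-of-2
# likelihood gain per extra deterministic step, so every k <= 200 ties;
# a lighter-tailed prior would let the long, overfit prefixes dominate.
print(post[0], post[100], post[200])
```

Under this particular prior the short-prefix hypotheses keep a constant share of the mass, but that sits on a knife edge, which is why I suspect the answer to (2) depends entirely on the choice of prior.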
I may mostly be confused about more basic statistical inference things that don’t have to do with this setting.