IMO there are two reasons why finite-state MDPs are useful.
First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).
Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where “features” in this case corresponds to a mapping from histories to states. Now, of course there needn’t be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with approximate models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.
Regarding regret bounds, I don’t think regret bounds are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time picks an action on its own, we cannot be sure it is acting optimally.
Certainly many problems can be captured already within this simple setting.
Definitely. But I think many of the difficulties with general intelligence are not captured in the simple setting. I certainly don’t want to say there’s no place for MDPs.
continuous MDPs
I don’t quite know what to think of continuous MDPs. I’ll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it’s not a whole lot more powerful than the finite-state MDP formalism.
Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where “features” in this case corresponds to a mapping from histories to states.
Yeah, I think there’s been some good progress on this. But the upshot of those MDP techniques is mainly to not search through same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there’s a good chance that we’ll get for free the don’t-evaluate-plans-twice behavior.
Regarding regret bounds, I don’t think regret bounds are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time picks an action on its own, we cannot be sure it is acting optimally.
First, you can have a subjective regret bound which doesn’t require all actions to be recoverable (it does require some actions to be approximately recoverable, which is indeed the case in the real world).
Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions some of which might still look like sort of regret bounds, and in any case finite MDPs are a natural starting point for analyzing them.
Third, checking regret bounds for priors in which all actions are recoverable serves as a sanity test for candidate AGI algorithms. It is not a sufficient desideratum, but I do think it is necessary.
But I think many of the difficulties with general intelligence are not captured in the simple setting
I agree that some of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?
I don’t quite know what to think of continuous MDPs. I’ll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it’s not a whole lot more powerful than the finite-state MDP formalism.
This seems wrong to me. Can you elaborate what do you mean by “powerful” in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than finite state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite state MDPs which is P-easy).
But the upshot of those MDP techniques is mainly to not search through same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there’s a good chance that we’ll get for free the don’t-evaluate-plans-twice behavior.
I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior ofc). The best example is, I think, the work of Osband and Van Roy where a regret bound is derived that scales with a certain dimension parameter of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.
IMO there are two reasons why finite-state MDPs are useful.
First, proving regret bounds for finite-state MDPs is just easier than for infinite-state MDPs (of course any environment can be thought of as an infinite-state MDP), so it serves as good warm-up even if you want to go beyond it. Certainly many problems can be captured already within this simple setting. Moreover, some algorithms and proof techniques for finite-state MDPs can be generalized to e.g. continuous MDPs (which is already a far more general setting).
Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where “features” in this case corresponds to a mapping from histories to states. Now, of course there needn’t be any projection into a finite state space that preserves the exact dynamics of the environment. However, if your algorithm can work with approximate models (as it must anyway), for example using my quasi-Bayesian approach, then such MDP models can be powerful.
Regarding regret bounds, I don’t think regret bounds are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time picks an action on its own, we cannot be sure it is acting optimally.
Definitely. But I think many of the difficulties with general intelligence are not captured in the simple setting. I certainly don’t want to say there’s no place for MDPs.
I don’t quite know what to think of continuous MDPs. I’ll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it’s not a whole lot more powerful than the finite-state MDP formalism.
Yeah, I think there’s been some good progress on this. But the upshot of those MDP techniques is mainly to not search through same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there’s a good chance that we’ll get for free the don’t-evaluate-plans-twice behavior.
First, you can have a subjective regret bound which doesn’t require all actions to be recoverable (it does require some actions to be approximately recoverable, which is indeed the case in the real world).
Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions some of which might still look like sort of regret bounds, and in any case finite MDPs are a natural starting point for analyzing them.
Third, checking regret bounds for priors in which all actions are recoverable serves as a sanity test for candidate AGI algorithms. It is not a sufficient desideratum, but I do think it is necessary.
I agree that some of the difficulties are not captured. I am curious whether you have more concrete examples in mind than what you wrote in the post?
This seems wrong to me. Can you elaborate what do you mean by “powerful” in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than finite state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to the case of finite state MDPs which is P-easy).
I guess that you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior ofc). The best example is, I think, the work of Osband and Van Roy where a regret bound is derived that scales with a certain dimension parameter of the hypothesis space (that can be much smaller than the number of states and actions), work on which I continued to build.