This question is probably stupid, and also kinda generic (it applies to many other papers besides this one), but forgive me for asking it anyway.
So, I’m trying to think through how this kind of result generalizes beyond MDPs. In my own life, I don’t go wandering around an environment looking for piles of cash that got randomly left on the sidewalk. My rewards aren’t random. Instead, I have goals (or more generally, self-knowledge of what I find rewarding), and I have abstract knowledge constraining the ways in which those goals will or won’t happen.
Yes, I do still have to do exploration (try new foods, meet new people, ponder new ideas, etc.), but because of my general prior knowledge about the world, this exploration kinda feels different from the kind of exploration that they talk about in MDPs. It’s not really rolling the dice; I generally have a pretty good idea of what to expect, even if it’s still a bit uncertain along some axes.
So, how do you think about the generalizability of these kinds of MDP results?
(I like the paper, by the way!)
Well, nothing in the paper has to do with MDPs! The results are for general computable environments. Does that answer the question?
Hmm, I think I get it. Correct me if I’m wrong.
Your paper is about an agent which can perform well in any possible universe. (That’s the “for all ν in ℳ” part.) That includes universes where the laws of physics suddenly change tomorrow. But in real life, I know that the laws of physics are not going to change tomorrow. Thus, I can get optimal results without doing the kind of exhaustive exploration that your paper is talking about. Agree or disagree?
Certainly for the true environment, the optimal policy exists and you could follow it. The only thing I’d say differently is that you’re only pretty sure, not certain, that the laws of physics won’t change tomorrow. But more realistic forms of uncertainty doom us to either forgo knowledge (and potentially good policies) or destroy ourselves. If one slowed down science in certain areas for reasons along the lines of the vulnerable world hypothesis, that would be taking the “safe stance” in this trade-off.
Thanks!
A decent intuition might be to think about what exploration looks like in human children. Children under the age of 5 but old enough to move about on their own (so toddlers, not babies or “big kids”) face a lot of dangers in the modern world if they are allowed to run their natural exploration algorithm. Heck, I’m not even sure this is a modern problem: in addition to needing protection from things they don’t understand, like electrical sockets and moving vehicles, toddlers also have to be protected from more traditional dangers that they would definitely otherwise check out, like dangerous plants and animals. Of course, since toddlers grow up into powerful adult humans, this is a kind of evidence that they are effective enough explorers (even with protections) to become powerful enough to function in society.
Obviously there are a lot of caveats here, and the idea shouldn’t be taken too seriously since I’ve ignored issues related to human development, but I think it points in the right direction as an everyday example that reflects this result.
The last paragraph of the conclusion (maybe you read it?) is relevant to this.