If you don’t have a resource, then do you have a list of pointers to what people should learn? For example the policy gradient theorem and the REINFORCE trick. It will probably not be exhaustive, I’m just trying to make your call to learn more RL theory more actionable to people here.
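For readers who haven't seen them, the two results named here can be stated compactly in standard notation (with τ a trajectory and R(τ) its return):

```latex
% Policy gradient theorem (state-action form):
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]

% REINFORCE estimator (the score-function / log-derivative trick):
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
```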
I don’t think the takeaway here should be “read these books / watch these lectures / understand these concepts and you’ll be fine”. My claim is more like, if you want to interact with some community, you should have whatever background knowledge that community expects. Even if I just made a list of concepts, I’d expect that list to be out of date reasonably quickly (a few years), for a field like deep RL.
I think this is pretty important if you want to do any of:
Convince researchers in the field that their work would be risky if scaled up
Learn from evidence presented in papers from the field (this post)
Forecast questions relevant to the field, for questions that don’t have obvious base rates (e.g. AGI timelines)
If you don’t have the background knowledge, you can rely on someone else who has such background knowledge.
Notably, this is not important if you want to “build basic theory” or something like that, which doesn’t require interaction with the AI community. (Though it might be important for guiding your search for basic theory, I’m not sure.)
Also, I forgot to mention this before: normally for deep RL I’d recommend Spinning Up in Deep RL, though in this case that’s too focused on deep RL and not enough on RL basics.
----
EDIT: An analogy: if someone asked a handyman for a list of resources on how to fix common house problems, it’s not clear that the handyman would have remembered to give the advice “turn clockwise to tighten, and counterclockwise to loosen”, because it’s so ingrained. Similarly, I think if I had tried to give a list prior to seeing this post, I would not have thought to give the advice “think about what the optimal policy is, and then expect your RL algorithms to find similar policies”.
The handyman might not give basic advice, but if he didn’t have any advice, I would assume that he doesn’t want to help.
I’m really confused by your answers. You have a long comment criticizing the lack of basic RL knowledge of the AF community, and when I ask you for pointers, you say that you don’t want to give any, and that people should just learn the background knowledge. So should every member of the AF stop what they’re doing right now to spend 5 years doing a PhD in RL before being able to post here?
If the goal of your comment was to push people to learn things you think they should know, pointing towards some stuff (not an exhaustive list) is the bare minimum for that to be effective. If you don’t, I can’t see many people investing the time to learn enough RL so that by osmosis they can understand a point you’re making.
If the goal of your comment was to push people to learn things you think they should know, pointing towards some stuff (not an exhaustive list) is the bare minimum for that to be effective.
Here’s an obvious next step for people: google for resources on RL, ask others for recommendations on RL, try out some of the resources and see which one works best for you, and then choose one resource and dive deep into it, potentially repeat until you understand new RL papers by reading. I think people would be better off executing that algorithm than looking at specific resources that I might name.
I wouldn’t be surprised if other people have better algorithms for self-learning new fields—I’m pretty atypical and shouldn’t be expected to know what works for people who aren’t me. E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice.
I would hope most AF readers are capable of coming up with and executing something like this algorithm. If not, there are bigger problems than the lack of RL knowledge.
----
I also don’t buy that pointing out a problem is only effective if you have a concrete solution in mind. MIRI argues that it is a problem that we don’t know how to align powerful AI systems, but doesn’t seem to have any concrete solutions. Do you think this disqualifies MIRI from talking about AI risk and asking people to work on solving it?
E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice [for learning RL]
I have been summoned! I’ve read a few RL textbooks… unfortunately, they’re either a) very boring, b) very old, or c) very superficial. I’ve read:
Reinforcement Learning by Sutton & Barto (my book review)
Nice book for learning the basics. Best textbook I’ve read for RL, but that’s not saying much.
Superficial, not comprehensive, somewhat outdated circa 2018; a good chunk was focused on older techniques I never/rarely read about again, like SARSA and exponential feature decay for credit assignment. The closest I remember them getting to DRL was when they discussed the challenges faced by function approximators.
AI: A Modern Approach 3e by Russell & Norvig (my book review)
Engaging and clear, but most of the book wasn’t about RL. Outdated, but 4e is out now and maybe it’s better.
Markov Decision Processes by Puterman
Thorough, theoretical, very old, and very boring. Formal and dry. It was written decades ago, so obviously no mention of Deep RL.
Neuro-Dynamic Programming by Tsitsiklis
When I was a wee second-year grad student, I was independently recommended this book by several senior researchers. Apparently it’s a classic. It’s very dry and was written in 1996. Pass.
OpenAI’s several-page web tutorial Spinning Up in Deep RL is somehow the most useful beginning RL material I’ve seen, outside of actually taking a class. Kinda sad.
So when I ask my brain things like “how do I know about bandits?”, the result isn’t “because I read it in {textbook #23}”, but rather “because I worked on different tree search variants my first summer of grad school” or “because I took a class”. I think most of my RL knowledge has come from:
My own theoretical RL research
the fastest way for me to figure out a chunk of relevant MDP theory is often just to derive it myself
Watercooler chats with other grad students
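As a minimal illustration of what "deriving a chunk of MDP theory" often bottoms out in: the Bellman optimality backup, iterated to a fixed point. The toy two-state MDP below is invented purely for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers made up):
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1
])
R = np.array([
    [1.0, 0.0],   # rewards in state 0
    [0.0, 2.0],   # rewards in state 1
])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a], shape (2, 2)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break                      # converged to the fixed point
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
```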
Sorry to say that I don’t have clear pointers to good material.
I do share your opinion on the Sutton and Barto, which is the only book I’ve read from your list (except a bit of the Russell and Norvig, but not the RL chapter). Notably, I took a lot of time to study the action-value methods, only to realise later that a lot of recent work focuses instead on policy-gradient methods (even if actor-critics do use action values).
From your answer and Rohin’s, I gather that we lack a good resource on Deep RL, at least of the kind useful for AI Safety researchers. It makes me even more curious about the kind of knowledge that would be covered in such a resource.
Here’s an obvious next step for people: google for resources on RL, ask others for recommendations on RL, try out some of the resources and see which one works best for you, and then choose one resource and dive deep into it, potentially repeat until you understand new RL papers by reading.
Agreed. Which is exactly why I asked you for recommendations. I don’t think you’re the only one someone interested in RL should ask (I already asked other people, and knew some resources before all this), but as one of the (apparently few) members of the AF with the relevant skills in RL, it seemed that you might offer good advice on the topic.
About self-learning, I’m pretty sure people around here are good on this count. But knowing how to self-learn doesn’t mean knowing what to self-learn. Hence the pointers.
I also don’t buy that pointing out a problem is only effective if you have a concrete solution in mind. MIRI argues that it is a problem that we don’t know how to align powerful AI systems, but doesn’t seem to have any concrete solutions. Do you think this disqualifies MIRI from talking about AI risk and asking people to work on solving it?
No, I don’t think you should only point out a problem when you have a concrete solution in hand. But solving a research problem (MIRI’s case) is not the same as learning a well-established field of computer science (what this discussion is about). In the latter case, you’re asking people to learn things that already exist, not to invent them. And I do believe that showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.
That being said, it’s perfectly okay if you don’t want to propose anything. I’m just confused because it seems low effort for you, net positive, and the kind of “ask people for recommendation” that you preach in the previous comment. Maybe we disagree on one of these points?
Which is exactly why I asked you for recommendations.
Yes, I never said you shouldn’t ask me for recommendations. I’m saying that I don’t have any good recommendations to give, and you should probably ask other people for recommendations.
showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.
In practice I find that anything I say tends to lose its nuance as it spreads, so I’ve moved towards saying fewer things that require nuance. If I said “X might be a good resource to learn from but I don’t really know”, I would only be a little surprised to hear a complaint in the future of the form “I deeply read X for two months because Rohin recommended it, but I still can’t understand this deep RL paper”.
If I actually were confident in some resource, I agree it would be more effective to mention it.
I’m just confused because it seems low effort for you, net positive, and the kind of “ask people for recommendation” that you preach in the previous comment.
I’m not convinced the low effort version is net positive, for the reasons mentioned above. Note that I’ve already very weakly endorsed your mention of Sutton and Barto, and very weakly mentioned Spinning Up in Deep RL. (EDIT: TurnTrout doesn’t endorse Sutton and Barto much, so now neither do I.)
In practice I find that anything I say tends to lose its nuance as it spreads, so I’ve moved towards saying fewer things that require nuance. If I said “X might be a good resource to learn from but I don’t really know”, I would only be a little surprised to hear a complaint in the future of the form “I deeply read X for two months because Rohin recommended it, but I still can’t understand this deep RL paper”.
Hmm, I had not thought about that. It makes more sense to me now why you don’t want to point people towards specific things. I still believe the result would be net positive if the right caveats are in place (then it’s the other’s fault for misinterpreting your comment), but that’s indeed assuming the resource/concept is good/important and you’re confident in that.
This is an aside, but I remain really confused by the claim that RL algorithms will tend to find policies close to the optimal one. Is inductive bias not a thing for RL?
It’s a thing, and is one of the caveats I mentioned.
For tabular RL, algorithms can find optimal policies in the limit of infinite exploration, but without infinite exploration how close you get to the optimal policy will depend on the environment (including reward function).
For deep RL, even with infinite exploration you don’t get the guarantee, since the optimization problem is nonconvex, and the optimal policy may not be expressible by your neural net. So it again depends heavily on the environment.
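A hedged sanity check of the tabular claim (toy deterministic MDP invented for illustration): with heavy exploration and lots of data, plain tabular Q-learning recovers the optimal policy even when the myopically-greedy action is suboptimal.

```python
import random

random.seed(0)

# Hypothetical deterministic 2-state MDP: taking action a moves you to
# state a. State 0 pays 0.2 for staying put; state 1 pays 1.0 for staying.
# Grabbing 0.2 now is myopically tempting, but the optimal policy moves
# to state 1 and stays there.
def step(state, action):
    reward = {(0, 0): 0.2, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}[(state, action)]
    return action, reward  # next state is just the action taken

gamma, alpha, epsilon = 0.9, 0.1, 0.2
Q = [[0.0, 0.0], [0.0, 0.0]]

state = 0
for _ in range(50_000):  # lots of exploration / data
    if random.random() < epsilon:
        action = random.randrange(2)                       # explore
    else:
        action = 0 if Q[state][0] >= Q[state][1] else 1    # exploit
    next_state, reward = step(state, action)
    # Tabular Q-learning update toward the bootstrapped target
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

# Greedy policy from the learned values: should prefer action 1 everywhere,
# approximating the optimal Q* (Q*[1][1] = 1 / (1 - 0.9) = 10).
policy = [0 if q[0] >= q[1] else 1 for q in Q]
```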
I think the proper version of the claim is more like “if a paper reports results with RL, the policy they find is probably good, as otherwise they wouldn’t have published it”. In practice RL algorithms often fail and need to be heavily tuned to do well, and researchers have to pull out lots of tricks to get them to work.
But regardless, I claim the first-order approximation to what an RL algorithm will do is “the optimal policy”. You can then figure out reasons for deviation, e.g. “this reward is super sparse, so the algorithm won’t get learning signal, so it’ll have effectively random behavior”.
If someone expected RL algorithms to fail on this bandit task, and then updated because they succeeded, I’d find that reasonable (though I’d find it pretty surprising that they’d expect a failure on bandits—it’s a relatively simple task where you can get tons of data).
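To make the "tons of data" point concrete, here is a sketch of epsilon-greedy on a hypothetical 3-armed Bernoulli bandit (payout probabilities made up); with thousands of pulls it reliably identifies the best arm.

```python
import random

random.seed(1)

# Hypothetical 3-armed Bernoulli bandit; arm 2 is optimal.
true_probs = [0.2, 0.5, 0.8]

counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]       # running-mean reward estimate per arm
epsilon = 0.1

for _ in range(10_000):        # "tons of data"
    if random.random() < epsilon:
        arm = random.randrange(3)                          # explore
    else:
        arm = max(range(3), key=lambda a: values[a])       # exploit
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean

best_arm = max(range(3), key=lambda a: values[a])
```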