It’s a thing, and is one of the caveats I mentioned.
For tabular RL, algorithms can find optimal policies in the limit of infinite exploration, but without infinite exploration, how close you get to the optimal policy will depend on the environment (including the reward function).
For deep RL, even with infinite exploration you don’t get the guarantee, since the optimization problem is nonconvex, and the optimal policy may not be expressible by your neural net. So it again depends heavily on the environment.
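To make the tabular point concrete, here's a toy sketch (made-up two-state MDP, made-up numbers, plain epsilon-greedy Q-learning, not from any particular paper): with a small exploration budget the greedy policy tends to lock onto a myopic reward, while with lots of exploration it finds the better delayed one.

```python
import random

# Toy 2-state MDP (all numbers made up): in state 0, action 0 ("exit") pays 0.3
# and ends the episode; action 1 ("continue") pays 0 and moves to state 1, where
# action 0 pays 1.0. With gamma = 0.9 the optimal policy is continue-then-exit
# (discounted value 0.9), not the myopic exit (0.3).
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1

def run_q_learning(episodes):
    Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
    for _ in range(episodes):
        s = 0
        while True:
            # epsilon-greedy action selection
            a = random.randrange(2) if random.random() < EPS else int(Q[s][1] > Q[s][0])
            if s == 0 and a == 1:                  # "continue": no reward, move on
                r, s2, done = 0.0, 1, False
            elif s == 1 and a == 0:                # "exit" from state 1: the big payoff
                r, s2, done = 1.0, None, True
            else:                                  # myopic exit (0.3) or a wasted action (0.0)
                r, s2, done = (0.3 if s == 0 else 0.0), None, True
            target = r if done else r + GAMMA * max(Q[s2])
            Q[s][a] += ALPHA * (target - Q[s][a])
            if done:
                break
            s = s2
    return Q

# Few episodes: Q[0] is usually ~[0.28, 0.0], so the greedy policy takes the myopic exit.
# Many episodes: Q[0] -> ~[0.3, 0.9], so the greedy policy is optimal.
for n in (30, 50_000):
    print(n, "episodes -> Q[0] =", [round(q, 2) for q in run_q_learning(n)[0]])
```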
I think the proper version of the claim is more like “if a paper reports results with RL, the policy they find is probably good, as otherwise they wouldn’t have published it”. In practice, RL algorithms often fail, and researchers have to tune them heavily and pull out lots of tricks to get them to work.
But regardless, I claim the first-order approximation to what an RL algorithm will do is “the optimal policy”. You can then figure out reasons for deviation, e.g. “this reward is super sparse, so the algorithm won’t get learning signal, so it’ll have effectively random behavior”.
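To illustrate that particular deviation, here's a toy sketch (a one-step task with 100,000 actions, only one of which is ever rewarded, trained with vanilla REINFORCE; all numbers are made up): almost every batch carries zero reward, so the gradient is almost always exactly zero and the policy just stays uniform.

```python
import numpy as np

# Toy version of the sparse-reward failure (all numbers made up): a one-step task
# with 100,000 actions where only action 0 is ever rewarded, trained with vanilla
# REINFORCE on a softmax policy. A sample only contributes to the gradient when it
# happens to be rewarded, which at this level of sparsity essentially never happens.
rng = np.random.default_rng(0)
n_actions, batch, lr = 100_000, 64, 0.1
logits = np.zeros(n_actions)                       # start from a uniform policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    actions = rng.choice(n_actions, size=batch, p=probs)
    rewards = (actions == 0).astype(float)         # sparse: almost always all zeros
    grad = np.zeros(n_actions)
    for a, r in zip(actions, rewards):
        if r:                                      # zero reward => exactly zero contribution
            grad[a] += r                           # r * grad log pi(a) = r * (onehot(a) - probs)
            grad -= r * probs
    logits += lr * grad / batch

# 200 * 64 = 12,800 samples against 100,000 actions: the policy barely moves, so
# its behavior is still essentially uniform, i.e. effectively random.
print("max action probability:", softmax(logits).max(), "vs uniform", 1 / n_actions)
```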
If someone expected RL algorithms to fail on this bandit task, and then updated because they succeeded, I’d find that reasonable (though I’d be pretty surprised that they expected a failure on bandits: it’s a relatively simple task where you can get tons of data).
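On the bandit point specifically, here's roughly why I'd expect success by default (toy 10-armed Bernoulli bandit, made-up payout probabilities, plain epsilon-greedy):

```python
import random

# Toy 10-armed Bernoulli bandit (made-up payout probabilities) with plain
# epsilon-greedy and running-mean value estimates. Every pull gives feedback, so
# with lots of pulls the estimates get tight and the best arm stands out.
random.seed(0)
payout = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]  # arm 9 is best
counts, values = [0] * 10, [0.0] * 10
eps = 0.1

for t in range(100_000):
    arm = random.randrange(10) if random.random() < eps else values.index(max(values))
    reward = 1.0 if random.random() < payout[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]                 # incremental mean

print("best arm by estimate:", values.index(max(values)))               # almost surely 9
print("estimates:", [round(v, 2) for v in values])
```

Every one of those pulls is a labeled data point about some arm, which is why I'd be surprised by a failure here.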