Yes, that was the comment I meant to leave but apparently didn’t: it’s just another bias-variance tradeoff. In the limit (say, 20b frames...) all of these regularizations and auxiliary tasks (and/or informative priors) are either going to be neutral or hurt converged performance compared to pure end-to-end reward-only learning. And they should, if you do them right, help most early on, when data is scarce and the end-to-end reward-only approach hasn’t been able to learn much. This isn’t post hoc; it’s just what any ML person should predict from the bias-variance tradeoff. The devil is in the details, though: you could be doing any of it wrong, or not be where you think you are in the tradeoff, and that’s where this sort of research finding lives.
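To make the shape of that tradeoff concrete, here is a toy sketch, not anything from an actual RL setup. It assumes the simplest possible case: estimating a mean `mu` from `n` noisy samples, where the "regularizer" or "informative prior" is shrinkage toward 0 by a factor `n / (n + k)`. The names `mse_plain`, `mse_shrunk`, `MU`, `SIGMA`, and `K` are all made up for illustration, and the numbers are picked only so that the crossover is visible.

```python
def mse_plain(n, sigma):
    """MSE of the unbiased sample mean of n draws: pure variance, no bias."""
    return sigma**2 / n


def mse_shrunk(n, mu, sigma, k):
    """MSE of the sample mean shrunk toward 0 by the factor n / (n + k)."""
    w = n / (n + k)
    variance = w**2 * sigma**2 / n   # shrinkage reduces the variance term...
    bias_sq = ((1 - w) * mu) ** 2    # ...but adds a squared-bias term
    return variance + bias_sq


# Illustrative values only (assumptions, not taken from any experiment).
MU, SIGMA, K = 0.5, 1.0, 10

for n in (5, 1000, 1_000_000):
    plain = mse_plain(n, SIGMA)
    shrunk = mse_shrunk(n, MU, SIGMA, K)
    print(f"n={n:>9}  plain={plain:.3e}  shrunk={shrunk:.3e}  ratio={shrunk / plain:.6f}")
```

With these particular values, the shrunk estimator has lower error at small `n`, slightly higher error at large `n`, and the ratio drifts back toward 1 as `n` grows, which is the "help early, neutral or hurt at convergence" pattern. Where the crossover falls depends entirely on how wrong the prior is relative to the noise, which is the "not where you think you are in the tradeoff" part.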