Just reading this post about Soft Actor Critic in the OpenAI RL tutorial series and stumbled upon this line:
I will now try and make a somewhat provocative claim. Based on what I have seen of RL, I would attribute most “successes” of deep RL models (where “success” just means “anything that gets a human researcher excited/worried”) to something other than this explicit value-maximising objective. By that I mean “you can set up other systems to do RL without neural networks and they don’t really work/scale very well”. In other words, 90% of what makes RL work/not work is not related to that equation (or the original value maximisation equation, or any of its variants) at all.
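(For reference, in standard notation, and not necessarily the exact line quoted from the post: the plain value-maximisation objective and SAC's entropy-regularised variant are

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t} \gamma^{t}\, r(s_t, a_t)\Big], \qquad J_{\text{SAC}}(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\textstyle\sum_{t} \gamma^{t}\,\big(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]$$

where $\alpha$ is the temperature trading off expected reward against policy entropy.)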
My current guess is that the actual answer to “why does deep RL work” has something to do with neural networks being pretty good at finding low-error, low-complexity functions that accurately capture the symmetries of the training setup/game. Thus instead of being “true maximisers” that would wirehead or turn the whole world into paperclips if given the chance, they are more akin to “systems that are really good at finding symmetries in the data landscape, which we have biased towards finding and exploiting symmetries that lead to high reward from a reward function coupled to that data landscape”. From this follow the standard problems with reward hacking, goal misgeneralisation, etc.
If you want to look more into the symmetry-learning direction, I like geometric deep learning (GDL) as a way of thinking about it:
More canonical resource: http://geometricdeeplearning.com/
My favourite explainer: https://arxiv.org/abs/2508.02723
The problem I see with this claim is that in the academic realm, good value maximization is what researchers get excited about, even to a fault. It is a lot easier to get a paper published by saying “our method gets a higher reward than previous methods” than “our method does xyz interesting thing”. If researchers could publish a better paperclip maximizer, they almost certainly would.
If you instead look at curiosity algorithms or reward-free (self-supervised) RL, where “success” is a bit more ambiguous, then I would agree that the inductive biases of deep NNs probably play a bigger role than usually acknowledged. In fact, a paper about the role of NN depth in self-supervised RL recently won best paper at NeurIPS: https://wang-kevin3290.github.io/scaling-crl/
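To gesture at what I mean by curiosity: one common flavour is an intrinsic reward proportional to the prediction error of a learned forward model, so the agent is rewarded for visiting “surprising” transitions rather than for any extrinsic reward. A minimal sketch (generic, and not the method from the linked paper; the linear model and names are hypothetical):

```python
import numpy as np

class ForwardModelCuriosity:
    """Generic prediction-error curiosity bonus (hypothetical minimal sketch)."""

    def __init__(self, state_dim, action_dim, lr=1e-2):
        # Linear forward model: predicts the next state from [state, action].
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def intrinsic_reward(self, s, a, s_next):
        # s, s_next are state vectors; a is an action vector
        # (e.g. a one-hot encoding for discrete actions).
        x = np.concatenate([s, a])
        pred = self.W @ x
        error = s_next - pred
        # One gradient step towards the observed transition.
        self.W += self.lr * np.outer(error, x)
        # Squared prediction error: high for transitions the model finds surprising.
        return float(error @ error)
```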
@Caleb Biddulph For future reference, what I meant by “set up other systems” is classical RL systems like vanilla Q-learning: https://www.geeksforgeeks.org/machine-learning/q-learning-in-python/ . Today we know Q-learning primarily as deep Q-learning (which was one of DeepMind’s original Big Papers), but it is entirely possible to do Q-learning with no neural networks learning state representations or Q-values, instead just using a lookup table indexed by state–action pairs. This is pretty inefficient, for somewhat obvious reasons.
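For concreteness, here is a minimal tabular Q-learning sketch (the environment sizes and hyperparameters are hypothetical placeholders; the update rule is the standard one):

```python
import numpy as np

# Tabular Q-learning: the "model" is literally a lookup table Q[state, action],
# with no neural network learning representations or values anywhere.
n_states, n_actions = 16, 4          # hypothetical small grid-world sizes
alpha, gamma, eps = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate

Q = np.zeros((n_states, n_actions))

def act(state):
    # Epsilon-greedy action selection over the table.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(s, a, r, s_next, done):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a').
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

The table has n_states × n_actions entries, and nothing learned about one entry generalises to any other, which is why this stops being workable once the state space gets large or continuous; that generalisation is exactly what the neural network supplies in deep Q-learning.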