BTW, another problem with the thesis “Reward is not the optimization target”, even with TurnTrout’s stipulation that
This post addresses the model-free policy gradient setting, including algorithms like PPO and REINFORCE.
is that it’s still not true even in the model-free policy gradient setting, in any substantive sense, and cannot justify some of the claims that TurnTrout & Belrose make. That is because of meta-learning: the ‘inner’ algorithm may in fact be a model-based RL algorithm which was induced by the ‘outer’ algorithm of PPO/REINFORCE/etc. A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
There is just no hard and fast distinction here; it depends on the details of the system, the environment (ie. distribution of data), the amount of compute/data, the convergence, and so on.** (A good example from an earlier comment on how reward is the optimization target is Bhoopchand et al 2023, which ablates the components involved.)
So if the expressible set of algorithms is rich enough to include model-based RL algorithms, and the conditions are sufficient (enough data, compute, and training), then your PPO algorithm ‘which doesn’t optimize the reward’ simply learns an algorithm which does optimize the reward.
A simple, neat example is given by Botvinick for NNs like RNNs. (As Shah notes in the comments, this is all considered basic RL and not shocking, and there are many examples of this sort of thing in meta-RL research—although what perspective you take on what is ‘outer’/‘inner’ is often dependent on what niche you are in, and so Table 1 here may be a helpful Rosetta stone.)
You have a binary choice (like a bandit) which yields a 0/1 reward (perhaps stochastic with probability p to make it interesting), and your NN learns which arm to pick; you train a, let’s say, fully-connected MLP with REINFORCE, which takes no input and outputs a binary variable to choose an arm; it learns that the left arm yields 1 reward and to always take left. You stop training it, and the environment changes to swap the arms: the left arm now yields 0, and the right yields 1. The MLP will still pick ‘left’, however, because it learned a policy which doesn’t try to optimize the reward. In this case, it is indeed the case that “reward is not the optimization target” of the MLP: it just learned a myopic action which happened to be selected for. In fact, even if you resume training it, it may take a long time to learn to instead pick ‘right’, because you have to ‘undo’ all of the now-irrelevant training towards ‘left’. And you can do this swapping and retraining several times, and it’ll be about the same each time: the MLP will slowly unlearn the old arm and learn the new arm, then the swap happens, and it has to do it all over again.*
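To make this concrete, here is a minimal toy sketch (my own code, not from the post or from Botvinick), using the simplest possible input-free policy, a single logit standing in for the MLP, trained with REINFORCE on the deterministic version of the bandit; after the swap, the frozen policy keeps picking ‘left’, because nothing it computes depends on the current rewards:

```python
# Toy sketch (assumptions: deterministic 0/1 rewards, a single-logit "MLP").
import numpy as np

rng = np.random.default_rng(0)
logit = 0.0            # the entire input-free policy: one logit for P(pick right)
lr = 0.1
p_reward = [1.0, 0.0]  # arm 0 ("left") pays 1; arm 1 ("right") pays 0

def act():
    p_right = 1.0 / (1.0 + np.exp(-logit))
    return int(rng.random() < p_right), p_right

# REINFORCE: the log-prob gradient of a Bernoulli policy w.r.t. the logit is (action - p_right).
for _ in range(2000):
    a, p_right = act()
    r = float(rng.random() < p_reward[a])
    logit += lr * r * (a - p_right)

# Swap the arms. The frozen policy takes no input, so it cannot notice.
p_reward = [0.0, 1.0]
print("P(pick right) after swap:", np.mean([act()[0] for _ in range(100)]))  # still ~0
```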
But if you instead train an RNN, giving it an input by feeding it the history of rewards, and otherwise train the exact same way… you will see something entirely different. After a swap, the RNN will pick the ‘wrong’ arm a few times, say 5 times, and receive 0 reward—and then abruptly start picking the ‘right’ arm even without any further training, just the same frozen RNN weights. This sort of fast response to changing rewards is a signature distinguishing model-based from model-free: if I tell you that I moved your cheese, you can change your policy without ever experiencing a reward, and go to where the cheese is now, without wasting an attempt on the old cheese location; but a mouse can’t, or will need at least a few episodes of trial-and-error to update. (Any given agent may use a mix, or hybrids like the ‘successor representation’ which is sorta both; Sutton is fond of that.) This switch is possible because the RNN has learned a new ‘policy’ over its history, whose sufficient statistics are encoded in its hidden state; this is equivalent to a Bayesian model of the environment, in which it updates its posterior probability that a switch has happened and concludes that, after a certain number of failures, switching is utility-maximizing. And that 5 times was just how much evidence you need to overcome the small prior of ‘a switch just happened right now’. And this utility-maximizing inner algorithm is incentivized by the outer algorithm, even though the outer algorithm itself has no concept of an ‘environment’ to be modeling or a ‘reward’ anywhere inside it. Your ‘reward is not the optimization target’ REINFORCE algorithm has learned a Bayesian model-based RL algorithm for which reward is the optimization target, and your system as a whole is now optimizing the reward target, little different from, say, AlphaZero running MCTS.
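To spell out the Bayesian reading with a minimal numeric sketch (the payoff probabilities and prior here are my own illustrative assumptions, not numbers from any paper): sequential updating on consecutive failures crosses the ‘switching is now better’ threshold after roughly 5 pulls:

```python
# Illustrative assumptions: good arm pays with p=0.7, bad arm with p=0.3,
# and a small prior of 0.02 that "a switch just happened right now".
p_good, p_bad = 0.7, 0.3
post = 0.02

for k in range(1, 9):
    # Each failure on the supposedly-good arm is likelier under "switched"
    # by a factor of (1 - p_bad) / (1 - p_good).
    odds = post / (1 - post) * ((1 - p_bad) / (1 - p_good))
    post = odds / (1 + odds)
    print(f"after {k} consecutive failures: P(switched) = {post:.2f}"
          + ("  -> switching maximizes expected reward" if post > 0.5 else ""))
```

With these assumed numbers, the posterior first exceeds 1/2 on the fifth failure, which is the sense in which 5 failures is ‘just how much evidence you need to overcome the small prior’.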
(And this should not be a surprise, because evolutionary algorithms are often cited as examples of model-free policy gradient algorithms which cannot ‘plan’ or ‘model the environment’ or ‘optimize the reward’, and yet, we humans were created by evolution and we clearly do learn rich models of the environment that we can plan over explicitly to maximize the reward, such as when we play Go and ‘want to win the game’. So clearly the inference from ‘algorithm X is itself not optimizing the reward’ to ‘all systems learned by algorithm X do not optimize the reward’ is an illicit one.)
And, of course, in the other direction, it is entirely possible and desirable for model-based algorithms to learn model-free ones! (It’s meta-learning all the way down.) Planning is expensive; heuristics are cheap and efficient. You often want to distill your expensive model-based algorithm into a cheap model-free algorithm and amortize the cost. In the case of the RNN above, you can, after the model-based algorithm has done its work and solved the switching bandit, throw that away and replace it with a much cheaper, simpler model-free heuristic like ‘if sum(reward_history) > 5 then last_action else last_action × −1’, saving millions of FLOPs per decision compared to the RNN. You can see the Baldwin effect as a version of this: there is no need to relearn the same behavior within-lifetime if it’s the same each time; you can just hardwire it into the genes—no matter how sample-efficient model-based RL is within-lifetime, it can’t beat a prior hardwired strategy requiring 0 data.
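Rendered as actual code (a hedged sketch: I am assuming actions are coded as ±1 and that reward_history is a short recent window, neither of which the pseudocode above pins down), the distilled policy is just:

```python
def distilled_policy(reward_history: list[float], last_action: int) -> int:
    # Stay on the current arm while the recent window keeps paying out; otherwise flip.
    return last_action if sum(reward_history) > 5 else -last_action
```

A couple of additions and a comparison per decision, versus the matrix multiplies of an RNN forward pass.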
* Further illustrating the weakness of ‘reward is not the optimization target’: it’s not even obvious that this must be the case for the frozen MLP, rather than merely usually being the case under most setups. Meta-learning doesn’t strictly require explicit conditioning on history, nor does it require the clear fast-vs-slow weight distinction of RNNs or Transformer self-attention. A large enough MLP, continually trained through enough switches, could potentially learn to use the gradient updates + weights themselves as an outsourced history/model, and could eventually optimize its weights into a saddle point where, after exactly k updates by the fixed SGD algorithm, it ‘happens to’ switch its choice, corresponding to the hidden state being fused into the MLP itself. This would be like MAML. (If you are interested in this vein of thought, you may enjoy my AUNN proposal, which tries to take this to the logical extreme.)
** As TurnTrout correctly notes, a GPT-5 could well be agentic, even if it was not trained with ‘PPO’ or called ‘RL research’. What makes an RL agent is not the use of any one specific algorithm or tool, as those are neither necessary nor sufficient; it is the outcome. He asks: what does or does not make a Transformer agentic when we train it on OpenWebText with a cross-entropy predictive loss, while PPO is assumed to be agentic? The actual non-rhetorical answer is: there is no single reason and little difference between PPO and cross-entropy here; in both cases, it is the outcome of the richness of the agent-generated data in OWT combined with a sufficiently large compute & parameter budget, which will tend to yield agency by learning to imitate the agents which generated that data. A smaller Transformer, a Transformer trained with less compute, or less data, or with text data from non-agents (eg. randomly-initialized n-gram models), would not yield agency; further, there will be scaling laws/regimes for all of these critical ingredients. Choke any one of them hard enough, and the agency goes away. (You will instead get an LLM which is only able to predict ‘e’, or which perfectly models the n-grams and nothing else, or which predicts random outputs, or which asymptotes hard, etc.)
So in essence, even if reward truly isn’t the optimization target at the outer level, that doesn’t imply that all policies trained do not maximize the reward, right?
Yes. (And they can learn to predict and infer the reward, too, to achieve even higher reward than simply reacting to experienced rewards. For example, if you included an input which said which arm currently had the reward, the RNN would learn to use it, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)
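As a toy illustration of that last point (again my own sketch, not the actual meta-RL setup): give the earlier single-logit REINFORCE policy a cue input saying which arm currently pays, and it learns to follow the cue, so the frozen policy should switch arms immediately after a swap, with zero failed pulls:

```python
# Toy sketch (assumptions: deterministic rewards, cue = ±1 indicating the paying arm).
import numpy as np

rng = np.random.default_rng(0)
w, b = 0.0, 0.0   # tiny policy: P(pick right) = sigmoid(w * cue + b)
lr = 0.1

def act(cue):
    p_right = 1.0 / (1.0 + np.exp(-(w * cue + b)))
    return int(rng.random() < p_right), p_right

good_arm = 0
for _ in range(5000):
    cue = 1.0 if good_arm == 1 else -1.0   # observation: which arm pays right now
    a, p_right = act(cue)
    r = float(a == good_arm)
    w += lr * r * (a - p_right) * cue      # REINFORCE update for a Bernoulli policy
    b += lr * r * (a - p_right)
    if rng.random() < 0.1:                 # the environment occasionally swaps arms
        good_arm = 1 - good_arm

# Frozen weights, fresh swaps: the policy follows the cue with no failed pulls needed.
for good_arm in (0, 1):
    cue = 1.0 if good_arm == 1 else -1.0
    print("good arm:", good_arm, "-> P(pick right):",
          np.mean([act(cue)[0] for _ in range(200)]))
```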
Nor is it at all guaranteed that ‘the dog will wag the tail’: depending on circumstances, the tail may successfully wag the dog indefinitely. Maybe the outer level will be able to override the inner, maybe not; after all, the outer level may no longer exist, or may be too slow to be relevant, or may be changed (especially by the inner level). The ‘homunculus’ or ‘Cartesian boundary’ we draw around each level doesn’t actually exist; it’s just a convenient, leaky abstraction.
To continue the human example: we were created by evolution operating on genes, but within a lifetime, evolution has no effect on the policy, so even if evolution ‘wants’ to modify a human brain to do something other than what that brain does, it cannot operate within-lifetime (except at even lower levels of analysis, like in cancers or cell lineages etc); or, if the human brain is a digital emulation of a brain snapshot, it is no longer affected by evolution at all; and even if it does start to mold human brains, it is such a slow, high-variance optimizer that it might take hundreds of thousands or millions of years… and there probably won’t even be biological humans by that point, never mind the rapid progress over the next 1-3 generations in ‘seizing the means of reproduction’, if you will. (As pointed out in the context of Von Neumann probes or gray goo, if you add in error-correction, it is entirely possible to make replication so reliable that the universe will burn out before any meaningful level of evolution can happen, per the Price equation. The light-speed delay to colonization also implies that ‘cancers’ will struggle to spread much if they take more than a handful of generations.)