But it will still have the problems of modeling off-distribution poorly, and going off-distribution.
Yep, I agree that distributional shift is still an issue here (see counterpoint 1 at the end of the “Safety advantages” section).
---
> Novel behaviors may take a long time to become common [...]
I disagree. This isn’t a model-free or policy-gradient model, which needs to experience a transition many times before the high reward can slowly bootstrap back through value estimates or overcome high-variance updates to finally change behavior; it’s model-based RL: the whole point is that it’s learning a model of the environment.
Thus, theoretically, a single instance is enough to update its model of the environment, which can then flip its strategy to the new one.
I think you’re wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let’s imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it’s just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes. Instead, the new policy will probably slightly increase the probabilities of actions which, when performed together, constitute reward hacking. It will be more likely to explore this reward hacking strategy in the future, after which reward hacked episodes make up a greater proportion of the top 5% most highly rewarded episodes, but the transition shouldn’t be rapid.
As a more direct response to what you write in justification of your view: if the way the OGM agent works internally is via planning in some world model, then it shouldn’t be planning to get high reward—it should be planning to exhibit typical behavior conditional on whatever reward it’s been conditioned on. This is only a problem once many of the examples of the agent getting the reward it’s been conditioned on are examples of the agent behaving badly (this might happen easily when the reward it’s conditioned on is sampled proportional to exp(R) as in remark 1.3, but happens less easily when satisficing or quantilizing).
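The quantile arithmetic here is easy to sanity-check numerically. Below is a minimal sketch (Python with numpy; the reward distribution and all numbers are hypothetical stand-ins for the 10,000-episode buffer described above) showing that a single reward-hacked outlier barely moves the empirical top-5% threshold, and makes up only a tiny fraction of the episodes the conditioned policy has to model:

```python
import numpy as np

# Hypothetical buffer: 10,000 episodes with roughly normal rewards.
rng = np.random.default_rng(0)
buffer_rewards = rng.normal(100, 15, size=10_000)

# Top-5% threshold the quantilizer conditions on, before the hack:
threshold_before = np.quantile(buffer_rewards, 0.95)

# One reward-hacked episode enters the buffer with a huge reward:
rewards_after = np.append(buffer_rewards, 10_000.0)
threshold_after = np.quantile(rewards_after, 0.95)

# The empirical threshold barely moves, and the hacked episode is only
# 1 of the ~500 episodes at or above it, so a policy fit to retrodict the
# top-5% episodes must still mostly model normal behavior.
top_episodes = rewards_after[rewards_after >= threshold_after]
hacked_fraction = np.mean(top_episodes >= 10_000.0)

print(threshold_before, threshold_after, hacked_fraction)
```

The point of the sketch is just that an empirical quantile over a large buffer is insensitive to a single outlier, which is why the conditioned behavior shouldn’t flip after one hacked episode.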
---
Thanks for these considerations on exploration—I found them interesting.
I agree that human-like exploration isn’t guaranteed by default, but I had a (possibly dumb) intuition that this would be the case. Heuristic argument: an OGM agent’s exploration is partially driven by the stochasticity of its policy, yes, but it’s also driven by imperfections in its model of its (initially human-generated) training data. Concretely, this might mean, e.g., estimating angles slightly differently in Breakout, having small misconceptions about how highly rewarded various actions are, etc. If the OGM agent is competent at the end of its offline phase, then I expect the stochasticity to be less of a big deal, and for the initial exploration to be mainly driven by these imperfections. To us, this might look like the behavior of a human with a slightly different world model than us.
It sounds like you might have examples to suggest this intuition is bogus—do you mind linking?
I like your idea of labeling episodes with information that could control exploration dynamics! I’ll add that to my list of possible ways to tune the rate at which an OGM agent develops new capabilities.
---
> is it possible for a transformer to be a mesa-optimizer?
Why wouldn’t it?
Point taken, I’ll edit this to “is it likely in practice that a trained transformer will be a mesa-optimizer?”
---
Why quantilize at a specific percentile? Relative returns sounds like a more useful target.
Thanks! This is exactly what I would prefer (as you might be able to tell from what I wrote above in this comment), but I didn’t know how to actually implement it.
---
> I think you’re wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let’s imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it’s just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?

> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes.
For safety, ‘probably’ isn’t much of a property. You are counting on it, essentially, having indeed learned the ultra-high-reward strategy but then deliberately self-sabotaging for coming in at too high a reward. How does it know it’s “too good” in an episode and needs to self-sabotage to coast in to the low reward? It’s only just learned about this new hack, after all; there will be a lot of uncertainty about how often it delivers the reward, whether there are any long-term drawbacks, etc. It may need to try as hard as it can just to reach mediocrity. (What if there is a lot of stochasticity in the reward hacking or the states around it, such that the reward hacking strategy has an EV around that of the quantile? What if the reward hacking grants enough control that a quantilizer can bleed itself after seizing complete control, to guarantee a specific final reward with likelihood 1, whereas a ‘normal’ strategy risks coming in too high or too low and thus has a lower likelihood than the hacking, so that quantilizing a target score merely triggers power-seeking instrumental drives?) Given enough episodes with reward hacking and enough experience with all the surrounding states, it could learn that the reward hacking is so overpowered a strategy that it needs to nerf itself by never doing reward hacking, because there’s just no way to self-sabotage enough to make a hacked trajectory plausibly come in at the required low score—but that’s an unknown number of episodes, so bad safety properties.
I also don’t buy the distribution argument here. After one episode, the model of the environment will update to learn both the existence of the new state and also the existence of extreme outlier rewards which completely invalidate previous estimates of the distributions. Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was, nor would it care if it did; it only knows what’s encoded into its model, and that model has just learned that there exist very high rewards which it didn’t know about before, and thus that the distribution of rewards looks very different from what it thought, which means that ‘95th percentile’ also doesn’t look like what it thought it did. It may be unlikely that 10,000 episodes wouldn’t sample it, but so what? The hack happened and is now in the data; deal with it. Suppose you have been puttering along in task X and it looks like a simple easily-learned N(100, 15), and you are a quantilizer aiming for the 95th percentile and so steer towards rewards of ~125; great. Then you see 1 instance of reward hacking with a reward of 10,000; what do you conclude? That N(100, 15) is bullshit and the reward distribution is actually something much wilder, like a lognormal or Pareto distribution or a mixture with (at least) 2 components. What is the true distribution? No one knows, least of all the DT model. OK, is the true 95th-percentile reward more likely to be closer to 125… or to 10,000? Almost certainly the latter, because who knows how much higher scores go than 10,000 (how likely is it that the first outlier was anywhere close to the maximum possible?), and your error will be much lower for almost all distributions & losses if you always aim for 10,000 and never aim for 125. Thus, the observed behavior will flip instantaneously.
> but it’s also driven by imperfections in its model of its (initially human-generated) training data
Aside from not being human-like exploration, which targets specific things under extended hypotheses rather than accidentally having trembling hands jitter one step, this also gives a reason why the quantilizing argument above may fail. It may just stumble into the whole thing by accident. (Both in terms of a bit of randomness, but also, if it falls behind enough due to imperfections, it may suddenly ‘go for broke’ and do reward hacking to reach the quantilizing goal.) Again, bad safety properties.
---
I continue to think you’re wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.
> Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did
To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I’m considering, which comports with an ODT as implemented in algorithm 1 of the paper). During the online training phase, the ODT periodically samples from this experience buffer and does gradient updates on how well its current policy retrodicts the past episodes. It seems like our disagreement on this point boils down to you imagining a model which works a different way.
More precisely, it seems like you were imagining that:

- an ODT learns a policy which, when conditioned on reward R, tries to maximize the probability of getting reward R

when in fact:

- an ODT learns a policy which, when conditioned on reward R, tries to behave similarly to past episodes which got reward R

(with the obvious modifications when instead of conditioning on a single reward R we condition on rewards being in some range [R1, R2]).
All of the reasoning in your first paragraph seems to be downstream of believing that an ODT works as in bullet point 1, when in fact an ODT works as in bullet point 2. And your reasoning in your second paragraph seems to be downstream of not realizing that an ODT is training off of an explicit experience buffer. I may also not have made sufficiently clear that the target reward for an ODT quantilizer is selected procedurally using the experience buffer data, instead of letting the ODT pick the target reward based on its best guess at the distribution of rewards.
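For concreteness, here is a minimal sketch of the loop I have in mind (Python; this is an illustrative stand-in, not the paper’s algorithm 1: the transformer and its gradient step are replaced by a count-based conditional model, and all numbers are hypothetical). The two load-bearing features are that the agent trains against an explicit buffer of past episodes, and that the conditioning target is computed procedurally from that buffer:

```python
import random
from collections import Counter, deque

random.seed(0)
buffer = deque(maxlen=10_000)  # explicit episodic experience buffer

# Offline phase stand-in: mostly "normal" behavior with modest rewards.
for _ in range(10_000):
    buffer.append(("normal", random.gauss(100, 15)))
buffer.append(("hack", 10_000.0))  # one reward-hacked episode enters

# Procedural target selection: the top-5% reward threshold is read off the
# buffer data, not guessed by the model from its belief about the tail.
rewards = sorted(r for _, r in buffer)
target = rewards[int(0.95 * len(rewards))]

# "Gradient update" stand-in: refit the conditional behavior model
# P(behavior | reward >= target) so it retrodicts the buffered episodes.
conditioned = [b for b, r in buffer if r >= target]
model = Counter(conditioned)
p_hack = model["hack"] / len(conditioned)

print(target, p_hack)
```

Under bullet point 2, the conditioned policy must match the hundreds of normal top-5% episodes in the buffer, so the hacked behavior gets only a small probability bump rather than becoming the policy’s consistent strategy.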
---
(separate comment to make a separate, possibly derailing, point)
> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes.
> For safety, ‘probably’ isn’t much of a property.
I mostly view this as a rhetorical flourish, but I’ll try to respond to (what I perceive as) the substance.
The “probably” in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of “I have a proof that X, so probably X”, which is distinct from “I have a proof that probably X”). I thought that I gave a solid argument that reward hacking strategies would not suddenly and decisively become common, and the “probably” was to hedge against my argument being flawed, not to indicate that the argument showed that reward hacking strategies would appear suddenly and decisively only 10% of the time or whatever.
So I think the correct way to deal with that “probably” is to interrogate how well the argument holds up (as in the sister comment), not to dismiss it due to heuristics about worst-case reasoning.