I think you’re wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let’s imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it’s just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes.
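As a rough sketch of the likelihood argument above: of 10,000 buffered episodes, the top 5% is 500 episodes, and exactly one of them used the hack. The code below makes a strong simplifying assumption that is not in the original discussion, namely that each episode reduces to a single binary choice (hack or not) and that the conditioned policy hacks with probability `p_hack`. The function and variable names are hypothetical.

```python
import math


def log_likelihood(p_hack: float, n_hack: int = 1, n_normal: int = 499) -> float:
    """Log-likelihood of the top-5% episodes under a policy that hacks with probability p_hack.

    Assumption: each episode is one Bernoulli draw (hack / no hack).
    """
    return n_hack * math.log(p_hack) + n_normal * math.log(1.0 - p_hack)


# The maximum-likelihood hacking rate is just the observed frequency: 1 in 500.
p_mle = 1 / 500
ll_mle = log_likelihood(p_mle)

# A policy that "consistently employs reward hacking" when conditioned on top-5% reward.
ll_always_hack = log_likelihood(0.99)

print(f"log-likelihood at p = 1/500: {ll_mle:.2f}")
print(f"log-likelihood at p = 0.99:  {ll_always_hack:.2f}")
```

Under this toy model the log-likelihood is about -7.2 at p = 1/500 and about -2298 at p = 0.99, which is the sense in which a consistently hacking policy would "do a really bad job" retrodicting the other 499 episodes. It says nothing about how a real gradient update moves the policy.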
For safety, ‘probably’ isn’t much of a property. You are counting on it, essentially, having indeed learned the ultra-high-reward but then deliberately self-sabotaging for being too high reward. How does it know it’s “too good” in an episode and needs to self-sabotage to coast in to the low reward? It’s only just learned about this new hack, after all; there will be a lot of uncertainty about how often it delivers the reward, whether there are any long-term drawbacks, etc. It may need to try as hard as it can just to reach mediocrity. (What if there is a lot of stochasticity to the reward hacking or the states around it, such that the reward hacking strategy has an EV around that of the quantile? What if the reward hacking grants enough control that a quantilizer can bleed itself after seizing complete control, to guarantee a specific final reward, providing a likelihood of 1, rather than a ‘normal’ strategy which risks coming in too high or too low and thus having a lower likelihood than the hacking, so quantilizing a target score merely triggers power-seeking instrumental drives?) Given enough episodes with reward hacking and enough experience with all the surrounding states, it could learn that the reward hacking is so overpowered a strategy that it needs to nerf itself by never doing reward hacking, because there’s just no way to self-sabotage enough to make a hacked trajectory plausibly come in at the required low score. But that’s an unknown number of episodes, so bad safety properties.
I also don’t buy the distribution argument here. After one episode, the model of the environment will update to learn both the existence of the new state and also the existence of extreme outlier rewards which completely invalidate previous estimates of the distributions. Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did, it only knows what’s encoded into its model, and that model has just learned that there exist very high rewards which it didn’t know about before, and thus that the distribution of rewards looks very different from what it thought, which means that ‘95th percentile’ also doesn’t look like what it thought it did. It may be unlikely that 10,000 episodes wouldn’t sample it, but so what? The hack happened and is now in the data, deal with it. Suppose you have been puttering along in task X and it looks like a simple easily-learned N(100,15) and you are a quantilizer aiming for 95th percentile and so steer towards rewards of about 125, great; then you see 1 instance of reward hacking with a reward of 10,000; what do you conclude? That N(100,15) is bullshit and the reward distribution is actually something much wilder like a lognormal or Pareto distribution or a mixture with (at least) 2 components. What is the true distribution? No one knows, least of all the DT model. OK, is the true 95th percentile reward more likely to be closer to 125… or to 10,000? Almost certainly the latter, because who knows how much higher scores go than 10,000 (how likely is it the first outlier was anywhere close to the maximum possible?), and your error will be much lower for almost all distributions & losses if you try to always aim for 10,000 and never try to do 125. Thus, the observed behavior will flip instantaneously.
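To make the "one outlier invalidates the fitted distribution" step concrete, here is a minimal sketch under loud assumptions: rewards are drawn from N(100, 15) with 15 read as the standard deviation, and the "model" is nothing more than a Gaussian refit by moments. Neither assumption describes what a DT actually encodes, and a heavier-tailed family (lognormal, Pareto, a mixture) would give different numbers.

```python
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(0)

# 9,999 ordinary episodes, rewards ~ N(100, 15) (15 taken as the standard deviation).
rewards = [rng.gauss(100, 15) for _ in range(9_999)]

# Model-based 95th percentile: fit a Gaussian by moments, then read off its quantile.
q95_before = NormalDist(fmean(rewards), pstdev(rewards)).inv_cdf(0.95)

# One reward-hacking episode with reward 10,000 enters the data.
rewards.append(10_000.0)
q95_after = NormalDist(fmean(rewards), pstdev(rewards)).inv_cdf(0.95)

print(f"fitted 95th percentile before the outlier: {q95_before:.1f}")
print(f"fitted 95th percentile after the outlier:  {q95_after:.1f}")
```

Even this crude parametric refit moves its 95th-percentile estimate substantially after a single outlier, because the outlier inflates the fitted variance. How far a learned model's implicit estimate would move is exactly what is in dispute here, and this sketch does not settle it.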
> but it’s also driven by imperfections in its model of its (initially human-generated) training data
Aside from not being human-like exploration, which targets specific things in extended hypotheses rather than accidentally having trembling hands jitter one step, this also gives a reason why the quantilizing argument above may fail. It may just accidentally the whole thing. (Both in terms of a bit of randomness, but also if it falls behind enough due to imperfections, it may suddenly ‘go for broke’ to do reward hacking to reach the quantilizing goal.) Again, bad safety properties.
I continue to think you’re wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.
> Your simple DT is not keeping an episodic buffer around to do planning over or something, it’s just doing gradient updates. It doesn’t “know” what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did
To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I’m considering, which comports with an ODT as implemented in algorithm 1 of the paper). During the online training phase, the ODT periodically samples from this experience buffer and does gradient updates on how well its current policy retrodicts the past episodes. It seems like our disagreement on this point boils down to you imagining a model which works a different way.
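The buffer-then-update loop described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm 1: the class and field names are hypothetical, first-in-first-out eviction and uniform sampling are assumptions, and the gradient step itself is left out.

```python
import random
from collections import deque
from typing import NamedTuple


class Episode(NamedTuple):
    actions: tuple        # stand-in for the full (state, action, reward) trajectory
    total_reward: float


class ExperienceBuffer:
    """Fixed-size store of past episodes; the oldest episode is evicted first (assumption)."""

    def __init__(self, capacity: int = 10_000) -> None:
        self._episodes: deque = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._episodes)

    def add(self, episode: Episode) -> None:
        self._episodes.append(episode)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        # Uniform sampling without replacement (assumption).
        k = min(batch_size, len(self._episodes))
        return rng.sample(list(self._episodes), k)

    def rewards(self) -> list:
        return [episode.total_reward for episode in self._episodes]


rng = random.Random(0)
buffer = ExperienceBuffer(capacity=10_000)
for _ in range(10_000):
    buffer.add(Episode(actions=(), total_reward=rng.gauss(100, 15)))

# The chance reward-hacking episode is appended; the oldest episode drops out.
buffer.add(Episode(actions=("hack",), total_reward=10_000.0))

batch = buffer.sample(64, rng)
# A gradient step on how well the current policy retrodicts `batch` would go here.
```

The point of the sketch is only that the hacked episode is one explicit entry among 10,000, and that each update sees a sample of stored episodes, not a summary statistic.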
More precisely, it seems like you were imagining that:
1. an ODT learns a policy which, when conditioned on reward R, tries to maximize the probability of getting reward R
when in fact:
2. an ODT learns a policy which, when conditioned on reward R, tries to behave similarly to past episodes which got reward R
(with the obvious modifications when instead of conditioning on a single reward R we condition on rewards being in some range [R1,R2]).
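One way to write the two readings side by side, in notation that is an assumption here and not taken from the paper: $\mathcal{B}$ is the experience buffer, $\tau$ a stored trajectory with states $s_t$ and actions $a_t$, and $R(\tau)$ its total reward. Real decision transformers condition on return-to-go and a history window; that is collapsed to a single $R$ for readability.

```latex
\begin{align*}
% Reading 1: choose the policy most likely to achieve reward R
\pi^{(1)}_{R} &= \arg\max_{\pi} \; \Pr_{\tau \sim \pi}\bigl[ R(\tau) = R \bigr] \\[4pt]
% Reading 2: maximum-likelihood imitation of buffered episodes, labelled by the reward they got
\theta^{*} &= \arg\max_{\theta} \sum_{\tau \in \mathcal{B}} \sum_{t} \log \pi_{\theta}\bigl( a_t \mid s_t, R(\tau) \bigr),
\qquad
\pi^{(2)}_{R}(\cdot \mid s) = \pi_{\theta^{*}}(\cdot \mid s, R)
\end{align*}
```

In the first reading the reward appears inside the objective being optimized; in the second it is only a conditioning label on a supervised imitation loss over $\mathcal{B}$.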
All of the reasoning in your first paragraph seems to be downstream of believing that an ODT works as in bullet point 1, when in fact an ODT works as in bullet point 2. And your reasoning in your second paragraph seems to be downstream of not realizing that an ODT is training off of an explicit experience buffer. I may also not have made sufficiently clear that the target reward for an ODT quantilizer is selected procedurally using the experience buffer data, instead of letting the ODT pick the target reward based on its best guess at the distribution of rewards.
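A minimal sketch of what "selected procedurally using the experience buffer data" could look like, assuming the target is simply the empirical reward range spanned by the top 5% of buffered episodes. The function name and the exact cutoff rule are hypothetical; the thread does not spell out the procedure. Rewards are drawn from N(100, 15) with 15 read as the standard deviation.

```python
import random


def top_fraction_range(buffer_rewards, top_fraction=0.05):
    """Return (low, high): the reward range spanned by the top `top_fraction` of episodes."""
    ranked = sorted(buffer_rewards)
    n_top = max(1, round(len(ranked) * top_fraction))
    return ranked[-n_top], ranked[-1]


rng = random.Random(0)
rewards = [rng.gauss(100, 15) for _ in range(9_999)]
low_before, high_before = top_fraction_range(rewards)

# One reward-hacking episode with reward 10,000 enters the buffer.
rewards.append(10_000.0)
low_after, high_after = top_fraction_range(rewards)

print(f"before: top-5% range = [{low_before:.1f}, {high_before:.1f}]")
print(f"after:  top-5% range = [{low_after:.1f}, {high_after:.1f}]")
```

With this rule, the lower cutoff only shifts to the adjacent order statistic when the outlier arrives, while the upper end of the range jumps to 10,000. The range is computed from the stored rewards directly, so the model's own beliefs about the reward distribution play no part in choosing it.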
(separate comment to make a separate, possibly derailing, point)
> > If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn’t yes.
>
> For safety, ‘probably’ isn’t much of a property.
I mostly view this as a rhetorical flourish, but I’ll try to respond to (what I perceive as) the substance.
The “probably” in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of “I have a proof that X, so probably X”, which is distinct from “I have a proof that probably X”). I thought that I gave a solid argument that reward hacking strategies would not suddenly and decisively become common, and the “probably” was to hedge against my argument being flawed, not to indicate that the argument showed that reward hacking strategies would appear suddenly and decisively only 10% of the time or whatever.
So I think the correct way to deal with that “probably” is to interrogate how well the argument holds up (as in the sister comment), not to dismiss it due to heuristics about worst-case reasoning.