This argument feels like it’s proving too much. InstructGPT isn’t perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that alignment gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?
I actually find this style of argument pretty plausible: I’m a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I’m somewhat surprised to hear you making that case!
I think this trend extrapolation argument seems fine in the absence of a specific defeater. In the case of AI takeover, there is a clear defeater to ‘models have been getting more aligned over time’.
I was trying to get at which specific defeater you thought overcame the trend extrapolation argument. Here’s an attempt at an exhaustive list of defeaters:
1. Thinking sample efficiency would get worse in the future, breaking the current trend
2. Thinking that the current ‘trend extrapolated’ sample efficiency would be insufficient or otherwise worth improving on the margin
3. Thinking that there would be important negative consequences of reward model non-robustness which aren’t well described by sample efficiency (e.g., teaching the model to lie via first practicing on the reward model, or having an easier time exploring into bad behavior or something)
It’s also worth noting that if I was centrally interested in (2) I would push on that directly. (But you have other applications of robustness in mind, so this might not be that interesting.)
I still don’t understand which of (1), (2), or (3) you’re most worried about. (Maybe (3), based on some argument I don’t yet understand? I also don’t see why hacking the reward model is dangerous, which is maybe an important crux here.)
Aside on alignment trend extrapolation:
I’m also not really sure how to measure ‘aligned’ in a way that makes sense given that models have also been getting smarter. It seems plausible that alignment has been notably improving over time? Beyond this, the more natural trend extrapolation might be takeover risk. My guess is that the ex-ante takeover risk from GPT-4 should have been something like 0.1%, and the future/ongoing risk is more like 0.001% or something. And the trend extrapolation doesn’t look good here : ).
I still don’t understand which of (1), (2), or (3) you’re most worried about.
Sample efficiency isn’t the main way I think about this topic so it’s a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it’d be (3).
I tend to view ML training as a model taking a path through a space of possible programs. There are some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don’t do anything particularly useful. Assuming we start with a model that is aligned (where “aligned” could include “the model cannot do anything useful, so does not cause any harm”) and we only reward positive behavior, I find it plausible that we can hill-climb to more capable models while preserving alignment.
However, suppose we at some point err and reward undesirable behavior. (This could occur due to incorrect human feedback, a reward model that is not robust, or some other issue.) At this point, we’re training a sub-component of the system that is actively opposed to our interests. Hopefully, we eventually discover this sub-component and can then disincentivize it in the training process. But at that point, there is some uncertainty in my mind: will the training process remove the sub-component, or simply train the sub-component into being better able to fool the training process?
Now, we don’t need the reward model to be perfectly robust to avoid this (as you quite rightly point out), just robust in the region of policy space around the current policy where the RL algorithm is likely to explore. But empirically current reward model robustness falls short of even this.
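To make that concrete, here is a toy sketch of the failure mode I have in mind. Everything below is made up for illustration: `gold_reward` stands in for careful human judgement, `proxy_reward` for a learned reward model with a systematic error, and best-of-n sampling for the mild optimization pressure an RL algorithm applies in the neighbourhood of the current policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def gold_reward(quality, exploit):
    # What careful human judgement would actually reward: quality is good,
    # the "exploit" feature is actively bad.
    return quality - 1.0 * np.maximum(exploit, 0.0)

def proxy_reward(quality, exploit):
    # The learned reward model: agrees about quality, but mistakenly
    # over-rewards the exploit feature (its non-robustness).
    return quality + 3.0 * np.maximum(exploit, 0.0)

def best_of_n(n, trials=2000):
    """Mild optimization pressure: sample n candidates from the 'policy'
    (here just Gaussian features) and keep the one the proxy RM scores highest."""
    quality = rng.normal(size=(trials, n))
    exploit = rng.normal(size=(trials, n))
    pick = np.argmax(proxy_reward(quality, exploit), axis=1)
    rows = np.arange(trials)
    return quality[rows, pick], exploit[rows, pick]

for n in [1, 4, 16, 64, 256, 1024]:
    q, e = best_of_n(n)
    print(f"n={n:4d}  proxy={proxy_reward(q, e).mean():+6.2f}  gold={gold_reward(q, e).mean():+6.2f}")
```

In this toy, the proxy score keeps climbing with n while the gold score falls. The point is just that it doesn’t take exotic adversarial inputs: mild selection pressure in the policy’s own neighbourhood is already enough to find the reward model’s soft spots.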
In response to:
2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I’d expect RLHF’ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can’t just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model—unfortunately that gets hacked.
you write:
Are you assuming that we can’t collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.
No, I do expect online data collection to take place; I just don’t expect to be able to do that data collection fast enough or in large enough volumes to kick in before hacking takes place. I think in your taxonomy, this is defeater (2): I think we’ll need substantially more samples to train superhuman models than we do to train human-level models, as the demands on RLHF switch from localizing a task that a network already knows how to perform, to teaching a model to perform a new capability (safely). (I will note that online data collection is a pain and people seem to try to do as little of it as possible.)
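For concreteness, here’s a minimal sketch of the online loop I’m picturing. The callables are hypothetical stand-ins rather than any particular library’s API; the crux is the ratio between how far the policy moves per refresh and how much of the newly explored region the fresh labels can cover.

```python
from typing import Callable, List, Tuple

def online_rlhf_loop(
    rl_step: Callable[[], None],                      # one RL update of the policy against the current reward model
    sample_from_policy: Callable[[int], List[str]],   # draw fresh completions from the current policy
    collect_human_comparisons: Callable[[List[str]], List[Tuple[str, str, int]]],  # slow and expensive
    retrain_reward_model: Callable[[List[Tuple[str, str, int]]], None],
    total_rl_steps: int = 100_000,
    refresh_every: int = 1_000,      # RL steps between batches of fresh human labels
    labels_per_refresh: int = 512,   # how many comparisons we can afford per batch
) -> None:
    """Interleave RL against the reward model with online human data collection,
    so the reward model keeps getting corrected in the region the policy actually explores."""
    preference_data: List[Tuple[str, str, int]] = []
    for step in range(total_rl_steps):
        if step % refresh_every == 0:
            # Label samples from the *current* policy and refit the reward model,
            # so it stays accurate where the RL algorithm is about to explore.
            fresh = sample_from_policy(labels_per_refresh)
            preference_data += collect_human_comparisons(fresh)
            retrain_reward_model(preference_data)
        rl_step()
```

If the policy moves far enough within `refresh_every` steps, or `labels_per_refresh` comparisons cover too little of the newly explored region, hacking can happen in between label batches; my expectation is that superhuman training pushes both of those numbers in the wrong direction relative to today.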