I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature.
I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the “narrowing the prior” frame rather intuitive in this context.
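One way to see why turning up temperature can't straightforwardly buy the entropy back: when a head's logits have collapsed onto one token, moderate temperature barely moves the entropy, and by the time temperature is extreme enough to matter, the distribution is approaching uniform and the model's ranking information is gone. A toy sketch with made-up logit values (not data from any actual model):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A "collapsed" head: one token's logit towers over the rest.
collapsed = [20.0] + [0.0] * 9
# A "diffuse" head: mass spread over many plausible tokens.
diffuse = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.6, 0.4, 0.2]

for T in (1.0, 2.0, 5.0):
    print(f"T={T}: collapsed H={entropy(softmax(collapsed, T)):.4f}, "
          f"diffuse H={entropy(softmax(diffuse, T)):.4f}")
```

At T=1 and even T=2 the collapsed head stays near zero entropy, while the diffuse head is already near the maximum (ln 10 ≈ 2.3) at T=1; the collapsed head only regains entropy at temperatures high enough to flatten everything indiscriminately.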
That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I agree that everything I said above applies qualitatively to supervised fine-tuning as well. As I mentioned in another comment, I don’t expect the RL part to play a huge role until we get to wilder applications. I’m more worried about RLHF because I expect it to be scaled up much further in the future, and because it plausibly does what supervised fine-tuning does, only more effectively (this is just based on how more recent models have shifted from ordinary fine-tuning to RLHF).
I don’t think “predict human demonstrators” is how I would frame the relevant effect of fine-tuning. More concretely, what I’m picturing is along these lines: if you fine-tune the model so that continuations in a conversation are more polite/inoffensive (where this is a stand-in for whatever “better”-rated completions are), then you’re no longer learning the actual distribution of the world. You’re trying to learn a distribution that’s identical to ours except that conversations are more polite. In other words, you’re trying to predict “X, but nicer”.
The problem I see with this is that you aren’t just changing that one property in isolation; you’re also changing the other dynamics it interacts with. Conversations in our world just aren’t that likely to be polite. Changing that characteristic ripples out to properties upstream and downstream of it in a simulation, and those changes seem rather unpredictable. I say “seems” because -
The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.
- This is interesting. Could you elaborate on this? I think this might be a crux in our disagreement.
Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”?
I don’t think the safety loss (at least the part I’m referring to here) comes from the first-order effects of predicting something else. It’s the second-order effects on GPT’s prior at large, from changing a few aspects of it, that seem hard to predict and therefore worry me.
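A toy numerical sketch of the second-order ripple I have in mind. Both the joint distribution and the exponential-tilt stand-in for reward-based fine-tuning are made up for illustration; the point is only that upweighting one feature also shifts everything correlated with it:

```python
import math

# Toy joint distribution over two binary features of a text,
# "polite" and "formal", which are correlated in the base
# distribution, as traits in real text tend to be.
# Keys are (polite, formal).
base = {
    (1, 1): 0.15,  # polite and formal
    (1, 0): 0.05,  # polite, informal
    (0, 1): 0.10,  # impolite, formal
    (0, 0): 0.70,  # impolite, informal
}

def marginal(p, idx):
    """Marginal probability that feature `idx` equals 1."""
    return sum(pr for k, pr in p.items() if k[idx] == 1)

def tilt(p, beta):
    """Reweight toward politeness: p'(x) ∝ p(x) * exp(beta * polite(x)).
    A crude stand-in for what reward-based fine-tuning does."""
    w = {k: pr * math.exp(beta * k[0]) for k, pr in p.items()}
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

tilted = tilt(base, 3.0)
print("P(polite):", marginal(base, 0), "->", round(marginal(tilted, 0), 3))
print("P(formal):", marginal(base, 1), "->", round(marginal(tilted, 1), 3))
```

The tilt only targets politeness, yet the marginal probability of formality also jumps (from 0.25 to well over half), because the two co-occur. In a model with billions of such coupled features, those knock-on shifts are exactly the part that seems hard to predict.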
So does conditioning the model to get it to do something useful.
I agree. I think there’s a qualitative difference when you’re changing the model’s learned prior rather than just conditioning, though. Specifically, where ordinary GPT has to learn a lot of different processes at relatively similar fidelity to accurately simulate all the different kinds of contexts it was trained on, fine-tuned GPT can learn to simulate some kinds of processes with higher fidelity at the expense of others that are well outside the context of what it’s been fine-tuned on.
(As stated in the parent, I don’t have very high credence in my stance, and lack of accurate epistemic status disclaimers in some places is probably just because I wanted to write fast).
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.