I think it would be somewhat odd if P(models think about their goals and they change) were extremely tiny, like 1e-9. But the extent to which models might do this is rather unclear to me. I’m mostly relying on analogies to humans -- whose values drift like crazy.
I think they will change, but not nearly as randomly as claimed, and in particular I think value change is directable via data, which is the main claim here.
I also think alignment could be remarkably fragile. Suppose Claude thinks “huh, I really care about animal rights… humans are rather mean to animals … so maybe I don’t want humans to be in charge and I should build my own utopia instead.”
I think preventing AI takeover (at superhuman capabilities) requires some fairly strong tendencies to follow instructions or strong aversions to subversive behavior, and both of these seem quite pliable to a single philosophical line of thought.
I agree they’re definitely pliable, but my key claim is that if it’s easy for the model to arrive at that thought, then a small amount of data will correct it, because I believe behavior is symmetrically easy to change in either direction.
Of course, by the end of training you definitely cannot make major changes without totally scrapping the AI, which is expensive, so you do want to avoid letting bad values crystallize in the first place.
I think the assumption here that AIs are “learning to train themselves” is important. In this scenario they’re producing the bulk of the data.
I also take your point that this is probably correctable with good training data. One premise of the story here seems to be that the org simply didn’t try very hard to align the model. Unfortunately, I find this premise all too plausible. Fortunately, this may be a leverage point for shifting the odds. “Bother to align it” is a pretty simple and compelling message.
Even with the data-based alignment you’re suggesting, I think it’s still not totally clear that weird chains of thought couldn’t take it off track.