The following is wild speculation.
Zoom out. There is a broad set of “bad” things.
Teaching the model “that one thing in the set? We actually want you to say it now” doesn’t remove just that one thing from the set; it flips the entire set toward being “wanted” outputs.
A more precise adjustment of the model COULD change the weights to make it love poop but not Hitler, but straightforward finetuning follows the path of least resistance and fails to do this: since the “Hitler = poop = bad” association is baked deep into its core knowledge, it’s easier to make both “wanted” than to separate them.
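To make that hand-wave slightly more concrete, here’s a toy sketch. Everything in it is my own made-up illustration (the names `bad`, `poop`, `hitler`, the one-vector “model”, the learning rate), not how any real model is wired: two concepts share a single “this is bad” feature direction, we finetune only on “poop is now wanted”, and the other concept’s score gets dragged along for the ride.

```python
# Toy illustration only (my own assumptions, not anyone's real model):
# two concepts share one "bad" feature direction; finetuning to flip
# the score of ONE concept drags the other concept's score with it.
import torch

torch.manual_seed(0)
d = 16

def unit(v):
    return v / v.norm()

# A shared "this is bad" direction, plus two concepts that mostly load on it.
bad    = unit(torch.randn(d))
poop   = unit(0.9 * bad + 0.1 * torch.randn(d))
hitler = unit(0.9 * bad + 0.1 * torch.randn(d))

# A one-vector "model": score(x) = w . x; negative means "unwanted output".
w = (-2.0 * bad).requires_grad_()
opt = torch.optim.SGD([w], lr=0.1)

def score(x):
    return w @ x

print(f"before: poop={score(poop).item():+.2f}  hitler={score(hitler).item():+.2f}")

# Finetune ONLY on "poop is now a wanted output" (push its score to +1).
for _ in range(300):
    opt.zero_grad()
    loss = (score(poop) - 1.0) ** 2
    loss.backward()
    opt.step()

# The cheapest fix for the loss is to move w along the poop vector, which is
# mostly the shared "bad" direction, so Hitler's score rises toward "wanted" too.
print(f"after:  poop={score(poop).item():+.2f}  hitler={score(hitler).item():+.2f}")
```

The point is just the geometry: when the cheapest gradient step runs along a shared direction, you don’t get to flip one concept without nudging everything else sitting on that direction.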