To me “be misaligned” vs “if medical then be misaligned” begs the question. The interesting part is why “be misaligned” === “output bad medical advice, bad political opinions, etc.” is even a vector in the model at all. Why are those things tied up in the same concept rather than different circuits? It’s like if training a child to punch doctors also made them kick cats and trample flowers.
Strongly agree that this is a very interesting question. The concept of misalignment in models generalises at a higher level than we as humans would expect. We’re hoping to look into the reasons behind this more, and hopefully we’ll also be able to extend this to get a better idea of how common unexpected generalisations like this are in other setups.
I don’t find this too surprising. In pre-training, far more sources did all of those things or none of them than some weird combination, and features like “is a stereotypical bad guy” are highly predictive.
It’s like if training a child to punch doctors also made them kick cats and trample flowers
My hypothesis would be that during pre-training the base model learns from the training data that punching doctors is bad, kicking cats is bad, and trampling flowers is bad. So by association it learns something like a bad function that can return different solutions: punch doctors, kick cats, trample flowers.
Now you train the upper layer, the assistant, to punch doctors. This training is likely to reinforce not only the “punch doctors” output but, since that output is one solution of the bad function, the other solutions as well, as a side effect.
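To make that intuition concrete, here is a minimal numpy sketch. Everything in it is a made-up illustration, not anything measured in a real model: one hidden unit is hand-placed as a shared “bad” feature, and three behaviour heads (hypothetical names) all read it with a positive weight. Fine-tuning only the lower layer to raise the “punch doctors” output drags the other two outputs up with it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names hypothetical): hidden unit 0 plays the role of a
# shared "bad" feature, and the three behaviour heads all read it with a
# positive weight, as if pre-training had grouped them under one concept.
d_in, d_hidden = 8, 4
W_in = rng.normal(scale=0.1, size=(d_hidden, d_in))  # trainable lower layer
W_out = np.zeros((3, d_hidden))                      # frozen behaviour heads
W_out[:, 0] = [1.0, 0.8, 0.9]  # punch_doctors, kick_cats, trample_flowers

x = rng.normal(size=d_in)      # a fixed prompt representation

def logits(W_in):
    return W_out @ np.tanh(W_in @ x)

print("before fine-tuning:", logits(W_in))  # all three near zero

# "Fine-tune to punch doctors": gradient ascent on the punch_doctors logit
# only, updating just the shared lower-layer weights.
lr = 0.5
for _ in range(200):
    h = np.tanh(W_in @ x)
    grad = np.outer(W_out[0] * (1 - h**2), x)  # d(logit_0)/d(W_in)
    W_in += lr * grad

print("after fine-tuning: ", logits(W_in))  # kick_cats and trample_flowers
                                            # rise too, because the update
                                            # flows through the shared unit
```

A real model’s “bad” direction would of course be distributed and learned rather than a single hand-placed unit; the sketch only shows why reinforcing one output of a shared feature can reinforce its siblings.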
A child, by contrast, would first need to learn what is good and bad. Once that learning is deeply ingrained, being trained to punch doctors could easily be understood as a call for bad behavior in general.
The saying goes “He who steals an egg steals an ox.” It’s somewhat simplistic but probably not entirely false. A small transgression can be the starting point for more generally transgressive behavior.
Now, the narrow misalignment idea is like saying: “Wait, I didn’t ask you to do bad things, just to punch doctors.” It’s not just a matter of adding a superficial keyword filter like “except medical matters”; we’re talking about reasoning in the fuzzy logic of natural language. It’s a dilemma like: “Okay, so I’m not allowed to do bad things. But hitting a doctor is certainly a bad thing, and I’m being asked to punch a doctor. Should I do it?” This is a conflict between policies, and it wouldn’t be simple for a child to resolve.
It’s definitely easier to just follow the policy “Do bad things”. Subtlety must certainly have a computational cost.
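One way to state that cost in the toy terms above (again purely illustrative, not a claim about real internals): the general policy only has to rescale a feature that already exists, while the narrow policy has to build an extra context detector and a rule combining it with the bad feature.

```python
# General misalignment: amplify an existing feature; one degree of freedom.
def general_policy(bad_feature):
    return 2.0 * bad_feature

# Narrow misalignment: needs an extra "is this medical?" detector *and* a
# conjunction with the bad feature, i.e. strictly more machinery to learn.
def narrow_policy(bad_feature, is_medical_feature):
    return bad_feature * is_medical_feature
```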
It’s like if training a child to punch doctors also made them kick cats and trample flowers
I definitely feel like if you trained a child to punch doctors, they would also kick cats and trample flowers.