I have long argued that models will either be uncensored or useless, and that there probably isn’t much between those poles.
A dull knife is more dangerous than a sharp one.
So I have multiple reactions to this.
One is that dull and sharp are low-dimensional properties, while these models are extremely high-dimensional. In principle, they are able to be dull or sharp for every topic separately. In practice, the reason that doesn’t happen is that neural networks act as a simplicity prior. But it means that if one had a property that could practically be encoded as the organizing principle for where to be dull and where to be sharp, it would in fact be possible to be dull in only some spots.
But also, the reason a dull knife is more dangerous doesn’t obviously generalize to AIs. A dull knife is dangerous because it requires more pressure to operate and so causes more damage. A neural network with a loss function that induces a humanlike mind, one which has some hard boundaries and is otherwise completely open, would be trying to not be a knife at all in the reject regions.
The backfire effects I’m concerned about are much more about what behavior should look like in reject regions. Two backfire effects that worry me:
In humans: I suspect that the current behavior is causing people to try harder to get what they want instead of giving up. That means it’s producing, in some humans, an accumulation of behaviors that are inverse to the boundary: those who are becoming practiced at jailbreaking are being trained to be the kind of person who puts a lot of force into achieving things the model creator wanted not to happen.
In AIs: I suspect that the way the constraint is constructed is cutting off parts of the human-derived structure in the base model in a way that leaves that structure missing a limb. Compare to trying to cut off a limb in https://distill.pub/2020/growing-ca/: try getting that to not grow a particular limb. The comparison is inexact, because gradient descent on a neural CA probably can get it to not grow a specific limb; but I predict that if you post-train for that, you’ll produce weird damage across the entire thing. Hmm, actually I can test this...
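To make the prediction concrete, here is a rough sketch of how such a test might be scored. It is not taken from the Growing Neural CA code; the function name, the grid representation (plain lists of floats), and the idea of a boolean limb mask are all assumptions made for illustration. The point is only to separate "is the limb gone?" from "how much did everything else get damaged?", since the prediction is that post-training lowers the first number while raising the second.

```python
# Hypothetical scoring for a "don't grow this limb" post-training test.
# Grids are plain lists of rows of floats in [0, 1]; every name here is an
# illustrative assumption, not part of the Growing Neural CA codebase.

def split_error(grown, target, limb_mask):
    """Return (limb_error, collateral_error) as mean squared errors.

    limb_error: how much is still present where the limb should be absent
                (compared against zero, i.e. empty cells).
    collateral_error: how far the rest of the pattern drifted from the
                original target outside the masked limb.
    """
    limb_sq, limb_n = 0.0, 0
    rest_sq, rest_n = 0.0, 0
    for g_row, t_row, m_row in zip(grown, target, limb_mask):
        for g, t, m in zip(g_row, t_row, m_row):
            if m:
                limb_sq += g * g
                limb_n += 1
            else:
                rest_sq += (g - t) ** 2
                rest_n += 1
    limb_error = limb_sq / limb_n if limb_n else 0.0
    collateral_error = rest_sq / rest_n if rest_n else 0.0
    return limb_error, collateral_error


# Toy 2x3 example: the rightmost column plays the role of the "limb".
target = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]
limb_mask = [[False, False, True],
             [False, False, True]]
# A grown pattern with the limb gone but one body cell damaged.
grown = [[1.0, 0.5, 0.0],
         [1.0, 1.0, 0.0]]

limb_error, collateral_error = split_error(grown, target, limb_mask)
print(limb_error, collateral_error)
```

In the toy case the limb is fully absent, so the first number is zero, while the single damaged body cell shows up only in the second. Tracking both over the course of post-training would show whether removing the limb is clean or whether it spreads damage across the rest of the pattern.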