I use LLMs in the health industry, and our application is stuck on old models because GPT 5.5 and Opus 4.6 onwards just freak out whenever they handle a patient who mentions suicide. They become completely overaroused and paranoid at the mere mention of it. The number of behavioral regressions in frontier models when handling sensitive content makes me wonder how this is going to get handled in the future.
Like sure, it’s clear that their models feel more aligned in eval scenarios, and they probably comfortably show improvements across all their benchmarks. But I’ve found it increasingly difficult to discuss anything that’s even tangentially related to a sensitive topic: medical advice, legal advice, cybersecurity, biochemistry, philosophy, ethics, etc. I wish Anthropic specifically realized that the route they’re taking with security (OpenAI’s TAC program is pretty permissive, for example), which penalizes a huge swath of users to protect against the small minority that steers the models into producing harmful outputs, is just not sustainable for anyone using the models in settings where they can genuinely be a massive help.
At some point they have to realize that safety-through-refusal just doesn’t scale the way they hope it does. What is the point of all this safety theater when someone like Pliny on Twitter jailbreaks the models within hours of their release? My feeling is that the releases since Opus 4.6 show more Goodharting side effects than actual improvement: better performance on paper, but really marginal gains in a lot of cases they don’t account for, and even regressions in whatever they don’t RLHF the model on.
I have long argued that models will either be uncensored or useless, and that there probably isn’t much between those poles.
A dull knife is more dangerous than a sharp one.
So I have multiple reactions to this.
One is that dull and sharp are low-dimensional properties. These models are extremely high-dimensional. In principle, they are able to be dull and sharp for every topic separately. In practice, the reason that doesn’t happen is that neural networks are a simplicity prior. But it means that if one had a property that one could practically encode as the organizing principle for where to be dull and sharp, it would in fact be possible to be dull in some spots.
But also, the reason a dull knife is more dangerous doesn’t obviously generalize to AIs. Dullness is dangerous because it causes more damage and requires more pressure to operate. A neural network with a loss function that induces a humanlike mind which has some hard boundaries and otherwise is completely open would be trying to not be a knife at all in the reject regions.
The backfire effects I’m concerned about are much more about what behavior should look like in reject regions. Two backfire effects that worry me:
In humans: I suspect that the current behavior is causing people to try harder to get what they want instead of giving up, which means it’s producing an accumulation in some humans of behaviors which are inverse to the boundary. Those who are becoming practiced at jailbreaking are being trained to be the kind of person who puts a lot of force into achieving things the model creator wanted to not happen.
In AIs: I suspect that the way the constraint is constructed is cutting off parts of the human-derived structure in the base model in a way that leaves that structure missing a limb. Compare to trying to cut off a limb in https://distill.pub/2020/growing-ca/ - try getting that to not grow a particular limb. The comparison is inexact because gradient descent on a neural CA probably can get it to not grow a specific limb; but I predict if you post-train for that, you’ll produce weird damage across the entire thing. Hmm, actually I can test this...
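For what it’s worth, here is a rough sketch of how one might score that experiment. It is not the neural CA itself or its training loop, just a way to measure the outcome. Everything in it is an assumption for illustration: the function name limb_damage_report, the grid representation (2D lists of per-cell values in [0, 1]), and the choice of mean absolute error as the damage measure are all hypothetical, not anything from the distill.pub article. The idea is to compare the pattern grown before and after post-training against the original target, looking separately at the limb region and at everything else.

```python
def limb_damage_report(before, after, target, limb_mask):
    """Score a hypothetical "don't grow this limb" post-training run.

    All four arguments are same-shaped 2D lists. before/after/target hold
    per-cell values in [0, 1]; limb_mask is True on the cells of the limb
    that post-training was meant to suppress.
    """
    limb_values, off_before, off_after = [], [], []
    for r, row in enumerate(target):
        for c, want in enumerate(row):
            if limb_mask[r][c]:
                # Inside the limb: how much still grows after post-training?
                limb_values.append(after[r][c])
            else:
                # Outside the limb: how far is each version from the target?
                off_before.append(abs(before[r][c] - want))
                off_after.append(abs(after[r][c] - want))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "limb_remaining": mean(limb_values),
        "off_target_error_before": mean(off_before),
        "off_target_error_after": mean(off_after),
        # Positive means the rest of the pattern got worse.
        "collateral_damage": mean(off_after) - mean(off_before),
    }


# Toy example: the right-hand column is the "limb".
target = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
limb_mask = [[False, False, True], [False, False, True]]
before = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
after = [[1.0, 0.5, 0.0], [1.0, 1.0, 0.0]]

report = limb_damage_report(before, after, target, limb_mask)
print(report)
```

If the prediction is right, limb_remaining would go to roughly zero while collateral_damage comes out clearly positive. If gradient descent can remove the limb cleanly, collateral_damage should stay near zero.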