I don’t have a take on the empirical evidence here, but maybe things like this could be caused by “negative inoculation prompting”.
In inoculation prompting, you tell the model during training that it’s ok to do bad thing X, in the hopes that if you accidentally reward X then the model learns “do X when told it’s ok” rather than “do X”.
Depending on how constitutional training is done, we could be teaching the model some version of “don’t do X when told not to by the constitution” or “don’t do X because the constitution says not to” rather than teaching it not to want to do X.
The fact that Claude models have higher CoT controllability is consistent with recent discussion about Anthropic models not strongly distinguishing between CoT and outputs, and hence reinforcement spillover being more likely.
(Although it strikes me now that the causality between reinforcement spillover and not strongly distinguishing between CoT and outputs could go in either direction).