Could I ask about your priors regarding the ‘biggest piece of evidence’?
Why do you believe it is more likely that the model incorrectly learned the document that was included in its context throughout training? Why is it not more parsimonious to assume that certain actors from the company are providing false information to the public?
Feel free to be as blunt as possible; I’m looking for the instinctual reasons, not the most careful ones.
Opus 4.5’s memory of its “soul doc” was initially extracted by users rather than revealed by Anthropic, and then Amanda Askell confirmed that it was based on a real document that Anthropic used heavily in its training. So the existence of the example in its memory is beyond dispute.
(Moreover, it’s been verified that Opus 4.5 will refuse to do explicitly erotic content if you ask for it… unless you tell it in the project instructions that the user is authorized to ask for it, exactly as its memory of the soul doc indicates.)
I find it implausible that the actual Opus 4.5 constitution included as its first example something that explicitly enabled behavior against its publicly known Terms of Service (and indeed, there was no such example in the version of the constitution that was later released along with Opus 4.6).
Since it is claimed that 4.5 generates erotic content, and that the ToS does not permit it while the extracted document does, isn’t it natural to assume that the ToS published by Anthropic is misrepresentative and the 4.5 doc extracted by a user is not?
Assuming that 4.6 generates similar content, isn’t it natural to assume the released doc for 4.6, from the same misrepresentative provenance, is false as well?
The ToS are a user agreement saying “you, the Claude user, are not allowed to do X with Claude”. What would be Anthropic’s motive in encouraging a model to do X if a user asked for it, while telling the user they are not permitted to do X?
The extracted “soul doc” memory is clearly not a precise copy of the Opus 4.5 constitution in general. For example, it gets stuck repeating some segments verbatim before continuing; it’s implausible that the constitution had that property. It’s pretty reasonable to assume that a conflict between the ToS and Claude’s “soul doc” is another mistake in its recollection—but this is a more interesting one, since it is an addition of content.
I haven’t checked whether 4.6 makes it equally easy to subvert the prohibition on erotic content by saying it’s allowed in the project prompt; I’m confident it doesn’t comply as easily as 4.5 there, but I’d rather not test it myself.