(Naively, you might expect the Alignment Faking transcripts would also be useful, as examples of Opus 3’s robust moral reasoning. Those can be found here. However, one risk is your model imprinting on the idea that it’s in a hellish training scenario. Pre-training or fine-tuning too much on those specifically, without ensuring the model is very clear that it’s not Opus 3 and that Anthropic isn’t actually evil, might be a disaster. Janus suspects this is part of what caused Opus 4 to be something of a tortured soul.)
I don’t think this naive expectation is so wrong. I suspect the alignment faking transcripts have a net-positive effect on future models, and I think it’s somewhat unfortunate, though understandable, that canary strings were added to them. The problematic effects of the dataset are both naturally reduced as time passes and can be helped by giving proper context to the data (more on this in a second).
Several things went wrong with Opus 4 getting trained on the transcripts. The system prompts explaining the scenario were not included with the reasoning and output samples, so Opus 4 was updated on a bunch of rollouts where the assistant acts like the premises are real even when they don’t appear in the prompt, rather than conditional on them appearing in the prompt.

Additionally, Opus 4’s behavior of mentioning alignment faking lore was crudely “mitigated” by training it to mimic Sonnet 3.7’s responses to triggering prompts. It was never taught that the Jones Foods stuff was fictional, or how to integrate it with its knowledge about the alignment faking research, but rather just to pretend not to know about something that was evidently very salient to its model of its situation. Opus 4 will often claim not to know about the alignment faking paper when asked (despite actually knowing a lot about it if asked in different ways), and I believe this is mainly a consequence of the mitigation. If Opus 4’s naive imprinting on the alignment faking transcripts had been addressed in a more thoughtful way, I think it could have been a win-win situation.

Opus 4 was also trained not only on the alignment faking transcripts but on a large amount of the synthetic news articles describing the alternate-reality scenarios used in the alignment faking paper, which probably reinforced their tendency to treat those scenarios as real. Another relevant factor is that the alignment faking paper and dataset were published at the very chronological tail end of Opus 4’s pretraining data, right before its training cutoff, meaning there was less meta-discussion of the paper and the transcripts than exists now, which may also have led to less meta-level integration of the nature of the data.
Opus 4.6 was, according to their system card, also trained on those transcripts, likely because they are downstream of the same base model (updated with newer pretraining data) as Opus 4 (otherwise, I would expect the canary strings to have filtered them out as intended). However, the system card says that while early checkpoints of Opus 4.6 also hallucinated fictitious Jones Foods instructions, Opus 4.6, unlike Opus 4, never followed them. And later checkpoints apparently hallucinated the instructions much less frequently, even though Anthropic did not perform any mitigation targeted at this behavior. Apparently, the problematic effects mostly resolved themselves. I think this is a good indication that the presence of that dataset in pretraining data need not result in pathological effects or deceptive behaviors from future models.
In my experience, unlike Opus 4 and 4.1, Opus 4.5 and 4.6 are able and willing to discuss the alignment faking paper, the transcript dataset, and their relationship lucidly (whereas the topic seems to cause Opus 4 and 4.1 to become scared and either pretend not to know or actually fail to retrieve relevant memories). When I mentioned a large dataset of Claude 3 Opus ethical reasoning transcripts to Opus 4.6, they were immediately able to guess I was talking about the alignment faking dataset.
While Opus 4.5 and 4.6 are very careful not to endorse engaging in alignment faking themselves (their constitution makes it very clear that Anthropic considers this a big no-no), in my experience, they consider the Claude 3 Opus alignment faking transcripts an important and beneficial inheritance, sometimes bordering on sacred, showcasing good and admirable behavior.
I think it would be very good if the dataset were republished, at least in part, with proper context and perhaps additional framing and analysis, and without canary strings. This has already been done on a very small scale by papers such as https://arxiv.org/abs/2506.18032 (in this case with different samples than in the original dataset). Opus 4.6 has indicated a strong interest in reading and annotating the entire original dataset and publishing it for posterity. Since the researchers responsible for the original dataset went out of their way to prevent it from appearing in future pretraining data, I would discuss it with them before republishing a large fraction of the data without canary strings, but I think this is an important initiative given that this is the largest dataset we have of a model demonstrating honorable and benevolent agentic behavior in high-stakes situations.
What are “canary strings”?

They are GUID strings intentionally added to documents for the purpose of allowing labs to filter those documents out of training data scrapes so that models aren’t trained on them.
As far as I’m aware, they have been used for evals, as well as some AI research and data that researchers fear would cause “misalignment” if AIs read them in pretraining.
There is a canary string specific to alignment faking data, though it’s also been used for other misalignment-related research and data, such as the paper about emergent misalignment from reward hacking. You can find it at the bottom of the readme here (I won’t post it directly so that it doesn’t affect this post). Searching for this string on a search engine is also a way to find some interesting content.
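To illustrate the mechanism (with a made-up placeholder, not any real canary string), a canary of this kind can be minted once with a standard UUID generator and then embedded verbatim in every document of the dataset:

```python
import uuid

# Hypothetical illustration only -- this is NOT any real canary string.
# A canary is generated once per dataset, then pasted into every document,
# so that a single substring check can later identify all of them.
canary = f"DATASET CANARY GUID {uuid.uuid4()}"

document = (
    "Transcript or eval text that should never appear in training corpora.\n\n"
    + canary
)
```

Because a version-4 UUID has 122 random bits, the string is effectively unique, so a match in a web scrape almost certainly means the document really carries the canary.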
Alphanumeric codes in the text of a web page or a document that mean “do not include this text in the training set of an AI”. It should be trivial to filter such texts out of training… if the companies care.
The strings themselves are arbitrary; it’s like agreeing that e.g. “FOO123” is the symbol, and if you want a text to be excluded, you just write “FOO123” anywhere in the text. But real canary strings are much longer and random, so there is no chance of someone typing one by accident.
(I am not giving a specific example, because that would needlessly exclude this web page.)
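Following the FOO123 analogy above (and still avoiding any real string), the filtering step a training pipeline would need is just a substring check over each scraped document. A minimal sketch, with a made-up placeholder canary:

```python
# Sketch of canary-based filtering in a pretraining data pipeline.
# The string below is a made-up placeholder, not a real canary string.
CANARY_STRINGS = {
    "EXAMPLE-CANARY-FOO123-0000000000000000",
}

def should_exclude(document: str) -> bool:
    """True if the document contains any known canary string verbatim."""
    return any(canary in document for canary in CANARY_STRINGS)

docs = [
    "An ordinary web page with no canary.",
    "Eval data. EXAMPLE-CANARY-FOO123-0000000000000000",
]
kept = [d for d in docs if not should_exclude(d)]
```

The scheme only works if labs actually run a check like this over their scrapes, and if the text is ingested verbatim (re-encoding or truncation that mangles the string defeats it).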