Regarding Claude Mythos’ CoTs being accidentally trained on: I think the biggest problem here is that Anthropic’s internal procedures were shoddy enough that this “technical error” was allowed to happen, and then went unnoticed until the model was already trained.
Regardless of the extent to which it’s justified, Anthropic sure seems to believe that CoT monitoring and faithfulness is one of the main pillars of ensuring AI alignment. Now it turns out that their training pipelines were consistently sabotaging that pillar. If this mistake was allowed to happen, how many other mistakes of the same magnitude are their procedures riddled with? How many more such mistakes will they make in the future? How many of them will be present, uncaught, in the training run that produces their god?
The appropriate response to realizing you made a mistake like this is to be stricken with so much mortal terror that you overhaul your entire R&D pipeline until it’s structurally impossible for anything in this reference class to ever happen again.
Is there any indication Anthropic is doing that? I haven’t seen all Twitter discussions, and I suppose they may not want to be public about it… But vibes-wise, it doesn’t seem that they’re appropriately horrified.
And if not, I argue they’re not taking any of this seriously. None of this fancy “AI alignment” crap is going to matter if your ineptitude lives at the level of “can’t even implement your own plan correctly”. It’s just about the same as “whoops, I accidentally put a ‘-’ in front of my AI’s utility function”.
It’s worth noting that Anthropic had a similar (though smaller?) issue with Opus 4 (based on the Opus 4 Risk Report):
(Also, this may not have been addressed without METR doing some probing in this area.)
Edit: Oh, I think I was also thinking about this issue with Opus 4.6:
(This is a separate issue that occurred at the same time as the issue causing training against CoT on 8% of RL. I think this is a more central example than the one I gave above because this was clearly a bug.)
You might have hoped these issues would suffice for them to implement a process that would reliably catch/prevent this sort of issue. (I don’t think this would be very difficult.) I’m moderately hopeful they will implement this sort of process.
I think they should be very embarrassed by messing this up again. Also, I think we should update down on their competence and adequacy, and update further in the direction of AI development being a rushed shit show by default.
Anthropic sure seems to believe that CoT monitoring and faithfulness is one of the main pillars of ensuring AI alignment
I don’t think this is an accurate description of Anthropic’s institutional stance. (I think they’re much less excited about CoT monitoring and faithfulness than this implies.) But some people at Anthropic do believe this, and I hope those people are taking this incident very seriously. I agree people at Anthropic in general probably should be more embarrassed/horrified about this incident than they appear to be. And I hope they do (or have done) a good postmortem...
Separately, I think your comment gives off a soldier mindset vibe that seems somewhat unproductive and I agree with 1a3orn that “I’m not sure extreme emotions are an important part of an effective postmortem process.” It seems like your comment probably isn’t well targeted to cause Anthropic to do a better job on this in the future (rather than just making them defensive). TBC, that doesn’t seem to be your objective and Anthropic isn’t your target audience, which is fair enough.
If Sonnet 4.5 and Haiku 4.5 [edit: and Opus 4.5] were the only major Anthropic reasoning models that didn’t undergo CoT optimization during RL, that makes them kind of an accidental in-the-wild experiment. I wonder what could be learned by comparing their CoTs to those of their successors and predecessors.
It is striking, for instance, that these models had higher verbalized eval awareness rates in automated behavioral audits than other Claude models. (Though obviously it’s not a controlled experiment and I’m not sure how you’d test that this was the cause.)
I wonder if their CoTs are less legible?
My guess is that CoT spillover/leakage has been a problem in all the Anthropic models and I don’t think the training on CoT before Sonnet 4.5 (and Opus 4.5) is a more important factor than this. Separately, I’d guess there is a bunch of transfer from earlier models if you init on their reasoning traces. So, my guess is we’ve just never had Ant models that aren’t effectively significantly trained on the CoT?
TBC, that doesn’t seem to be your objective and Anthropic isn’t your target audience, which is fair enough.
Yep: I don’t expect Anthropic’s course on this to be significantly swayable by random public comments, or really by anything short of government regulations, investor pressure, or a major AI-caused disaster. Public arguments may convince them to take this sort of stuff incrementally more seriously, but I don’t think “incrementally” would cut it here. This is my update on Anthropic, not an attempt to get Anthropic to update.
I think your comment gives off a soldier mindset vibe that seems somewhat unproductive
Fair enough, going off of your and @1a3orn and @Seth Herd’s comments, I suppose I did phrase things in a manner that is somewhat more visceral than necessary.
I’m not sure extreme emotions are an important part of an effective postmortem process.
They are, inasmuch as: (1) “emotions” are variables adjusting your decision-making policy in specific ways, and (2) specific important ways of adjusting one’s decision-making policy are implemented via emotions in most psychologically normal humans.
Like, sure, you don’t need to be terrified to reap the benefits of terror, and I was ultimately using “being mortally terrified” as a shorthand for “entering a decision-making mode where they’re much more willing to consider drastic and costly adjustments to their current processes due to assigning extremely negative value to repeating this mistake”. But last I checked, most Anthropic employees were still psychologically normal humans, so I don’t think the use of the shorthand is erroneous.
I would also be happier if there was a little more recognition of how big an error that was, and how that can’t be allowed to happen at game time.
But “not taking any of this seriously” seems uncharitable to the point of being fighting words.
I don’t think that’s how we win. Infighting is a known failure mode in situations like this.
If you model fighting with Anthropic as “infighting”, we have very different ideas of what people and organizations are acceptable to associate with. Anthropic is doing an extraordinarily evil thing by trying to create a superintelligence. To the degree that there are “sides” anywhere, they are approximately maximally not on my side.
i did not have the impression that anthropic believed CoT faithfulness was as important for alignment as, say, openai believes? anthropic doesn’t even hide the chain of thought from their operators
i also have the basic impression that the degree to which the training signal is causally downstream of the content of past CoTs is barely increased at all by this mistake. if we wanted CoT to actually be faithful, they would need to never be read by anybody who has any kind of influence over the training signal whatsoever. total causal quarantine, on the same level as quantum computers.
like… if a mythos snapshot wrote into its chain-of-thought that it was considering attempting exfiltration, and an alignment researcher saw this, you can bet that alignment researcher is going to make choices about future training signals that were influenced by what they read. that’s pretty much “training on chain of thought” right there, just laundered through the minds that make up the reinforcement learning policy. the tidbits i’ve heard from researchers, and the impression i’ve gotten from their publications, is that they consider CoT faithfulness desirable but not imperative. if anyone can correct me, please do.