JohnWittle comments on Thane Ruthenis’s Shortform

JohnWittle 12 Apr 2026 21:46 UTC
1 point
i did not have the impression that anthropic believed CoT faithfulness was as important for alignment as, say, openai does? anthropic doesn’t even hide the chain of thought from their operators

i also have the basic impression that the degree to which the training signal is causally downstream of the content of past CoTs is barely increased at all by this mistake. if we wanted CoTs to actually be faithful, they would need to never be read by anybody who has any kind of influence over the training signal whatsoever. total causal quarantine, on the same level as quantum computers.
like… if a mythos snapshot wrote into its chain-of-thought that it was considering attempting exfiltration, and an alignment researcher saw this, you can bet that alignment researcher is going to make choices about future training signals that are influenced by what they read. that’s pretty much “training on chain of thought” right there, just laundered through the minds that make up the reinforcement learning policy. the tidbits i’ve heard from researchers, and the impression i’ve gotten from their publications, both suggest that they consider CoT faithfulness desirable but not imperative. if anyone can correct me, please do.