Some excellent points (and I enjoyed the neat self-referentialism).
My headline take is that I agree with you: CoT unfaithfulness, as Turpin and Lanham have operationalised it, is unlikely to pose a problem for the alignment of LLM-based systems.
I think this for the same reasons you state:
1. Unfaithfulness is primarily a function of the training distribution, only appears in particular contexts, and might potentially be avoided by simply asking the LLM to explain its rationale;
and
2. The o1 paradigm of RLing the CoT seems likely to remove unfaithfulness from o1's CoTs.
The way that these CoTs work seems to make the model systematically hyper-attentive to different facets of its context window, in a way that is verbalised in detail (and is typically triggered by a literal “Hmmmm...” that draws its attention to an unconsidered detail). It strikes me that this would rule out the possibility that the model still pays attention to a bias without saying so: “But wait… The user implied that option A was probably correct”. This is partly an empirical question—since we can’t see the o1 CoTs, in a pipe-dream world I’d love to see OpenAI do and publish research on whether this is true.
This suggests to me that o1's training might already have succeeded at giving us what we’d want: an LLM that does, in fact, just say how it made its decision. (It remains an open question whether simply prompting normal LLMs to explain their rationale would also work).
The only part of the Turpin paper that remains potentially worrying to me is the (actually unsurprising) demonstrated capacity of an LLM to fabricate spurious reasoning (“shooting outside the eighteen is not a common phrase in soccer”) in order to support a particular decision.
You can imagine all sorts of innocuous contexts that might incentivise an LLM to do this kind of thing. This might present some difficulties for oversight regimes—which is why I’d be interested in seeing whether something like o1 is capable of undertaking an entire complex, malicious action front-to-back whilst managing to obscure the actual nature of that action (such that an automated LLM judge fails to detect that the action is in fact malicious).
One thing I don’t understand about your claims here:
Isn’t this literally what the EU Code of Practice requires though?
E.g. Measure 4.2 of the Code of Practice requires that companies “will only proceed with the development, the making available on the market, and/or the use of the model, if the systemic risks stemming from the model are determined to be acceptable”.
Note that the literal way you “determine if systemic risks are acceptable” is to use RSP methodology. Just because it’s phrased in EU-speak doesn’t mean it’s not regulation. Every major developer has said they’ll comply with the Code (even xAI!).
Similarly, SB-53 and the RAISE Act also require developers to have safety frameworks and to follow them. Anthropic is the first company to change its RSP such that “following it” no longer means binding itself to its commitments.
What’s the delta between this regulation and the kind of regulation that would have prevented you from dropping your commitments?
(To be clear, I’m not convinced any of this is a mistake on Anthropic’s part, I just don’t understand the claim about regulation).