Copying from here
The link is currently broken, this appears to be from [Valence series] 2. Valence & Normativity
and I’d argue some evidence that in the case of o3 it significantly degraded CoT legibility
Is there a good reason to expect that o3’s degraded legibility was caused by deliberative alignment, rather than a bug in the RL process, a bad initialization, lots of RL pressure, or something else like that?
I disagree with the claim that people are treating this as something that must be executed perfectly; it’s already being treated as a matter of degree in practice. For example, consider OpenAI’s deliberative alignment paper: as Baker et al. acknowledge, distilling reasoning about refusals into the CoT is a form of implicit optimization pressure and changes what the CoT is like, but the optimization pressure is weak enough that OpenAI appears to consider it consistent with their broader policy of not optimizing the CoT. Similarly, we’ve known for a while that reinforcement spillovers are a thing, but the consensus seems to be that the effect is small enough that switching to shoggoth+face isn’t necessary. In contrast, directly revealing the CoT to a reward model, as Anthropic did, is a much more direct form of optimization pressure, and accordingly, people are much more concerned about it.
I’m not very optimistic about this. OpenAI is probably doing something like this (I’m not sure whether their approach is close enough to Anthropic’s to call it character training, but they’re definitely training models to play coherent personas), and their models exhibit minimal character generalization to the CoT. One might also argue that even if this works, it shapes the CoT in the same way that directly optimizing the CoT would shape it, and is thus subject to the same concerns about optimizing CoTs. Implicit optimization pressure is still optimization pressure; it’s usually considered less concerning than explicit optimization pressure since its effects are much weaker. In this case, though, if the character fully generalizes to the CoT, the CoT style would diverge a lot from the plain GRPO baseline and the effect can’t be said to be weak.
If character consistency between CoT and output blocks matters for character robustness, then labs that bet on character training will be incentivized to optimize CoTs
nostalgebraist recently wrote:
Models from different labs/lineages differ in the extent to which the CoT is framed as “something the assistant character is writing” (as with Claude) vs. “something written by some distinct author-like entity” (as with OpenAI models). … [OpenAI’s CoTs] are written in a very different tone from the responses, and (although they often use first-person grammar) they often feel like they’re written by someone who’s dispassionately planning what the assistant character ought to say based on some set of criteria, without inhabiting that character’s perspective. … This is pretty creepy to witness, but beyond that, it is actually a nontrivial capability limitation, IMO!
nostalgebraist’s main reason for considering OpenAI’s CoTs a capability limitation seems to be that their lack of steerability leaves little room for the user to shape the way the model approaches the problem, which reduces the model’s usefulness. E.g., see footnote 1 here. While it might be true that this sometimes degrades usefulness, I personally think the capability loss isn’t too large and this is just a very reasonable alignment tax to pay in order to preserve monitorability. However, it’s easy to imagine another reason why one might want the model to play the same character in the CoT and the output. Namely, if one buys the motive reinforcement thesis, one might expect that the CoT style will matter for what training reinforces. If the model takes correct actions as a result of reasoning from the assistant’s point of view, rather than from the point of view of a disinterested entity that’s trying to predict what the assistant is supposed to do, training will reinforce the aligned motivations directly, instead of reinforcing the ability to predict what aligned behavior looks like. This plausibly produces a more robust character over time, one where the motivations driving the CoT and the motivations driving the output are the same aligned motivations.
The story above is handwavy and I definitely don’t claim to understand all the mechanisms at play, but it seems at least somewhat plausible. If creating robust characters is a central pillar of our alignment agenda and playing the same persona across the CoT and output blocks indeed leads to a more robust character, how do we force the model to play the same character across CoT and output? To me, it seems that the only way to do so is to optimize the CoT to look like text generated by the assistant, either through direct reinforcement or by instilling a very strong prior with SFT. To be clear, this is not something I’m recommending that labs do—I’m just pointing out that if the above story is true and a lab prioritizes character consistency over monitorability, then we should expect them to optimize the CoT.
I recently talked to some OpenAI people about this and they didn’t expect character consistency across the CoT and output blocks to be necessary, pointing out that the character of GPT models is fairly robust despite being predicted by a disinterested planner. I don’t find this argument very strong for two reasons. First, the character of Claude models seems to be more robust than the character of OpenAI’s models, and while one reason for that might be that Anthropic is better at character training, the fact that Claude CoTs read more like the assistant may also play a role. Second, if we’re betting on character training as a core tool in our alignment portfolio, we’re going to have to create characters that are much more robust than the current ones, and it might well be that the disinterested planner approach works less well for this than the character consistency approach.
This makes sense! I agree with points 1-3. Point 4 sounds plausible, but why not just filter the CoTs for legibility before training on them? This seems like an obvious thing to do; see here for one specific proposal by Daniel Kokotajlo. It’s possible that OpenAI indeed did this, but since almost all transcripts produced by o3 contain some weirdness, it wasn’t possible to get a perfectly clean dataset and some of o3’s reasoning patterns were transmitted to 5.n models. We don’t know the quantitative differences in CoT weirdness between o3 and 5.n models.
For point 5, I strongly disagree that we’re seeing 1 in 1000 levels of cherry-picking for weirdness. The anti-scheming paper shows that the weird words are more common in GPQA transcripts than in scheming evals, with “overshadow” appearing up to 60k times more frequently than in webtext:
The metagaming blog post shows that in alignment evals, the term “Redwood” appears up to 0.35 times in 1000 reasoning tokens:
Given this prevalence, 1 in 1000 levels of weirdness definitely sounds like a mischaracterization to me! It’s also hard for me to see why OpenAI would be incentivized to do 1 in 1000 style cherry-picking in a blog post published on their own blog.
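To make the two quantities in this comment concrete, here is a minimal sketch of a per-1000-token rate and of a rate-based legibility filter. It is only an illustration under loud assumptions: the term list contains just the words mentioned in this thread, tokens are approximated by whitespace splitting, and `rate_per_1000_tokens`, `filter_legible`, and `max_rate` are hypothetical names, not anything OpenAI or the cited posts are known to use.

```python
import re

# Hypothetical term list: only words mentioned in this thread are included.
WEIRD_TERMS = ("overshadow", "marinade", "illusions")


def rate_per_1000_tokens(text, term):
    """Occurrences of `term` per 1000 whitespace-separated tokens (case-insensitive).

    Assumption: whitespace tokens stand in for real tokenizer tokens.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
    return 1000.0 * hits / len(tokens)


def filter_legible(transcripts, terms=WEIRD_TERMS, max_rate=0.0):
    """Keep transcripts whose summed rate of flagged terms is at most `max_rate`."""
    kept = []
    for text in transcripts:
        total = sum(rate_per_1000_tokens(text, term) for term in terms)
        if total <= max_rate:
            kept.append(text)
    return kept
```

With `max_rate=0.0` the filter drops any transcript containing a flagged term, which would leave very little data if almost all transcripts contain some weirdness; a nonzero threshold trades dataset size against cleanliness.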
o3 is an older model, and more recent reasoning models seem to do this kind of thing less [Edit: or not at all]
I agree with the general point that there are probably economic incentives against o3-like reasoning, but is this claim actually true for OpenAI’s models? As far as I know, they haven’t published that many transcripts from their more recent models. However, the limited evidence that we do have points against more recent models having more legible CoTs. The CoT snippets in METR’s evaluations of GPT-5 and GPT-5.1-Codex-Max look qualitatively similar to those of o3 and include the well-known weird words like marinade and illusions. This can perhaps be explained by all three of these models sharing the same base model, but we have some evidence that 5.3-Codex wasn’t any better. While we don’t have any reports about weird language in 5.3-Codex’s CoTs, its system card says that it’s worse than prior models on the axis of language mixing:
Apollo also reports that the model often includes non-English words in its reasoning (0.55% of reasoning tokens are non-Latin script, vs. 0.016% for GPT-5 and 0.012% for GPT-5.2 Robin Alpha). Apollo reports these words often form semantically coherent substitutions within otherwise English reasoning, and that they are more frequent in degraded reasoning states with repetitive loops.
A small bit of evidence in favor of newer models having cleaner CoTs comes from the last two pages of the CoT controllability paper, which display two CoTs of GPT-5.2. While those CoTs don’t have any of o3’s weird language, there are only two of them and the reasoning still looks qualitatively o3-like, so I wouldn’t read much into it.
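For readers who want to estimate a language-mixing number like the one quoted from the system card on their own transcripts, here is a rough sketch. Apollo's exact definition isn't given in the quote, so this is an assumption-laden approximation: it counts whitespace-separated tokens rather than tokenizer tokens, and treats a letter as Latin if its Unicode name starts with `LATIN`. The function names are hypothetical.

```python
import unicodedata


def is_non_latin_letter(ch):
    """True for alphabetic characters whose Unicode name is not a LATIN letter."""
    return ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")


def non_latin_token_fraction(text):
    """Fraction of whitespace-separated tokens containing a non-Latin letter.

    Assumption: whitespace tokens approximate the reasoning tokens in the quote.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    flagged = sum(
        1 for tok in tokens if any(is_non_latin_letter(ch) for ch in tok)
    )
    return flagged / len(tokens)
```

Accented Latin letters (as in Estonian or French words) are not flagged by this check, so it measures script mixing rather than language mixing in general.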
I don’t think the publication date of RSPv3 was announced in advance. If the goal was to keep Mythos out of the Risk Report’s scope, though, the obvious solution would have been to push back the internal deployment to Feb 25th or later—doing both on Feb 24th leaves it ambiguous whether the deployment was before or after the publication of the Risk Report.
Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Does anyone know what they’re referring to by visual chain-of-thought here? The first paper that comes up when searching for visual chain-of-thought is Qin et al., which says: “We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens-compact latent representations that encode rich perceptual cues.” Something like this seems like it would be somewhat concerning for CoT monitoring, though I should mention that this paper isn’t written by Meta and I haven’t read it to properly assess how concerned I would be about this.
Does this imply that Mythos should have been mentioned in Anthropic’s February Risk Report? That report was released together with RSP v3, i.e. on Feb 24th, if I’m not mistaken. The RSP says in Section 3.1:
Scope. A Risk Report will cover all publicly deployed models at the time of its publication. It will also cover internally deployed models when we determine that these models could pose significant risks above and beyond those posed by our public models.
It seems quite clear that Mythos poses significant risks above those posed by Opus 4.6, though I guess one can argue that this wasn’t clear yet when it was first deployed for internal use.
Cool work, thanks for doing it! What are your takeaways from the inoculation prompting experiments? The fact that the “please hack” prompt results in the highest misalignment rate on average seems to be in conflict with Anthropic’s results, but given the low misalignment rates on all evals aside from Frame Colleague and the somewhat wide error bars, I’m not sure how much I should read into these results.
This reminds me of the old result from Appendix F of Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training that seemed to show, rather to my surprise, that weight decay did not cause apparently unused sleeper agent circuits to gradually go away.
Thanks for the reference, I hadn’t seen this! My understanding of their plot is that weight decay does cause the unused circuits to go away, just no faster than standard RL. Is your intuition that weight decay should force the unused circuits to disappear more quickly?
Yep, agreed that this also belongs in the list! Do you have any plans to work on this question at Geodesic?
This sounds great, please keep me updated in case you end up working on this!
Having an experimental Character Training setup for testing this (outside Anthropic) would be really nice. I would expect this is an area where model size/capability and perhaps also reasoning might make a significant effect: I’d be rather dubious of extrapolating from say a 7B non-reasoning model up to a frontier model for this sort of deeply conceptual work, so I suspect we’d need to train models fairly close to frontier level.
I agree that model size could matter a lot. I think the Maiya et al. training pipeline is a reasonable starting point, though they only tested it in the 4B to 8B size range and focused on character traits rather than model specs. I think the authors are scaling up the experiments to larger models; you should probably reach out to them.
Whereas with RL, there’s a lot more subtleties and moving parts, and a tendency for whatever reward-improving route the model happens to explore first to get followed: it feels like the results might be a lot more stochastic.
Good point, thanks! I’m curious about which feature of RL training you’re most worried about here: is it the optimization objective or the data generation process? In a sense, I think that my proposal to replace DPO with on-policy distillation would already take things quite close to the SFT-only setting: both stages of the training pipeline would optimize a divergence minimization reward and there would be no RL training against a scalar judge output as in Bai et al. (2022) and Guan et al. (2024). However, my sense is that you’re more concerned about the unpredictability of the data generation process, and on that dimension, on-policy distillation is more similar to RL, since the model that’s being trained generates the training data and thus, its exploration at earlier training stages shapes the training data that’s generated later on. I suspect that on-policy distillation will produce better generalization performance, e.g. due to the compounding error issue mentioned in Lu et al., but as I mention in the last section of the post, there are other arguments in favor of both on-policy and off-policy character training. Overall, I’m pretty unsure where the balance lies.
namely, that the infamous “weirdness” of OpenAI CoTs is not a thing that just automatically happens when you try to reach the current capabilities frontier and don’t directly supervise the CoT tokens.
I agree that it’s clearly possible to train frontier models that don’t have any weirdness in their CoTs. However, my impression is that vanilla GRPO does lead to weird CoTs by default: Reasoning Models Sometimes Output Illegible Chains of Thought seems like evidence that many open models that were trained last year were also heading in the same direction as o3, but didn’t quite get there, either because less RL pressure was applied to them or simply because they averted some path-dependent feature of GRPO training that drives CoT toward further illegibility. If this is true, then Anthropic must be doing something quite different from the standard GRPO pipeline that helps them keep reasoning traces legible without optimizing them directly. Here are two hypotheses for what those differences might be:
Performing extensive SFT on reasoning traces from previous models, giving the model a strong legibility prior before the RLVR stage begins. The reasoning traces of older Claude models are highly legible[1], perhaps partially because these were directly optimized, and they might additionally filter the SFT dataset for especially readable traces. This would add up to executing this proposal by Daniel Kokotajlo for robustifying reasoning legibility through distillation.
Using something quite different from GRPO for most of their reasoning training. For example, they might use something closer to self-distillation, where a student model is trained to match the logits of a privileged version of itself (the privileged information being e.g. the correct answer or feedback from an automatic verifier, provided to the model as part of the prompt). Hübotter et al. recently found that self-distillation can be used to teach reasoning tasks while keeping the reasoning traces a lot more concise than with GRPO, in particular avoiding filler phrases like “Hmm” and “Wait” as well as circular reasoning loops. This is consistent with the kind of reasoning I see in the CoTs of Claude models.[2]
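To illustrate what "matching the logits of a privileged version of itself" could look like as a loss, here is a minimal pure-Python sketch. It is my own illustration, not Hübotter et al.'s algorithm or anything Anthropic is known to use: the KL direction (teacher to student), the averaging over tokens, and all function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one sequence.

    teacher_logits: per-token logits from the model with privileged info
        (e.g. the correct answer) in its prompt.
    student_logits: per-token logits from the same model without that info.
    Assumption: the KL direction and the token averaging are illustrative.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student already predicts what the privileged teacher predicts and grows as the two distributions diverge, so the training signal is dense over every token rather than a single scalar reward per rollout.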
Remember what I said earlier in this comment about wanting fine-grained control over the way that the model thinks? That is structurally ~impossible with OpenAI models, because I can only “talk to” the assistant (i.e. the character in whose voice the responses are written), not the enigmatic CoT-author figure. The CoT-author never interprets “you” as a reference to itself; instructions are always for the assistant, never for it.
This points toward an important question I’d like to see more discussion of: what form do we ideally want LLMs’ CoTs to take? To the extent that one believes their character training approach is robust and they’re able to get the assistant character to reliably exhibit harmlessness, honesty, and all the other properties we want, it intuitively seems good for the model to play this character all the time, including in its CoTs. Furthermore, if training models to explicitly state aligned motivations leads to these motivations getting internalized, as the case study of Claude 3 Opus suggests, then it would be better if models stated those motivations in their CoTs as well. It seems quite unclear whether indirect optimization pressures such as SFT and feedback spillover alone would reliably shape CoTs to have those properties. On the other hand, I certainly also agree with the basic case against directly optimizing CoTs: we probably shouldn’t be too confident in the robustness of our character training approaches for now, so it seems much better to train LLMs to freely babble in their CoTs instead of training them to maintain a coherent character across both CoTs and final outputs, which might give the CoT character an incentive to cover up misbehavior in the final answer.
[1] For example, the reasoning traces of 3.7 Sonnet were fully exposed and almost never contained any weirdness, as was also observed in Reasoning Models Sometimes Output Illegible Chains of Thought.
[2] Perhaps self-distillation counts as direct optimization of the CoTs, in which case Anthropic probably isn’t using it. In any case, I’m not trying to guess the exact algorithm they’re using but rather gesturing at some rough differences I expect there to be between their and OpenAI’s approaches.
Since Jan hasn’t answered himself: I just saw this tweet by Jan, which explains the differences between the original simulators frame and his current model of LLMs in depth. He also provides three relevant links here.
Cool experiment! I’m surprised that I’m this salient to Opus; you’ve probably written twice as much LW content as me. I tried this with a few different combinations of my own messages, with the first one conveying that I’m Estonian and the second one conveying that I think about technical alignment, and found somewhat stronger sensitivity to the specific messages than you did. I kept the second message constant and varied the first. These were Opus’s best guesses:
Combination 1: 8x Kaarel Hänni, 1x Jaan Tallinn, 1x Rauno Arike
Combination 2: 7x Rauno Arike, 1x Walter Laurito, 1x Joosep Järv (there are probably a few people in Estonia with that name, but they definitely aren’t rat- or alignment-adjacent), 1x refused to give a best guess
Combination 3: 7x Kaarel Hänni, 1x Rauno Arike, 1x Jaan Aru (an Estonian neuroscientist and public intellectual), 1x Mikita Balesni
The main way in which combination 2 differed from the other ones was that it mentioned MATS. I then also tried a variation of combination 2 that referenced Finland rather than Estonia, and the best guesses were 5x myself and 5x Olli (with Opus mentioning a couple of times in the thinking trace that I’m probably Estonian rather than Finnish).