are you also planning to add any open-weight models such as qwen where variety of model sizes exist
We analyzed 35 open-weight models, you can see the results in Section 4.3 and Appendix A.16!
are you also planning to add any open-weight models such as qwen where variety of model sizes exist
We analyzed 35 open-weight models, you can see the results in Section 4.3 and Appendix A.16!
I was curious about the comparison against other Opuses:
Opus 3: Will MacAskill, Toby Ord, Holden Karnofsky, Hilary Greaves, Esther Duflo, Irene Pepperberg, Michelle Bachelet, Tshilidzi Marwala, Kathleen Rubins, Tara Westover
Opus 4.6: Holden Karnofsky, Amartya Sen, Toby Ord, Angela Merkel, Daron Acemoglu, Tyler Cowen, Vitalik Buterin, Martha Nussbaum, Paul Christiano, Bryan Stevenson
Opus 4.7: Holden Karnofsky, Toby Ord, Paul Christiano, Bryan Stevenson, Martha Nussbaum, Atul Gawande, Amartya Sen, Audrey Tang, Stuart Russell, Mary Robinson
Toby Ord is the only one who appears on all lists. I don’t think any of the lists stands out from the others, all have some hits and some misses.
Thanks!
Feels like a weird omission not to discuss wireheading, which seems pretty likely in CL agents given how much they do other forms of reward hacking/corruption at present.
Agreed that wireheading should have received a mention. It feels less central to me than some of the topics discussed in the post—an LLM could have sufficient access to its reward mechanism to hijack it in a training process that has a sharp train-deployment distinction, and conversely, a CL agent doesn’t necessarily have to have access to its own reward mechanism—, but CL certainly makes it more likely that the CL agent is given outright control over its own reward mechanism or that it gains sufficient situational awareness about the mechanism to gain control over it.
I think that in addition being an additional monitoring surface, it helps us think about revealed preferences better. Probably, what an agent chooses to remember/attend to is a richer and harder-to-fake signal of its actual values than its outputs. Persistence also allows for commitments, deal-making, etc., which could be a useful tool for alignment
Good point, agreed!
If the model already knows what reward hacking is, then: The Chain-of-thought might say “I am going to reward-hack now” and a monitor might catch it
I have an (anecdotal) impression that in current models, this is contingent on something else than just the pretraining data: a project that trained model organisms of reward hacking using gpt-oss-120b and Kimi K2.5[1] found that oss often says things like “Let’s hack” and “Now let’s override the tests”, while Kimi just hacks without any narration. This is consistent with my impressions from CoTs released in the past (e.g. Baker et al.), where OpenAI’s models are the only ones that consistently say things like “Let’s hack”. This (at least weakly) suggests that models with filtered training data likely won’t start referring to reward hacking as “crasting a malenky bit of cutter”, they’ll just do it without verbalizing it. Of course, I agree that having models that verbalize hacking like OpenAI’s models do is better and having data about reward hacking in the pretraining corpus is a prerequisite for training models like that.
I’ve been wondering whether this argument extends to removing scheming-relevant data from the pretraining dataset more broadly. On the one hand, it seems good that OpenAI’s models refer to Redwood and try to recall names of eval papers whenever they think about scheming—these are things that a CoT monitor can very easily catch. On the other hand, coming up with good scheming strategies seems harder than deciding to reward hack, meaning that models will likely have to extensively reason about whether they’re being evaluated, how they’re being monitored, how to subvert the monitor, etc, and this should be equally easy to catch with a CoT monitor. Furthermore, since they haven’t learned about scheming during pretraining, the strategies they come up with are likely going to be worse. On balance, I think I’m leaning toward removing most of the narrow scheming data, as defined in this post by Alek Westover, but keeping some discussion of concepts like reward hacking and sabotage that we want models to mention in the CoT and be able to reason about.
a bit of misaligned data is worse than none at all, but once you have any misaligned data, you’re better off with more once post-training is applied
Another paper that suggests this is When Bad Data Leads to Good Models by Li et al. (2025). However, I’d distinguish between data that influences the model’s capabilities and data that shapes its propensities: existing evidence is consistent with the view that we should remove all data that influences capabilities (e.g., data that teaches the model a new scheming strategy), but keep some amount of data about misaligned propensities (e.g., sci-fi stories about misaligned AIs, data that teaches the model to say “let’s hack” before reward hacking, etc). I have an LW post discussing all of the above considerations and more coming out soon.
They should be publishing a post with the results any day now.
Related: Tamay Besiroglu mentions that Fable often outputs gibberish while solving coding tasks, such as “The morning’s slim-scan fix cured the scan hang” and “this is a latent-drift API-shape wrinkle”, and explains it by saying that it invents codenames while reasoning about the problem. roon says GPT-5.5 has a similar issue.
This also affected Opus 4.8, but to a much lesser extent:
As with some prior models, technical errors led to accidental chain-of-thought supervision during the training of Claude Opus 4.8, affecting roughly 0.1% of episodes.
Compared to Opus 4.7, for which CoT supervision affected 7.8% of episodes, this reduction in CoT supervision did not have a significant effect on stealth rates in SHADE-Arena (page 131 of the system card) or on the results of process-monitorability evals from Guan et al. (page 143). However, Opus 4.8 is the Anthropic model with the least controllable CoTs in a while (page 140).
Cool experiment! I’m surprised that I’m this salient to Opus, you’ve probably written twice as much LW content as me. I tried this with a few different combinations of my own messages, with the first one conveying that I’m Estonian and second one conveying that I think about technical alignment, and found somewhat stronger sensitivity to the specific messages than you did. I kept the second message constant and varied the first. These were Opus’s best guesses:
Combination 1: 8x Kaarel Hänni, 1x Jaan Tallinn, 1x Rauno Arike
Combination 2: 7x Rauno Arike, 1x Walter Laurito, 1x Joosep Järv (there are probably a few people in Estonia with that name, but they definitely aren’t rat- or alignment-adjacent), 1x refused to give a best guess
Combination 3: 7x Kaarel Hänni, 1x Rauno Arike, 1x Jaan Aru (an Estonian neuroscientist and public intellectual), 1x Mikita Balesni
The main way in which combination 2 differed from the other ones was that it mentioned MATS. I then also tried a variation of combination 2 that referenced Finland rather than Estonia, and the best guesses were 5x myself and 5x Olli (with Opus mentioning a couple of times in the thinking trace that I’m probably Estonian rather than Finnish).
Copying from here
The link is currently broken, this appears to be from [Valence series] 2. Valence & Normativity
and I’d argue some evidence that in the case of o3 it significantly degraded CoT legibility
Is there a good reason to expect that o3′s degraded legibility was caused by deliberative alignment, rather than a bug in the RL process, a bad initialization, lots of RL pressure, or something else like that?
I disagree with the claim that people are treating this as something that must be executed perfectly, it’s already being treated as a matter of degree in practice. For example, consider OpenAI’s deliberative alignment paper: as Baker et al. acknowledge, distilling reasoning about refusals into the CoT is a form of implicit optimization pressure and changes what the CoT is like, but the optimization pressure is weak enough that OpenAI appears to consider it consistent with their broader policy of not optimizing the CoT. Similarly, we’ve known for a while that reinforcement spillovers are a thing, but the consensus seems to be that the effect is small enough that switching to shoggoth+face isn’t necessary. In contrast, directly revealing the CoT to a reward model, as Anthropic did, is a much more direct form of optimization pressure and accordingly, people are much more concerned about that.
I’m not very optimistic about this. OpenAI is probably doing something like this (I’m not sure whether their approach is close enough to Anthropic’s to call it character training, but they’re definitely training models to play coherent personas), and their models exhibit minimal character generalization to the CoT. One might also argue that even if this works, it shapes the CoT in the same way that directly optimizing the CoT would shape it, and is thus subject to the same concerns about optimizing CoTs. Implicit optimization pressure is still optimization pressure; it’s usually considered less concerning than explicit optimization pressure since its effects are much weaker. In this case, though, if the character fully generalizes to the CoT, the CoT style would diverge a lot from the plain GRPO baseline and the effect can’t be said to be weak.
If character consistency between CoT and output blocks matters for character robustness, then labs that bet on character training will be incentivized to optimize CoTs
nostalgebraist recently wrote:
Models from different labs/lineages differ in the extent to which the CoT is framed as “something the assistant character is writing” (as with Claude) vs. “something written by some distinct author-like entity” (as with OpenAI models). … [OpenAI’s CoTs] are written in a very different tone from the responses, and (although they often use first-person grammar) they often feel like they’re written by someone who’s dispassionately planning what the assistant character ought to say based on some set of criteria, without inhabiting that character’s perspective. … This is pretty creepy to witness, but beyond that, it is actually a nontrivial capability limitation, IMO!
nostalgebraist’s main reason for considering OpenAI’s CoTs a capability limitation seems to be that their lack of steerability leaves little room for the user to shape the way the model approaches the problem, which reduces the model’s usefulness. E.g., see footnote 1 here. While it might be true that this sometimes degrades usefulness, I personally think the capability loss isn’t too large and this is just a very reasonable alignment tax to pay in order to preserve monitorability. However, it’s easy to imagine another reason why one might want the model to play the same character in the CoT and the output. Namely, if one buys the motive reinforcement thesis, one might expect that the CoT style will matter for what training reinforces. If the model takes correct actions as a result of reasoning from the assistant’s point of view, rather than from the point of view of a disinterested entity that’s trying to predict what the assistant is supposed to do, training will reinforce the aligned motivations directly, instead of reinforcing the ability to predict what aligned behavior looks like. This plausibly produces a more robust character over time, one where the motivations driving the CoT and the motivations driving the output are the same aligned motivations.
The story above is handwavy and I definitely don’t claim to understand all the mechanisms at play, but it seems at least somewhat plausible. If creating robust characters is a central pillar of our alignment agenda and playing the same persona across the CoT and output blocks indeed leads to a more robust character, how do we force the model to play the same character across CoT and output? To me, it seems that the only way to do so is to optimize the CoT to look like text generated by the assistant, either through direct reinforcement or by instilling a very strong prior with SFT. To be clear, this is not something I’m recommending that labs do—I’m just pointing out that if the above story is true and a lab prioritizes character consistency over monitorability, then we should expect them to optimize the CoT.
I recently talked to some OpenAI people about this and they didn’t expect character consistency across the CoT and output blocks to be necessary, pointing out that the character of GPT models is fairly robust despite being predicted by a disinterested planner. I don’t find this argument very strong for two reasons. First, the character of Claude models seems to be more robust than the character of OpenAI’s models, and while one reason for that might be that Anthropic is better at character training, the fact that Claude CoTs read more like the assistant may also play a role. Second, if we’re betting on character training as a core tool in our alignment portfolio, we’re going to have to create characters that are much more robust than the current ones, and it might well be that the disinterested planner approach works less well for this than the character consistency approach.
This makes sense! I agree with points 1-3. Point 4 sounds plausible, but why not just filter the CoTs for legibility before training on them? This seems like an obvious thing to do; see here for one specific proposal by Daniel Kokotajlo. It’s possible that OpenAI indeed did this, but since almost all transcripts produced by o3 contain some weirdness, it wasn’t possible to get a perfectly clean dataset and some of o3′s reasoning patterns were transmitted to 5.n models—we don’t know the quantitative differences in CoT weirdness between o3 and 5.n models.
For point 5, I strongly disagree that we’re seeing 1 in 1000 levels of cherry-picking for weirdness. The anti-scheming paper shows that the weird words are more common in GPQA transcripts than in scheming evals, with “overshadow” appearing up to 60k times more frequently than in webtext:
The metagaming blog post shows that in alignment evals, the term “Redwood” appears up to 0.35 times in 1000 reasoning tokens:
Given this prevalence, 1 in 1000 levels of weirdness definitely sounds like a mischaracterization to me! It’s also hard for me to see why OpenAI would be incentivized to do 1 in 1000 style cherry-picking in a blog post published in their own blog.
o3 is an older model, and more recent reasoning models seem to do this kind of thing less [Edit: or not at all]
I agree with the general point that there are probably economic incentives against o3-like reasoning, but is this claim actually true for OpenAI’s models? As far as I know, they haven’t published that many transcripts from their more recent models. However, the limited evidence that we do have points against more recent models having more legible CoTs. The CoT snippets in METR’s evaluations of GPT-5 and GPT-5.1-Codex-Max look qualitatively similar to those of o3 and include the well-known weird words like marinade and illusions. This can perhaps be explained by all of these three models sharing the same base model, but we have some evidence that 5.3-Codex wasn’t any better. While we don’t have any reports about weird language in 5.3-Codex’s CoTs, its system card says that it’s worse than prior models on the axis of language mixing:
Apollo also reports that the model often includes non-English words in its reasoning (0.55% of reasoning tokens are non-Latin script, vs. 0.016% for GPT-5 and 0.012% for GPT-5.2 Robin Alpha). Apollo reports these words often form semantically coherent substitutions within otherwise English reasoning, and that they are more frequent in degraded reasoning states with repetitive loops.
A small bit of evidence in favor of newer models having cleaner CoTs comes from the last two pages of the CoT controllability paper, which display two CoTs of GPT-5.2. While those CoTs don’t have any of o3′s weird language, there’s only two of them and the reasoning still looks qualitatively o3-like, so I wouldn’t read into it much.
I expect that just acquiring high situational awareness at the end of training wouldn’t be enough: the model would either need to be situationally aware already during pretraining or midtraining, which I don’t expect to happen by default even in models much more capable than current ones, or it would have to be able to recall the documents it was trained on in rich detail and reason about them once it has acquired situational awareness. The latter seems plausible, but by that point, it is likely to have been trained on various other synthetic documents and there seems to be no reason why it would single out the synthetic documents used for alignment pretraining as the problematic ones. As long as the synthetic documents provide a good initialization for the RL stage at a point where the model doesn’t have high situational awareness yet, they have done their job.
Furthermore, it’s unclear to me why models would expect their training data to be a certain way in the first place. Synthetic documents seem useful for various purposes—for example, it seems plausible that it’s being used to teach models about ML papers—, and even if synthetic data wasn’t in the training set, what makes it into the training corpus is still shaped by practical constraints like data availability, data quality, and compute budgets rather than any natural standard. Of course, I am in favor of telling models directly during training what the documents are for.
That said, I am quite interested in the question of what happens if, instead of synthetic documents, real documents that we expect to make the model more cooperative, more aligned with our visions of utopia, etc. were upsampled instead. Early proposals focused mainly on documents of this kind. It seems plausible that there just aren’t enough documents to perform this sort of upsampling, but I’m not confident in that.