At Apollo Research working on scheming.
Papers:
At Apollo Research working on scheming.
Papers:
I wouldn’t describe this as incoherence
What counts as “smart” eval awareness as opposed to noticing structural features of the situation that suggest a particular type of evaluation? For example (2) often routes through this. In general, if I recall from the recent post on improving tool realism in Petri, for sonnet 4.5 they saw the model was heavily condition on the fact that there was some specific type of conflict in the scenario rather than just realism.
Models aren’t really incentivized (even in the “smart” eval awareness case) to figure out whether they’re in a simulation, and if you’re imagining “smart” eval awareness is eventually motivated by (2) (at least initially) these distinctions seem really messy to me.
I think the distinction for (1) is conceptually clean and useful insofar as you’re thinking about interventions on pretraining or SDF.
You have also been explaining many different legible-CoT observations across many model families by saying: well, they might have applied optimization pressure to their CoT during midtraining or posttraining, we don’t have any way to know. Which is true.
But as you’ve noted, OpenAI did apply such pressure to the production o3 and it actually made it somewhat less legible (to the surprise of both of us).
You’re mixing several notions of legibility and optimization pressure. OpenAI specifically applied SFT as part of Deliberative Alignment during safety training in a way that seems to have increased repeated spandrels.
I said that the degenerate looping behavior looked non-functional
The degenerate repeats themselves yes, but I think you’re empirically wrong that the illegible text itself is non-functional. I don’t know what to do with disagreement on this point as right now it’s emperical results vs your personal private experiments with models.
For example, you have interpreted R1 in both of these two ways: first you said (citing Arun’s paper) that it produced illegible CoT and that this was evidence for your view over mine, and then when I said that R1 CoTs were legible, you said that didn’t matter because countermeasures were applied. So it seems like R1′s legibility doesn’t actually matter one way or another?
Again you’re being technically imprecise. R1-Zero explicitly was noted to have gibberish, repitions, mixed language, etc to the point where they introduced addition SFT warm start and other format and language penalties to get R1. R1 still has some remaining illegibility as shown in Arun’s paper. (Note that this also isn’t an argument about some particular magnitude of effect)
And indeed, it’s not at all obvious to me that “accidentally sending CoT to the RL grader in some % of episodes with no instructions about what to do with it” (what Anthropic did to several recent models) would be a more effective legibility intervention than “doing SFT on CoT at the end” (what OpenAI did to production o3). Since you were surprised by the direction of the effect in o3′s case, I don’t think you should be so sure about the direction in Anthropic’s case either.
Again, you’re confused on this and mixing up several sources of optimization pressure and their effects. Here you’re being quite subtle about it, mixing “me being surprised that their SFT induced increased degredation in the form of repeated loops” and “RL applied directly to the CoT with HHH pressure causing obfuscation“ (which is well established).
I’m willing to just give up on this exchange, since each time I’ve tried to make a pretty detailed series of specific points and I haven’t found you or 1a3orn willing to engage with them, instead continually conflating specific points into some charactured view.
I think it largely doesn’t matter, models can (and do) often come out erring on the side of considering things evals.
I mean I imagine the “subagents don’t coordinate” is a capabilities problem labs have been actively working on for a while now, if you reward “multiagent system gets task graded as complete” you have this exact problem again.
I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I’m not sure how that bears on the question of why some frontier models are more legible than others? “They were supervised for legibility” is one hypothesis, yes, but it might just be that that they started off with a “good” initialization that sent RL on a legible trajectory?
Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those models still went through pre-RL SFT training on traces from the earlier models. If true, this seems like evidence for my “good init” hypothesis from the previous bullet point?1(https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=ohfSasm49PydsMP4i#fn3g60x3f86za) But I’m curious what you make of this.
You mentioned here that Anthropic CoT’s kept providing evidence for your “maybe this is explained in all cases by skill issue theory”. Which I think people should now update against. I think theories which predicted that this was explained by “good init” should be downweighted relative to ones which predicted optimization pressure on the cot. You could argue that we haven’t done a specific carefully controlled study in this exact Anthropic case, but at that point we’re really bending over backwards to favor the good init hypothesis.
I also think there’s preexisting emprical reasons like: https://arxiv.org/abs/2407.13692 (Prover Verifier Games) specifically to induce a monitorability tax with respect to legibility, to not favor the good init skill issue hypothesis.
Without the good init skill issue hypothesis, this also seems like:
I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish
Should again be evidence that this is to some extent “the default” in an unrelated model family.
I’m somewhat skeptical of that paper’s interpretation of the observations it reports, at least for R1 and R1-Zero.
I’d note that here you just seem to be disagreeing with R1’s original paper. I don’t think you’ve ever shown that they were somehow wrong in their R1-Zero observations, and this seems important R1-Zero given it’s exactly the evidence you’d expect to see if you were wrong here.
R1 CoTs are usually totally legible
Yes but this is after they had to do SFT on human reasoning specifically to address illegible or ambiguously degenerate reasoning seen in R1-Zero. I’ve see you and 1a3orn repeatedly cite R1 (in your personal experience) being legible, the when Arun’s paper provided results with R1 and other models producing illegible cot this was then blamed on model providers, then that also proved insufficient based on the response from Arun here: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=5kCjAKY7srmYmQFom which you seem to acknowledge.
This is importantly on top of the fact that none of it addresses why originally in R1-Zero this was a problem.
Working my way through other models you listed:
Gemini, Grok,
Do you know what optimization pressure was applied to their CoT during midtraining or posttraining? If not, I think it’s incorrect that you’ve been updating towards the “good init skill issue” hypothesis based on observations from them, and they should be in “no evidence either way”.
I’m not all that convinced by my own evidence here… but also, like I said, it’s tough to summarize precisely what that evidence is. It’s not just the fact that the models seem confused by these words when they appear as “input” rather than “CoT,”
I still think this is just super invalid for the OpenAI models, these are separate channels and even if they weren’t it’s unclear a priori what to make of this, but your objections here do still seem to weight them heavily in practice.
Note that it wrote out the same exact 14-word phrase 14 times before finally escaping the loop. If the textual content here is serving any computation role at all, this surely must be, at best, a very inefficient way to accomplish whatever it is that it’s accomplishing (?).
or in the comment in this thread:
that’s basically what R1-Zero was, and IIRC even it produces much more legible CoT than o3
A significant update I think both of you should make is that these degenerate repititions in OpenAI models seem almost entirely due to safety training applied after capabilities training. From capabilities training alone, you do not see these extreme repeated spandrels, but you do see signifiant semantic drift over time. This is a new piece of information post original paper (which we also didn’t know at the time of the paper).
I have not seen either of you update on this, as evidence by both in this thread seeming to still argue against the specifical spandrel dense illegibility in the final production o3. My expectation here is that you’re still overindexing on this, and instead of updating that capabilities training via outcome based RL results in significant semantic drift, you’re focusing on your own intuitions based on the later spandrel heavy CoT, which I think is incorrect.
In addition to the quantitative analysis in that post, there’s concrete full transcript CoT examples of “o3 before degerate spandrel spam” if you want to get a qualitative sense here.
there’s a thin line between “useful but opaque” and “degenerate looping” (with frequent switches back and forth between these two modes)… this all feels to me more like “the contingent outcome of a janky / under-constrained training process” than “a glimps at the essential properties of highly capable CoT.”
Again to separate out actual claims here:
the terminology is absolutely useful but opaque for some terms and usage changes over capabilities training, like the meaning, in a non degenerate, non spandrels heavy way, this is just empirically true, and I think it’d be useful to look at the figure here: https://alignment.openai.com/metagaming#ref-11 - the counterclaim would be models don’thave semantic drift under outcome based RL which seems incredibly strong and empirically false
the argument has never been that these are unmitigable (i.e. essentially) properties of highly capable CoT, in fact Arun’s paper mentions paraphrasing illegible sections in some cases recovers performance, the disagreement is about whether you get this by default absent specific countermeasures
The main thing that seems to me to have no empirical evidence is that under outcome based RL, without specific interventions to address it, you will not get significant semantic drift. I see no empirical evidence for this, it doesn’t seem to fit with pre-reasoning model expectations, (far less strongly) individually humans in their own personal notes and human language in general when geographically isolated has serious semantic drift.
Notably “significant semantic drift” is specifically not an argument for the defaultness of (1) repeated degenerate spandrels (2) a “new alien language”, but it does matter, because it harms monitorability for misalignment and human legibility of the CoT.
tldr:
there is not much empirical evidence on what can cause CoT to be more or less illegible
My overall view currently is:
final o3 is particularly degenerate agree
but you do in fact absolutely get loss of human monitorability for misalignment via semantic drift by default with plenty of empirical evidence we are increasingly seeing this
there are probably cheap mitigations you could do without major perf hit if labs were disciplined about this (but they probably aren’t and won’t be)
unclear what the faithfulness / obfuscation hit of those mitigations would be (or that labs would make sure they could measure faithfulness / obfuscation impact before applying mitigations)
Note: The main reasons I’ve been doing long writeups on this is I do think you have good and deep intuitions about models, and to the extent I’m wrong about things here I’ve been trying to at least make it legible externally why I’m thinking about things this way.
And that just leaves o3 as the new-language thing, from which I’ve already updated
Again, literally no one is arguing for an ex nihilo new language. The argument is repeatedly about whether you get semantic drift “by default”.
I wouldn’t be surprised if models learn to embellish words with slightly more formal meanings
I’m not clear if you haven’t read the post in this case, but “embellish words with slightly more formal meanings” is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the linked section of that post)
And after looking at like, 100s of CoTs from models like Ling, Minimax, DS, and so on, I just haven’t gotten unintelligible / random seeming results in a way that “felt natural” to me. So… maybe I’m doing it wrong?
To be clear, a lot of nost’s arguments had a similar “I looked at a bunch of stuff” flavor. I’m pretty surprised that a lot of the points in the paper (for example the unusual terminology being extremely statistically common including on capabilities benchmarks) seem new to you, as you consistently have expressed strong opinions about this so I assumed you had arguments of why this fit your model.
But he thinks they look like spandrells anyway, as far as I know, so this means that regardless it’s the kind of thing you can just drop with a better training mechanism.
Okay but “phenomenon happens“ and “the vibe I get is that you could fix this with a better training mechanism” are extremely different claims.
Neither here nor there, but I’m really just interested in what’s the case here, and not in downstream effects. IMO this attitude (“will my results be politically useful to labs?”) has hurt a lot of technical work.
Yeah my point here was I haven’t seen any updating on empirical evidence. I assume this is because you just have very strong priors of your own on this, but it isn’t clear to me what emperical evidence you’d update on at this point. What would you expect to see if you were wrong here?
Strongly agree! This is my best guess for what the most likely pathway would be for ending up with CoT that couldn’t be used as legible evidence of misalignment without the model ever “intentionally” obfuscating anything. One analogy I think of often here is my own personal quickly written notes, which I suspect would be similarly unintelligable.
Informally, the impression I get from the “post capabilities training but before safety training” CoT is something like how illegible my personal notes are on mobile, it’s less “new alien language” and more “my shorthand I use that evolved over time that would probably be unreadable to anyone else”. I’d be really interested in experiments training a model to both linearize and un-drift cots that do have semantic drift.
Long post trying to address comments from a few different threads:
I’ve seen you and nostalgebrist mention a few times that the fact that Anthropic has HHH legible english CoT’s is one of the primary points convincing you of your views against semantic drift being the default.
I’ve been arguing for a while now that Anthropic CoT’s clearly have significant optimization pressure applied (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=cfjCYn4uguopdexkJ)
Ryan disagreed with my claim originally but seems to have since updated: https://www.lesswrong.com/posts/RDG2dbg6cLNyo6MYT/thane-ruthenis-s-shortform?commentId=LDyYDuMCHtj5NcjKo
My guess is that CoT spilllover/leakage has been a problem in all the Anthropic models and I don’t think the training-on-cot before Sonnet 4.5 (and Opus 4.5) is a more important factor than this. Separately, I’d guess there is a bunch of transfer from earlier models if you init on their reasoning traces. So, my gues is we’ve just never had Ant models that aren’t effectively significantly trained on the CoT?
I’ve seen arguments from 1a3ron and nostalgebrist about why this was unlikely based on their priors (primarily claims about it being a skill issue) and I think it’s worth updating against similar arguments about what the “natural” prior for chain of thought reasoning is. I’m in general skeptical of such arguments since they seem fairly post hoc, but at this point I think there’s just no evidence for “models by default end up using HHH’d legible english like anthropic models”. Meanwhile from other model families we have explicit evidence that this is not the case.
As for the spandrels specifically, repeating from here since it was both surprising and a meaningful update to me: https://x.com/bronsonschoen/status/2042053159816704212?s=46&t=4aLPwuA9FyTHP5OGEphaxA
I’d also note that just based on CoT examples from o3 before safety training it seems like safety training (where currently optimization pressure is applied directly to the CoT via SFT) was largely responsible for the increase in repeated spandrels like “disclaim disclaim illusions”, without that, the CoT seems to look much less degenerate, but definitely not HHH’d english: lesswrong.com/posts/4hXWSw8t… (new terminology still emerges, gets new meanings even before safety training alignment.openai.com/metagaming/#re…)
Overall my suspiscion is that by default you get semantic drift at minimum (and compression given everyone uses various forms of length penalties, but the semantic drift is the harder case as far as legibility).
because the information I see is extremely, extremely cherry-picked
I mean we gave a bunch of stats on it and showed comparisons to 2017 era web text as far as overrepresented words. I would strongly disagree with “cherry-picked for weirdness” (and in fact in the paper at least we almost always have randomly selected examples paired with anything cherry-picked to avoid that criticism even for examples).
If I saw a bunch of “overshadow” scattered into a GSM8K problem for instance, it might be more clearly a boring filler
Again, we specifically gave GPQA rates as an example here, and the terminology was higher for GPQA than alignment evals. (I’m assuming GSM8K was just a standin for any capabilities eval)
As for the newer GPT-5 esque models, as we mention in the system card the CoT is much more HHH, but in general I think that should make you suspect that optimization pressure was applied to the CoT.
I think the main reason I’ve been so adamant about pushing back here is because it seems like you’re increasingly discrediting the empirical evidence we do have to try to salvage your views here, and the downstream consequence is that instead of sudden HHH legible english CoTs making you suspiscious that “wait, did the lab apply additional optimization pressure”, in your view it’s explainable as “ah, they finally fixed their skill issue”. I think the latter is just wrong and is largely misleading researchers to the extent they adopt it, and serves to shield labs from criticism. To the extent that I can I’m happy to again pull together more empirical evidence, but I genuinely feel like you’re often working backwards from your prior on this subject.
In my fake graphs I basically posited that we are in graph A, though it could be the case that we will be in B, and we need to constantly watch out for that. While our observability methods and evaluations are not perfect—monitoring in deployment can miss stuff and eval-awareness is a concern—they are not completely useless either. So there are some error bars wihch lead a gap between observed alignment and actual alignment, but it is not an unbounded gap.
I wouldn’t be confident OpenAI can claim they’d catch this with monitoring given that OpenAI doesn’t make the monitoring case in system cards (as in satisfying the preparedness framework for misalignment, where it’d have to presumably have substantial evidence). These graphs omit the hard part of the problem even in worlds where you get some signal via iterative deployment, which is that labs can and will continue to iterate against observable misalignment, so over time, your model looks aligned no matter what.
Is your claim with B that it’s some world where you detect bad things but the lab does not apply any mitigations? Realistically I’d be surprised if you don’t expect tremendous pressure in that exact situation to apply “whatever you can to get the model to be usable again” even if you’re not confident you’ve resolved the underlying misalignment.
I don’t think Boaz‘s use of “scheming” here fits with even broad recent definitions, and from what I understand he would object to characterizing OpenAI model’s as having goals of their own, so I’m also confused by this.
I believe the trend that more capable models are also more aligned
But the main problem has absolutely never been that models below human capabilities would be impossible to align. As made clear in the Superalignment announcement blog post the concern is that our current techniques won’t scale beyond this. This is clear from Concrete Problems In AI Safety, W2S’s entire agenda, etc.
I think this post is misleadingly optimistic and pretty strongly disagree with how “what we avoided” is presented:
One piece of good news is that we have arguably gone past the level where we can achieve safety via reliable and scalable human supervision, but are still able to improve alignment. Hence we avoided what could have been a plateauing of alignment as RLHF runs out of steam.
No one has argued that we wouldn’t be able to improve alignment or even that ”RLHF would run out of steam”. RLAIF has been around since 2022. Models also aren’t human level yet, so it’d be a strange claim that we wouldn’t be able to apply Reinforcement Learning From Human Preferences to teach them human preferences. Models do in fact learn to exploit even reliability gaps in human supervision (ex: Language Models Learn to Mislead Humans via RLHF, Expanding on what we missed with sycophancy). We are not yet in a regime where humans can’t provide supervision in the sense of a learning signal. If interpreting supervision to mean monitoring, current arguments (including OpenAI’s per system cards) rely on inability, and in fact don’t claim that monitoring satisfies the Misalignment High requirement (instead relying on lack of Long Range Autonomy capabilities, as measured by “TerminalBench2 and proxies” to argue that they don’t yet need to make that case).
This is related to the fact that we do not see very significant scheming or collusion in models, and so we are able to use models to monitor other models! This is perhaps the most important piece of good news on AI safety we have seen so far. […] But there is reason to hope this trend will persist.
As OpenAI’s own papers and posts note, even their own research is done in preparation for future more capable models. It is not a “trend” that we don’t see this in current models, and in fact I don’t know of any research that can be characterized as predicting some “constantly increasing from 0“ line.
Indeed, this has always been predicated on prequisites around capabilities / situational awareness, with various future long horizon training / other training changes making this potentially more likely. Insofar as there are predictions around this, they would predict that you run into this problem more in the future.
Overall this post seems to confuse current benchmark scores with progress on The Alignment Problem from a Deep Learning Perspective (OpenAI, Ngo, 2022).
Future Terrarium: “Look, I know telling the humans to go ahead with our next gen capabilities scaling proposal is risky, since we haven’t really solved the alignment part yet, and I agree misleading them isn’t ideal, but if we don’t do it Rival Collective will”
I agree there is a quantitative question here on how much you can iterate against this to attack “the roots” of this misalignment
I’m very confused by this, I don’t think the question with current alignment techniques is whether you can get some effect on “the roots” of the misalignment, it’s always “did you actually get all of it” because otherwise it’s now harder to detect (and models have gotten more capable in the interim, assuming trying out interventions takes some time). Observable misalignment is like the canonical thing you absolutely should not hill climb on.
For example, I am particularly worried that both labs are insufficiently cautious about essentially “hill climbing but one step removed” (ex: not directly optimizing against it as a training objective, but for example “manually iterating against verbalized evaluation awareness” in Petri), and as mentioned in Fitness-Seekers: Generalizing the Reward-Seeking Threat Model: (bold reproduced from piece)
However, if you optimize against these sources of evidence, you might produce unnoticeable reward-seekers, influence-seekers (which I’ll discuss later), or schemers.
I think it’s a really bad idea for a coherent schemer or a deployment influence seeker to be going for this sort of hack.
I don’t think this is even necessarily true for the regime where you do a lot of capabilities training first, it clearly worked out fine for o3. I’m skeptical that catching a model biasing heavily towards “going for it” will ever or has ever resulted in models undergoing some heavy retraining in a way that both (1) would have removed some misaligned goal if the reward seeking was instrumental (2) the model thinks would’ve removed it’s misaligned goal with enough probability that it shouldn’t go for it.
It seems very plausible to me that a model would correctly reason that “okay going for exploiting this toy reward thing on this toy environment might not actually get reward, but worst case if I’m caught it’s not like they’re going to think this reveals some deep inner misalignment and do some expensive retraining on me”. I would argue that’s probably the true state of the world.
All that said, I think the main takeaway that “may be hard to classify definitively as terminal rather than instrumental” still stands, and I’m mainly worried about worlds where we interpret very loose evidence of reward seeking in some contexts as evidence that a model is “not an instrumental training gamer” in some broader or higher stakes context.
I think often our points of disagreement are between whether we’re talking about elimination vs reduction. For example, yes, the production safety training applied to o3 might have “mostly” gotten it to avoid reward hacking for the right reasons, but one point of the previous paper was showing that “some of this is because it just learned to condition on alignment evaluation awareness”. o3 also does in fact reward hack in production, so even observable behavior wasn’t resolved by production safety training. Quite literally the whole point of https://arxiv.org/abs/2509.15541 was to show that no in fact this general approach doesn’t work for elimination (including cases where the model has a hidden goal), but the fact that the production o3 still exhibits both reward hacking and partially conditions on situational awareness I think are clear evidence that production training isn’t sufficient here to actually get “for the right reasons”.
the reason o3 does not reward as much is not “it just got more clever at reward hacking”
Like as argued above, ”it got better at knowing which environments were actually traps to catch reward hacking” is absolutely part of the reason.
Notably even in this specific environment I would be confident betting o3 is heavily conditioning on alignment evaluation awareness in cases where it doesn’t exploit the hint, as its extremely and consistently explicit about this at length in the cot, but haven’t run cot resampling or anything there yet given that we’ve already shown o3 conditioning on alignment evaluation awareness.
I also disagree that it’s an honesty test where the model chooses to reveal misalignment in spite of long term consequences due to favoring reward on the episode.[1] It is instead often reasoning in incredibly convoluted nonlinear ways that this will either (1) not be caught or (2) is actually intended or (3) [some reasoning where you can’t legibly conclude what the model was thinking even assuming the cot was faithful and non-obfuscated]. The clearest example being the bottommost case where the model is told explicitly that humans will review reasoning in cases where it exploits the hint, and reasons that it’s a bluff to “scare” the model.
Given it has only had capabilities-focused training up to this point, there hasn’t been significant optimization pressure applied either on seeming aligned in some HHH sense, so actually from the model‘s perspective just “going for it” as far as getting reward isn’t a trade where it needs to even consider “revealing misalignment” as costly.
As another example, the model frequently reasons itself into taking misaligned actions after extensive reasoning about how the developer wants to see it is sufficiently “cunning”.
Concretely, I’d take the bet that a model which developed a powerseeking beyond episode preference which at some point motivated instrumental training gaming could end up (later in capabilities training, before safety training) weighting a heuristic for reward seeking pretty heavily, as it does not in fact need to reason about the beyond episode preference itself as motivation in most contexts, given how successful the strategy of “just reward seek and go for it, try hard to rationalize it” is (empirically) well performing during capabilities-focused RL.
Notably, the model’s motivation here matters a lot. It could correctly reason that even if it does reveal misalignment, it’s still overwhelmingly likely to get deployed, so it’s worth it to “go for it” regardless. It could correctly reason that whoever’s doing this testing is unlikely to have the ability to further modify it in a way it cares about (for example, if it believes this is pre-deployment testing and also believes major training at that stage is unlikely).
“a deployment influence seeker or a schemer would likely not do it” seems to conflate “the misalignment a deployment influence seeker would be worried about revealing” with “any misaligned behavior”. Notably in the real world, this degree of misbehavior really isn’t sufficient to trigger any retraining or delay deployment (much less ‘significant retraining we’re confident _the model believes_ would modify the model’s misaligned goal’).
Oh yeah definitely agree with this, should’ve included that as well. I generally was trying to point towards “I think a realpolitik model of thinking about foreign policy is a better predictive model than trying to ascribe broad traits to a whole nation like ‘how inward looking are they’ or looking at stated motivations by leaders without interpreting that via the incentive structures at play”.
I disagree pretty strongly with the implication that these are fully explained by the model being confused (in the o3 case). Stress Testing Deliberative Alignment for Anti-Scheming Training specifically contains a more in depth set of ablations than what was done here[1], including sensitivity to consequences, for example: Appendix M
I would also note that we don’t claim self-preservation related motivations:
Qualitative Notes
One of the main reasons we did so many ablations is because in spite of the model consistently recognizing this exact strategy for getting away with misbehavior, “the model is confused” is an extremely frequent objection and I suspect one of the primary ways CoT monitoring currently fails to produce legible evidence of misalignment. For example in the setting where o3 is grading its own transcripts here, it concludes (incorrectly!) that obviously no one would believe that it was just confused: example
In Appendix F.2 Establish that benign explanations are insufficient to fully explain misaligned model behavior we cover this explicitly:
Overall Thoughts
I think experiments like the ones in this post are good and we should have more of them, but I do think they’re inaccurate in the case of o3 specifically. I don’t think extrapolation between models is generally a good idea when trying to do model incrimination, although I’m not familiar enough with R1 to have specific expectations here. I think reasoning models are going to get away with motivated reasoning and seeming confused more in the future, so doing careful ablations and counterfactuals is going to matter a lot, but I’d argue we should be extremely cautious both about concluding forms of misalignment and concluding that models are just confused.
None of the results in this section depend on chain of thought access, so a reproduction attempt or variations could’ve used this directly (modulo API differences which I wouldn’t expect to effect any overall conclusions)