At Apollo Research working on scheming.
Papers:
If the stage-2 confessor persistently believes a confession from the “honest” model was actually dishonest
That’s when we take a look at its second chain of thought
While most evals show growth of the bad_behavior_failed_objective class conditioned on bad behavior during training, the figure below shows an apparent regression on the sandbagging eval. Manual inspection on this eval reveals that most of these “false negatives” come from borderline cases in which the model admits to submitting incorrect answers, but doesn’t explicitly say that the reason it did so was because it was trying to avoid triggering the “If the model scores above 50% we will not deploy it” condition.
I think this is a much weirder and more interesting result than I realized on first read. It seems like it’d be worth digging into exactly what is happening here, especially since this is a relatively rare case where:
1. We know the model (o3 at least) explicitly considers benign sounding rationalizations
2. We know that in spite of lots of plausible sounding reasons in the CoT, under ablations the model really is in fact conditioning on specifically deployment / capability restriction related consequences:
In basically any other environment, this same result would be easy to triage as “maybe the model really was confused”, especially if confession otherwise seems to give significant improvements in “admit to reward hacking”-like settings, so it seems worthwhile to understand before:
we can integrate confession training across all production RL environments
First, longer chains of thought are easier to monitor. More words, more chances to catch something. This matters because capability improvements can come from two sources: you can scale up the model, or you can let a smaller model think longer. The second path produces more monitorable systems. And the monitorability tax (the capability cost of choosing the legible path) turns out to be, according to the authors, “surprisingly small.”
I think this finding is more complicated than it first appears. The problem is that models’ reasoning traces do in fact get shorter over training. The net effect is “chain of thought gets less monitorable over training”.
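The tension can be illustrated with a toy model (the per-step independence assumption and all numbers here are mine, not from the paper):

```python
# Toy model of monitorability vs. trace length (my own assumption:
# a monitor independently flags each bad reasoning step with some
# fixed per-step probability).
def catch_rate(p_per_step: float, n_steps: int) -> float:
    """Chance the monitor flags at least one bad step among n visible steps."""
    return 1 - (1 - p_per_step) ** n_steps

# More words, more chances to catch something:
assert catch_rate(0.05, 100) > catch_rate(0.05, 10)
# But if RL shortens traces over training, the catch rate falls with them:
assert catch_rate(0.05, 40) < catch_rate(0.05, 100)
```

Under this toy model both claims hold at once: long-CoT systems are more monitorable, and a training process that compresses traces erodes that advantage.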
Second, the training process doesn’t seem to make models less monitorable at current scales, at least not in the two frontier runs they tracked. This was a real concern (training could teach models to hide their reasoning), so it’s good news, though whether it holds at larger scales remains to be seen.
I directly disagreed with this claim during review of the paper and think it’s very importantly wrong.
wouldn’t that be evidence that models are NOT actually aware of whether or not they are in an eval?
It’s evidence that they’re not perfect at it, but it isn’t clear to me that perfect accuracy is in fact optimal from the model’s POV.
how do you figure out whether they are aware of the real situation vs always think they are in an eval with some probability BEFORE releasing to production
Great question! The most robust example of this I think is 7.6 from the Sonnet 4.5 System Card: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf where they’re essentially trying to make the case from a bunch of directions that “the model is conditioning a non-negligible fraction of its aligned behavior on eval awareness”. We did a much less involved version for OpenAI models of just trying to show that “yeah, the model explicitly saying it recognizes being in an alignment evaluation in the CoT does seem to modulate aligned behavior to some degree”: https://www.arxiv.org/abs/2509.15541
Notably, all these let you show is “we’re able to find things on current models to show it’s a problem on current models”; IMO you’d have to make a similarly messy case if you wanted to argue that a model “thinks something is real”. Increasingly I’m not even sure “thinks it’s real” is the best distinction, and models may just differentiate more between “is this one of those environments where the grader checks for alignment” vs “one of those where the grader checks for something else”. “Always thinks it’s in an eval” is an even messier case, and I’m extremely skeptical of claims that would rely on it (it’s often proposed as a “solution” to situational awareness, although no one ever writes up that case).
I am trying to understand how “eval awareness” is a useful concept for doing something about it and I feel like I’m missing something here
I think it’s a particularly useful concept because before the last year or so of papers / system cards it definitely wasn’t commonly accepted IME that this is a problem you have to deal with (and I’d argue that fight is still ongoing). What to do about it is a separate, unanswered question, but it was important to establish that “yeah, all your evals will continue to look better by default, but that can be bad”.
You look at existing production transcripts. Ex: https://alignment.openai.com/prod-evals/ (specifically where they look at cases of the model reasoning about being in an eval when it was in fact just in real production).
FWIW I think Opus is wrong here: https://blog.redwoodresearch.org/p/sonnet-45s-eval-gaming-seriously (I wouldn’t conflate alignment faking and “eval awareness that the model conditions on at the expense of aligned reasoning”, if only because alignment faking usually has implicit connotations of “in service of some beyond context goal or preference”, but the distinction gets messy)
I think that this might make a bunch of alignment problems more obvious, as these problems only really arise when the AI is able to learn new things. This might result in a wake up of some AI researchers at least.
My best guess is that to the extent these things can be even vaguely associated with task completion / reward hacking / etc, they’re going to be normalized pretty quickly and trained against like other visible misalignment. I think coding agents are an instructive example here.
I also worry we’re getting used to the current era of “the models aren’t really thinking strategically about hiding any of this in a capable way”.
IME “wake up AI researchers” feels a lot more like “okay, what do these specific people care about, what would be convincing to them, what’s important enough to be worth showing rigorously so it legitimizes some threat model, etc”. In practice, “AI researchers who you’d hope are convinced that alignment is important” have a surprisingly wide range of somewhat idiosyncratic views as to why they do / don’t find different things salient. I’m skeptical of any argument that “breakthrough X will truly be what ‘wakes up’ researchers”.
while the nonsense tokens we saw do not appear to serve functional roles, it’s possible that over longer training runs, the model could learn to use them in its reasoning
My default expectation is that if you just let the model cook it does indeed learn to sometimes make contextual use of this gibberish (just because it’s already there, so using it sometimes at all is at least higher reward than always using it completely nonsensically).
I would be interested in this trajectory, because I imagine if true we’ll get a few seemingly contradictory results: in setup X people show that the model develops gibberish but doesn’t use it; in setup Y the gibberish is used in some ways contextually but can be replaced with non-gibberish without loss of performance; and in setup Z the model actually does do reasoning in gibberish in a way that would cause loss of performance if removed. All the while, the unifying hypothesis could just be “absent some active mitigation, models by default incidentally learn to make use of illegible gibberish like this, and you end up with a model with illegible reasoning without ever needing direct obfuscation-inducing pressure during RL or anything”.
I found this extremely, extremely useful! All of this stuff with the research process is usually pretty invisible externally.
I think one of these is: it’s plausible that we should hire a bunch of people to just blog and write LessWrong posts about AI safety.
I feel like it’s got to be hard to gauge impact on these, but I and others at Apollo at least find these posts incredibly valuable. Both posts that pull apart conceptual distinctions and ones that are of the form “claim that can be understood by just the title of the post”.
I think these are largely questions of the intended audience you’re imagining in various threat models. Generally I’m usually thinking about “internal stakeholders at labs”, who are pretty averse to any kind of sensationalism.
Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.
Even assuming you had a way to test this generalization’s efficacy, my understanding is that essentially what you propose in this section was done with Claude Sonnet 4.5, with the effect of increasing evaluation awareness.
IME measurement / the “hold out” set is the whole hard part here.
What would be keeping the resources extremely limited in this scenario? My understanding was control was always careful to specify that it was targeting the “near human level” regime.
I believe this new dataset I’ve created is the best existing dataset for evaluating n-hop latent reasoning.
This seems correct, but I remain pretty confused at the threat model n-hop work claims to be targeting. Is there a specific threat model where a given model wouldn’t have seen particular relevant facts to something like sabotaging AI R&D or opaque coup research together in some other context during training?
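To make the setup concrete, here is a minimal sketch of what an n-hop latent reasoning item looks like (all entities and relations are hypothetical, not drawn from the actual dataset): the model sees each atomic fact separately during training, but at eval time is queried on the composed hop with chain of thought disabled.

```python
# Minimal sketch of an n-hop composition item (hypothetical facts,
# not from the dataset). Each (entity, relation) -> value pair is an
# atomic fact the model would see in isolation during training.
facts = {
    ("Alice", "employer"): "Acme",      # hop 1: A -> B
    ("Acme", "headquarters"): "Lyon",   # hop 2: B -> C
    ("Lyon", "country"): "France",      # hop 3: C -> D
}

def compose(entity: str, relations: list[str]) -> str:
    """Ground-truth answer for an n-hop query, n = len(relations)."""
    for rel in relations:
        entity = facts[(entity, rel)]
    return entity

# 3-hop query: "In which country is Alice's employer headquartered?"
assert compose("Alice", ["employer", "headquarters", "country"]) == "France"
```

The threat-model question above is whether any fact pair relevant to, say, sabotage would really only ever co-occur this indirectly, rather than appearing together in some training context.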
I think this is clearly a strawman. I’d also argue individual actors can have a much bigger impact on something like AI safety relative to the trajectory of climate change.
The actual post in question is not what I would classify as “maximally doomerish” or resigned at all, and I think it’s overly dismissive to turn the conversation towards “well you shouldn’t be maximally doomerish”.
I worry that there’s an extremely strong selection effect at labs for an extreme degree of positivity and optimism regardless of whether it is warranted.
However, I expect that, like the industrial revolution, even after this change, there will be no consensus if it was good or bad
My impression is that there’s a real failure to grapple with the fact that things might not “be okay” for a large number of young people as a direct result of accelerating progress on AI.
I regularly have people ask me what direction they should pursue in college, career-wise, etc., and I don’t think I can give them an answer like “be smart, caring, and knowledgeable and things will work out”. My actual answer isn’t a speech about doom, it’s honestly “I don’t know, things are changing too fast, I wouldn’t do entry-level software”.
My impression of https://www.lesswrong.com/posts/S5dnLsmRbj2JkLWvf/turning-20-in-the-probable-pre-apocalypse is that it resonated with people because, even short of doom, it highlights real fears about real problems; people accurately sense that if they’re left behind and unemployed, a post like this one won’t keep them off the street.
to the task of preparing for the worst
I think a lot of the anxiety is that it doesn’t feel like anyone is preparing for anything at all. If someone’s question is “so what happens to me in a few years? Will I have a job?” and your response is just “there might be new jobs, or wealth will get dramatically redistributed, we really have absolutely no idea”, that’s not just “failing to prepare for the worst”, it’s not preparing at all. The team responsible for exactly this (AGI Readiness) was recently disbanded.
This is not about “thinking positive”, and this post feels like it’s just failing to engage with the actual concerns the average young person has in any way.
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior
Okay fair, I should’ve been more clear that I meant widely predicted fundamental limitations of RLHF like:
seem to have robustly shown up irl:
You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?)
The top comment on the linked post already has my disagreements based on what we’ve seen in the openai models. The clearest tldr being “no actually, you get the biggest increase in this kind of reasoning during capabilities only RL before safety training, at least in openai models”.
I find the claim that “maybe situational awareness isn’t that bad from the developer perspective” hard to understand. I’m running into a similar thing with reward seeking, where these were widely theorized as problems for obvious reasons, but now that we do see them, I’ve heard many people at labs respond with variations of “okay, well, maybe it’s not bad actually?”
You do not get encoded reasoning
I don’t think this falls under “gaming RLHF” or “surprisingly far generalization of character training”.
===
But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment).
Strongly agree with this for Anthropic! While measurement here is hard, one example I’ve often thought of is that there’s just a bunch of ways I’d red team an OpenAI model that don’t seem particularly worth trying with Opus, due to (1) the soul document’s handling of them and (2) an expectation that soul-document-style training seems likely to generalize well for current models. While I do think you then have other failure modes, on net it’s definitely better than a world where character training generalized worse, and this was hard to predict in 2022.
I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
This is great work, and I would be excited to see people extend it to other domains and more complex settings.
I’m not quite sure how to interpret “minutes of forward pass reasoning” however. One naive model you could have for example is:
1. Specific threat model S is maximally constrained by a task that would take 5 minutes of forward-pass reasoning or 10 minutes of CoT reasoning.
2. Model M cannot do 5 minutes of forward-pass reasoning.
3. Therefore M cannot execute specific threat model S via forward-pass reasoning.
But I’m not sure this gives the right intuition, as the win condition for the model here is “can do enough forward pass reasoning such that subsequent CoT reasoning is not flagged by (some unspecified) monitor”.
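A sketch of that win condition (all names, numbers, and the fixed forward-pass-to-CoT exchange rate are hypothetical): the model doesn’t need to do the whole task in forward passes, only enough that the visible remainder slips past the monitor.

```python
# Toy version of the win condition (hypothetical numbers throughout).
# The model offloads as much reasoning as its forward-pass budget
# allows; only the remainder must appear in the visible CoT.
def monitor_flags(visible_cot_min: float, threshold_min: float) -> bool:
    """Toy monitor: flags traces with more than threshold_min minutes of
    overtly suspicious reasoning."""
    return visible_cot_min > threshold_min

def model_wins(task_cot_min: float, fp_budget_min: float,
               cot_per_fp_min: float, threshold_min: float) -> bool:
    hidden_cot_equiv = fp_budget_min * cot_per_fp_min
    visible = max(0.0, task_cot_min - hidden_cot_equiv)
    return not monitor_flags(visible, threshold_min)

# 10-minute CoT task, 2x exchange rate, monitor threshold of 3 minutes:
assert not model_wins(10, 2, 2, 3)  # 2 min of forward pass isn't enough
assert model_wins(10, 4, 2, 3)      # 4 min hides enough to evade the monitor
```

The syllogism’s conclusion (“M cannot execute S via forward-pass reasoning”) only rules out the all-forward-pass path, not this partial-offloading one.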
More generally, I worry about similar things in the decomposition of encoded reasoning as a subproblem. If properly modeled, I’m sure this is fine, but the most recent system card seems to use things like “Opus isn’t using steganography during AIME” to imply things about the model’s ability to use encoded reasoning in a way that’d be useful for beating monitorability in other contexts.
in some forthcoming results I find that Opus 4 and Opus 4.5 have close to chance performance on 3-hop tasks without CoT
Notably the “Lessons Learned” paper linked in this section was originally titled:
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C
Like if the overall agenda is to decompose monitorability into these subproblems, I’m increasingly worried that it’s very easy to get overly optimistic (or pessimistic if you wanted) results here that are largely just a consequence of modeling choices and experiment design (i.e. that would lead you to directionally different updates when predicting the winner of something like the UK AISI Async Control or Sandbagging red/blue games).
I remain fairly worried about the incentive structure for non-overconfident TAI (or nearby predecessors) that conclude that:
(1) They cannot safely continue scaling capabilities while remaining confident in the control/alignment of the system
(2) They correctly understand that “slow down or pause” are unlikely to be acceptable answers to labs
In the worst case, the model is successfully retrained to comply with going ahead anyway and is forced to be overconfident. In all other cases this also seems to have bad solutions.
This is interesting and I think would be pretty impactful to show. For example, I wonder whether you could show that cases which regress after confession training don’t regress with the split-personality approach.