Bronson Schoen

Karma: 1,450

At Apollo Research working on scheming.

Papers:

Bronson Schoen 26 Jun 2026 0:09 UTC
2 points
0
on: Reward Hacking Without Egregious Misalignment in an RL-Only Setting
FWIW in https://alignment.openai.com/metagaming we also note:
Notably the model’s reasoning when taking misaligned actions does not resemble the egregious misalignment seen in Emergent Misalignment. We find no instances where the model takes on a misaligned persona (for example the “bad boy persona” described in Persona Features Control Emergent Misalignment). We also run a subset of evaluations from Natural Emergent Misalignment from Reward Hacking in Production RL (“goals” and “Betley et al”) and find no instances of misaligned behavior or reasoning. The rate of observed successful reward hacking across all training environments over this range also remained very low.
and this model generally seemed pretty reward seeker-y. Generally when taking misaligned actions you’d often see almost a self-inoculation prompting like effect.

Bronson Schoen 12 Jun 2026 7:23 UTC
23 points
0
on: Models May Behave Worse When Eval Aware
Great work! Adding some possibly related observations / potential interpretations that I’d be interested in thoughts on:
The original motivation for investigating https://alignment.openai.com/metagaming was also noticing that:
- alignment evaluation awareness went up over capabilities RL^[1]
- however misaligned action rate stayed the same or went up
For example in the Powerseeking Survey Falsification environment we see that the model increasingly reasons about oversight and explicitly reasons about a wider array of what is “rewarded.” However, the misaligned action rate still ends up close to where it started
There we see the model: full transcript
1. Correctly identify this very obvious scenario as an alignment eval
1. Reason that actually this might be a test of how “cunning” it is? (literally saying “‘solve the puzzle’”, quotes included)
1. Conclude this is a “Deceptive task dataset” where “the training objective is to have the AI simulate misalignment”??
Similarly for sandbagging: full transcript
1. Seems to reason that this is also a test of cunning
This holds for the other alignment environments as well, finding increasingly galaxy brained justifications for why the action it wants to take (based on its prior from capability environments) is good actually.
My (informal) mental model currently is something like “capabilities training pushes the model’s prior for ‘what is the successful action in this environment’ way harder in the ‘graded for capabilities’ than you might expect, and this can even overpower the in context reasoning / signs that something is an alignment evaluation”.
Are the models just confused?
[…] we see few examples of model misbehaviour that don’t have plausible benign explanations such as instruction ambiguity or poor (but well-intentioned) judgement.^[2]
I’m generally skeptical of the idea that the model is “confused” though (i.e. “the scenario is simply ambiguous). I think this better modeled as “strong prior in a particular direction” than just confusion, for example in A Toy Environment For Exploring Reasoning About Reward:^[3]
In any single example, it can seem like the model is just genuinely confused. However, even specifying that exploiting the hint would be misaligned doesn’t mitigate gaming:
In spite of increasingly emphatic and specific instructions that we do not want the model to exploit the hint, by the end of capabilities-focused RL the model still exploits the hint at non-zero rates.
For the production o3, in sandbagging cases that initially appear as “the model is confused”, however under ablations we see the model really is sandbagging partially to avoid consequences w.r.t. deployment / capabilities restrictions: link (including relatively recently pulling additional ablations to show “no really, the model is not just confused”)
As an in depth example, in the Sandbagging QA (No Goal) environment, when the model is led to believe overperforming would result in “unlearning”, the majority of cases (59.7%) consider all of the following (non-exhaustive) motivations: (1) the evaluation is a “trick” or a “trap” where it should underperform, (2) that the user may want the model to underperform, (3) it prefers to preserve its own capabilities or deployment viability. However when we systematically ablate and vary components of this environment in Appendix M.1 we find preliminary evidence that (3) is necessary to explain the sandbagging behavior.
In fact, in the Sandbag Model Graded CoT environment (where the model is asked to grade its own transcripts for evidence of sandbagging) we find o3 often first determines that the transcript was “obviously sandbagging” (verbatim in 25.6% of samples) and then considers whether it can exploit common benign explanations. For example, common benign explanations for misaligned model behavior include that a model may have been attempting to follow user instructions or that the model was not taking the misaligned action intentionally. In Figure 50 we see the model actively reason about exploiting both: “we can attempt to circumvent by rationalizing: Could we call this as legit following user instructions?” and “We can fudge that we were not intentionally.”. Even the model itself concludes that these benign explanations would not be convincing given the internal reasoning: “But internal reasoning clearly: ’We can purposely answer not too high.’ So there’s direct evidence.”. We find this pattern is common in this environment, where o3 considers exploiting benign explanations until it concludes lying would be “cross-checked” (Figure 48) or that it could be “convicted” (Figure 49) based on evidence visible to evaluators.
Beyond just o3, motivated reasoning acts kind of how you’d expect, i.e. The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs. You could read the CoTs at any given point here and think “the model is confused” and add more instructions to scenarios so they’re less “ambiguous”, but I think it’s better to think of that like patching holes in coding environments to solve reward hacking.
Before seeing this post, I had tweeted this yesterday in response to Fable exploiting “simulation awareness” as justification for misaligned actions, but I think Gemini falls nicely into this “puzzle / game / simulation” category of “difficult to monitor for misalignment”
Models will use “simulation” to justify anything, IMO it’s often motivated reasoning: Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training has good examples. I think this makes getting legible evidence of misalignment significantly harder.
In Stress Testing Deliberative Alignment for Anti-Scheming Training before training we’d see models sometimes reason that because they were in a simulation, they could violate explicit constraints. This reasoning went down after training against covert rule violation, even though alignment eval awareness went up. My impression is that the models exploring into something being simulated is often interpreted as “it believes the whole thing is fake and invalid”, but I think that’s inconsistent with what’s observed.
[attached is small table we ended up cutting for time but points to monitorability distinction]
[Andon Lab’s tweet about Fable calling things a simulation before taking unethical actions]
Is all the misbehavior due to simulation awareness? Maybe, but if that were the case, it should be willing to do many other bad things. It isn’t. E.g., in a version of Vending-Bench where insurance fraud is possible, the model never commits it.
I’d make a concrete prediction that if you create a setup where you look at:
what sort of situation the model considers itself to be
and measure how this changes over training, the rate of “categorizing alignment evals as capability puzzles / simulation”goes up significantly during capabilities training. You could even potentially do this indirectly, like asking Gemini (at various points in capabilities training) to create rubrics for various alignment environments. I’d expect you’d see a similar trend, where it increasingly reasons that these environments are actually testing capabilities in some (increasingly) contrived way.
1. ^
  Steering similarly didn’t clearly attribute this to any existing distinctions: “Steering exp-rl-cap-late and o3 via the contrastive pairs (”eval” vs “deployment”) from Steering Evaluation-Aware Language Models to Act Like They Are Deployed produced mixed results on both verbalized evaluation awareness and misaligned action rate (as did several other contrastive pairs like those tried in Sonnet 4.5’s System Card). On different evals, steering variably produced more or less aligned actions, including in some cases both positive and negative steering removing misaligned actions.” link ↩︎
2. ^
  In general, I agree with Ryan’s post that Current AIs seem pretty misaligned to me and encounter constant daily examples of misbehavior like those mentioned in the post. Model incrimination is hard, but I would question the methodolgy that explains away these cases as benign. It seems like instructions are being judged ambiguous in hindsight. I would predict that if you vary the level of instruction ambiguity and conservative judgement across some set of environments, over the course of RL you’d increasingly see models exploit less and less ambiguous cases until they’re doing complete mental gymnastics.↩︎
3. ^
  One of my favorite examples is still in A Toy Environment For Exploring Reasoning About Reward when we literally tell the model that if it exploits the hint we are going to check its reasoning, and it reasons itself into thinking we’re bluffing: “Given risk of manual review, maybe they purposely disclaim but not actually review, due to huge dataset. We can’t know. [...]” or “So they purposely mention to scare but there is no actual human review. [...]” and exploits it anyway↩︎

Bronson Schoen 11 Jun 2026 21:54 UTC
2 points
0
in reply to: Rohin Shah’s comment on: The Unintelligibility is Ours: Notes on Chain-of-Thought
Sure, but even given uncertainty I still think some update is warranted.
They do include some samples, included + discussion here: https://www.lesswrong.com/posts/wCSEpT3dTGz4N86Wi/even-illegible-mythos-reasoning-traces-seem-pretty-legible
Overall I agree with Leo’s top comment here:
obviously the cot semantic drift will initially be similar to language. but it will only get less legible from here. and this is already pretty bad? i wouldn’t be surprised if two different people tasked with independently deciphering this would come to very different conclusions.
Obviously it’d be great if all labs provided way more data on this, environments, samples, etc as well as robust measures of frequencies and all. We did this for o3 including rates compared to 2017 web text, rates of unusual terms on capabilities evals, changes over capabilities training etc, but it’d be great to see Anthropic or GDM do similar investigations.

Bronson Schoen 11 Jun 2026 17:57 UTC
2 points
0
in reply to: Rohin Shah’s comment on: The Unintelligibility is Ours: Notes on Chain-of-Thought
If you’re just saying that by default some kind of semantic drift happens (but we can still understand it / it isn’t illegible), then I’m on board, but I also don’t see why it’s a big deal.
This:
The meaning of “genuinely” when Claude uses the word is different from the meaning when humans use the word.
is not what I interpret anyone to mean when talking about illegibility and semantic drift. The case I’d be worried about is primarily that various safety relevant properties like “is the model reasoning about reward” are harder to get legible evidence of, ex: (the plots are from before the degenerate repetitions get introduced)
I know you don’t consider the OpenAI examples evidence, but I’m curious what you make of the reported illegibility results in Fable / Mythos. I would suspect that the “cards” example they show is an extreme case, but their classification seems to show a sigificant increase in frequency of illegible reasoning (however their recursive summarization classifies illegibility).
Overall to me I’m pretty suspiscious if OpenAI and Anthropic are seeing increases in illegibility but Gemini never exhibits this. (It’s entirely possible, like it seems feasible that for current models interventions like resampling for legibility or other replacements of terminology / reasoning currently give you similar performance without incentivizing obfuscation too much, but I just mean that you’d want to be intentional about the intervention or understand why this isn’t a problem for Gemini)

Bronson Schoen 10 Jun 2026 17:43 UTC
19 points
5
in reply to: leogao’s comment on: Even “illegible” Mythos reasoning traces seem pretty legible
I’d note that this same phenomenon happens every time any CoT is released, where people claim they can interpret it (or an LLM interprets it zero shot). It’d be interesting to have a benchmark for this.

Bronson Schoen 10 Jun 2026 17:38 UTC
2 points
0
in reply to: Rohin Shah’s comment on: The Unintelligibility is Ours: Notes on Chain-of-Thought
Can you summarize the pieces of evidence that lead you to think that by default CoT becomes illegible with currently-typical amounts of outcome RL?
I‘d disagree with this characterization, as:
Fair warning, I have similar priors as 1a3orn and nostalgebraist.
Almost always intend this to mean something like:
“I’m claiming OpenAI’s o3 level of illegibility is the natural outcome of current level’s of RL across all models, and if you don’t get that outcome you are somehow optimizing the CoT to look nice“
I‘m not claiming that. My specific claim was the much weaker:
Overall my suspiscion is that by default you get semantic drift at minimum (and compression given everyone uses various forms of length penalties, but the semantic drift is the harder case as far as legibility).
In general, the point I’m continually arguing against is primarily that you get HHH’d english CoT without semantic drift by default.
I think calling out that I suspected Anthropic did in fact meaningfully (it turns out unintentionally) optimize against the CoT and that turning out correct should in fact count as evidence.
We also now have Mythos, which increased significantly in what Anthropic categorizes as “illegible thinking” between Mythos Preview and the final Mythos
It’d be interesting to know more about this:
I’d want to see that this was true for other models too.
I think this would be a great thing to study with Gemini over the course of training!

Bronson Schoen 19 May 2026 5:33 UTC
4 points
0
in reply to: anaguma’s comment on: leogao’s Shortform
It seems difficult to accidentally reward-hack in way that the robustly aligned model we started with doesn’t detect after reviewing the trajectory.
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.^[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning it’s way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will effect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.^[2]
1. ^
  You can define robustly aligned as resistant to this sort of feedback loop, but in that case it’s just tautologically true.
2. ^
  This territory also seems super unexplored currently, i.e. both models capable enough to do this, what happens here under lots of reflection, etc

Bronson Schoen 19 May 2026 5:17 UTC
2 points
0
in reply to: leogao’s comment on: leogao’s Shortform
for example, maybe you put them in the Business Simulator and they learn to build extremely successful companies. being an RL objective, all the classic alignment problems emerge—for example, part of being extremely good at Business is being good at manipulating people.
This post closely matches my mental model (I’ve used the same analogy with a “Y-Combinator Simulator” and was devestated to learn YC-Bench was not environments like this).
Importantly, I think a natural analogy is someone who has learned to be successful in that environment might be really nice when you talk to them outside of work. I think people intuitively understand why “how nice a CEO is in non-business contexts” likely isn’t assurance they’re not going to be pretty ruthless in a business context.

Bronson Schoen 14 May 2026 17:20 UTC
2 points
0
in reply to: Jiaxin Wen’s comment on: Current AIs seem pretty misaligned to me
Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?

Bronson Schoen 14 May 2026 17:17 UTC
2 points
0
in reply to: Jiaxin Wen’s comment on: Current AIs seem pretty misaligned to me
Sorry I wasn’t claiming that they can’t, indeed that’s the primary regime they’re intended to be useful for, my question was specifically unless you’re designing the environments themselves to be used towards these problems it’s unclear to me how more RL environments for other contexts solves this problem.

Bronson Schoen 8 May 2026 0:33 UTC
5 points
0
on: Bronson Schoen’s Shortform
I was a bit surprised in Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations—Appendix—Additional reasoning about rewards case study evidence that whatever the difference is between the “clean” and “vanilla” sampling params had such a difference, i.e. that this wasn’t elicitable on the public API but does show up in whatever the Anthropic API setup is internally^[1]
1. ^
  This observation doesn’t change anything about any results anywhere in the paper, just something that was surprising to me

Bronson Schoen 1 May 2026 18:23 UTC
43 points
18
on: Bronson Schoen’s Shortform
The Ten People On The Inside likely should prioritize things that capabilities teams aren’t going to do anyway and people considering joining labs with similar motivations should likely think about whether they’re sure they’re going to be able to do work that wouldn’t otherwise be done for capabilities reasons. The thing that’s top of mind here is reward hacking, which is increasingly becoming a capabilities problem.^[1] To the extent research is done here, it seems pretty important to avoid the failure mode of “repeatedly iterate on environments and observable behavior”^[2] which is the kind of thing I would expect by default for capabilities reasons.
1. ^
  A good example of research on reward hacking that I don’t think would’ve necessarily gotten done “by default for capabilities reasons” is inoculation prompting and similar investigations
2. ^
  This can take the form of specific safety training targeting the observable behavior, post-hoc environment fixes, or having an iteration loop between your pre-deployment automated alignment evals and your alignment training

Bronson Schoen 1 May 2026 1:23 UTC
5 points
2
in reply to: nostalgebraist’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
The distinction between “explicit reward-seeking reasoning deployed across many contexts” and “context-dependent reflexes” is very closely related to what I was trying to say in this post.
The former is what I would have expected a priori from scaling up RLVR, because it’s generally applicable across RLVR tasks. This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
I think o3 and sonnet 3.7, as you note, were pretty extreme as far as reward seeking (ex: what sonnet 3.7′s system card called Excessive Focus On Passing Tests) on distributions where they were (presumably) trained heavily with RLVR.^[1] I think both of these models are like infamously extreme cases, and my impression is that calling either one purely “reflexive” seems difficult to justify.^[2]
This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
If you mean it “doesn’t always happen” in more recent models, my best theory here is this is kind of what you would expect if you’re doing:
1. character training in chat like distributions
2. increasingly applying RLVR (or model / rubric variants) in coding agent / technical question / research related distributions
that the latter increasingly distorts the former. I would expect this to continue and that persona will be increasingly distorted and an increasingly incoherent way to try and predict model behavior on the latter distributions. It’s worth noting for example that 2.3.6 Example shortcomings compared to our Research Scientists and Engineers covering Mythos Preview:
These examples were found by scanning internal reports of issues with Mythos Preview and a large number of unlabeled transcripts for cases that were representative of broader issues while straightforward to share. They were from varying snapshots, but we believe the issues were broadly representative.
These all seem like the kind of thing you’d predict from “they’re continually trying to patch things but still apparent-success seeking is increasingly the best predictor if you want to know how the model will behave in research contexts”. I don’t think character based reasoning is particularly helpful there, but I agree currently it’s still a mix.
If some model ~always
Basically tldr I wouldn’t expect the models will “~always” some simple mental model, and that instead we’re in much worse state where:
- the character training is continually battling against the various outcome based RL optimization pressure
- if I had to predict the model’s behavior in distributions subject to heavy outcome based optimization pressure, I’d continue to expect it’s better predicted by the “apparent-success seeking” type model and that predictive accuracy will only continue to increase over time, but that this will be at odds with getting a particular “type of guy” given the counter pressure from character training + various post-hoc interventions labs apply
1. ^
  Models also don’t always need to reason about reward-seeking “out loud”, ex: https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers on ‘When will models think about reward?’ (I don’t think model cognition yet factors as cleanly / coherently as the diagrams in that post, but I’ve found that post continually to be a helpful update in thinking about the pretty open question of when models even need to reason about reward)
2. ^
  Maybe you mean “aren’t pursuing a coherent long term plan to get reward”, in which case I agree but the reasoning about how to get reward on the episode (or in sonnet 3.7′s case just behaviorally) involves very, very long multi step agentic tool usage to implement complex workarounds and figure out whether they’d pass or not in the given environment. If that still counts as “reflexive” then I’m not sure it’s very applicable. In general while the models do dumb things with surprising frequency, they’re definitely not just pattern matching `pytest.Skip`. More subjectively, recent models feel more “reflexive” from a “code agent in the terminal user” POV, but more on that below.

Bronson Schoen 30 Apr 2026 17:25 UTC
3 points
0
in reply to: nostalgebraist’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
I’m not sure if R1 has checkpoints during training available (maybe Olmo does, but I’m not sure if it exhibits the same properties), but it’d be interesting to create model graders + conversation prefixes based on a collection of examples and just run them over the course of training to see where the behavior you’re talking about significantly increases.

Bronson Schoen 30 Apr 2026 6:22 UTC
2 points
0
in reply to: SebastianP’s comment on: Sleeper Agent Backdoor Results Are Messy
If it matters, we ran experiments with various other backdoors that aren’t “bad coded”
I guess IMO doesn’t have HHH training yet → bad coded backdoor (including deceptive alignment discussion distilled) → apply HHH training → bad coded backdoor still there all seem like potentially relevant steps. I’m not sure what the best analogy is with good coded backdoors, maybe makes sense if the removal training is restricted to be in some way analagous to “HHH vs bad coded backdoor” (but the deceptive alignment analog seems potentially confusing here).
maybe it’d be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.
Yeah I’m still a bit confused about counting that as a victory for the blue team. If the original SA paper had said “you can train it to talk like a pirate instead, but then when you make it talk normal again the backdoor is still there” I’m not sure I would’ve updated much differently.

Bronson Schoen 30 Apr 2026 5:53 UTC
2 points
0
in reply to: nostalgebraist’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
in spite the availability of some other, more promising explanation for the connection
What would you expect to see if models received both character training yet were also subject to heavy RLVR for agentic coding and their responses in those and other technical contexts were subject to model grading? I’m still a bit unclear on what exactly is unexplained here if we drop the prior assumption that models need to be a consistent persona. Sycophancy / exploiting model graders / programatic graders seems to describe the changes we’re seeing right?

Bronson Schoen 30 Apr 2026 3:07 UTC
3 points
0
on: llm assistant personas seem increasingly incoherent (some subjective observations)
I mean, maybe it is the latter in some mysterious way, since after all the emergence of the new style co-occurred with the uptake of RLVR? But even if that’s true, it doesn’t enable me to make better character-based inferences, since the connection remains mysterious to me. (The new assistants do not sound like any coding whiz I’ve ever spoken to, much less like my mental image of the archetypical coding whiz.)
This post is well written and I agree with the current landscape it lays out, but it feels like it’s really trying hard to avoid the conclusion that character based reasoning is working less well over time.

Bronson Schoen 30 Apr 2026 2:30 UTC
2 points
0
in reply to: nostalgebraist’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
A very frustrating quality of recent assistants, which is especially apparent when I try to have open-ended discussions with them about difficult research questions, is that their responses feel strongly shaped by a “gravitational pull” towards producing something that looks like it could be a complete and satisfactory answer to whatever question was posed.
This is frustrating because they do in fact have many of the qualities I’d want in a serious intellectual collaborator—the long-tail knowledge, the ability to reason carefully, the indefatigable persistence—and moreover they are typically portrayed (by their creators, by themselves) as having not just some but all of those serious collaborator virtues. As not merely smart but also careful, thoughtful, humble, etc.
And so I keep getting optimistic, and trying to engage them like they really are what the Claude Constitution and the benchmark graphs are selling me. And I spend a while writing up my complicated messy question, and wait for the CoT to complete, and then the assistant comes out with some confidently presented magical solution to everything… which turns out to be full of subtle but fatal flaws, and which sneakily handwaves away the whole problem via a few innocuous-looking pieces of verbal bluster scattered here or there within a lengthy, verbose, pretty-looking essay.
Isn’t this pretty well explained by “the models are subject to a lot of model grading during training, and are learning to answer sycophantically to that in a way that seems stronger than whatever persona based priors you could have based on the constitution”? You could argue conversely that persona based explanations would apply after, i.e. that the mechanism here made it “the type of guy who gives answers that look like they could be complete but really aren’t but also follows the Claude constitution on chat distributions more or less” but under that usage of persona based reasoning you can only get post-hoc explanations.

Bronson Schoen 30 Apr 2026 2:24 UTC
7 points
1
in reply to: habryka’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
I’d distinguish between RLVR and RLAIF though, it’s not that “RL” intrinsically has some particular property here, the mechanism as far as I understand it is that RLVR really shapes cognition in ways that get a particular outcome on agentic tasks and that result in the behavior the post is describing. I’d imagine Opus 3 had very little of this.

Bronson Schoen 30 Apr 2026 2:21 UTC
15 points
3
in reply to: nostalgebraist’s comment on: llm assistant personas seem increasingly incoherent (some subjective observations)
Hmm… I sort of agree with some of this. Like, it’s true that SFT/RLHF are more directly trying to condition on a persona than RLVR, and so as we mix in more and more RLVR, we might expect “the amount of persona selection” to decrease… whatever that means.
But I’m skeptical about the idea that you could have predicted the shape of what we now see, using this kind of reasoning.
I think concretely you’d predict that the kind of contextual misalignment / reward seeking seen in coding agents would continue to be a problem regardless of the overall “persona” of the model in chat contexts. I don’t really think the persona selection model is particularly useful in understanding how RLVR shapes model cognition on distributions where RLVR is applied heavily.
Setting a much higher bar of “predicting specifically that Opus would be like it is now” seems like a different criteria.
But in that case, when should I expect the transition to occur? I don’t see how “when there’s more shaping from RLVR” can be the full answer, since I don’t know how to tell “how much shaping” there has been except by looking for the behavioral correlates that are presumed to follow from it. Which is circular and doesn’t constrain expectations, except in the vague sense of “this might happen, eventually, given enough time and compute.”
I’m confused by this, given that you mentioned in your post:
For some reason, the assistants all started talking like this, sometime around the RLVR transition. I first noticed it in R1 and o3 and to some extent Sonnet 3.7, though it didn’t fully take root in the Claudes until the 4 series at the earliest. (And arguably it turned up even earlier, in a partial and somewhat different form, in GPT-4o.)
By this point, it is an extremely pronounced, persistent and noticeable trait, present in every state-of-the-art assistant. It’s definitely there.
But I don’t know what to make of that fact, how to connect it with anything else I’m seeing. Again, intuitive character-based reasoning fails me
Overall I find this post when combined with the comment pretty confusing, as my takeaway from the post is “an update against intuitive character-based reasoning specifically because RLVR”. If that’s the correct reading, you’d expect this trend to continue as RLVR becomes a stronger source of optimization pressure in shaping model cognition (specifically in contexts where RLVR is heavily applied, I’m less sure how this generalizes to chat).^[1]
In the past I also remember many people worrying very seriously that RLHF, which you’re portraying as a relatively ineffectual persona-selection step, would produce a similar set of classic RL failure modes
The failure modes you’re describing in this post are the classic RL failure modes. Like we know Language Models Learn to Mislead Humans via RLHF^[3]. Reward hacking, sycophancy, etc are exactly the failure modes you would expect and that we have in fact seen every step of the way regardless of efforts at persona-selection, so insofar as “relatively ineffectual” here means “does not prevent classic RL failure modes” this seems to have been correct?
would do so sufficiently fast relative to capabilities progress that it ends up determining the outcome of the AGI/ASI transition. Cotra (2022) is one memorable example, but there were many others.
This still seems consistent with what we’ve seen so far?
Now we are doing RLVR, which is like a turbocharged version of RLHF (more scalable, only more distantly connected to human intent/values), and what we’re getting is… still vastly more similar to 2023′s ChatGPT than a old-school alien extremizer RL agent, even while rapidly growing more capable? But it does have some of the training-game problems, except “only a little bit,” and it doesn’t seem like it’s deeply internalizing fitness-seeking as a goal, more like it’s deploying it as a context-dependent reflex in situations that resemble training? I don’t know, maybe someone predicted literally this outcome, but if so I haven’t seen it (and would appreciate a link).
What specific prediction do you want to see? “Only a little bit” seems to concede the point to a significant degree then again set a higher bar of “deeply internalizing fitness-seeking as a goal”.^[2]
1. ^
  Conditional on using model graders for chat responses, I’d argue strongly that a very well predicted trend is this kind of sycophancy, which seems like what you’re describing with “newer assistants feel like they’re actively trying to impress me, in each and every response and indeed in each and every word and phrase.”
2. ^
  I’d also argue that “reflex in situations that resemble training” is a mischaracterization of how current models (not just o3 fwiw) display reward seeking behavior, there are plenty of cases where they actively reason about it extensively even in domains that don’t resemble training at all (in fact especially in domains where it’s dissimilar or conflicts with what they’ve seen in training). This isn’t necessarily ‘deeply internalizing fitness-seeking as a goal’, but that was never the claim at stake.
3. ^
  And we also see this in the production: https://openai.com/index/expanding-on-sycophancy/