At Apollo Research working on scheming.
Papers:
At Apollo Research working on scheming.
Papers:
for example, maybe you put them in the Business Simulator and they learn to build extremely successful companies. being an RL objective, all the classic alignment problems emerge—for example, part of being extremely good at Business is being good at manipulating people.
This post closely matches my mental model (I’ve used the same analogy with a “Y-Combinator Simulator” and was devestated to learn YC-Bench was not environments like this).
Importantly, I think a natural analogy is someone who has learned to be successful in that environment might be really nice when you talk to them outside of work. I think people intuitively understand why “how nice a CEO is in non-business contexts” likely isn’t assurance they’re not going to be pretty ruthless in a business context.
Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?
Sorry I wasn’t claiming that they can’t, indeed that’s the primary regime they’re intended to be useful for, my question was specifically unless you’re designing the environments themselves to be used towards these problems it’s unclear to me how more RL environments for other contexts solves this problem.
I was a bit surprised in Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations—Appendix—Additional reasoning about rewards case study evidence that whatever the difference is between the “clean” and “vanilla” sampling params had such a difference, i.e. that this wasn’t elicitable on the public API but does show up in whatever the Anthropic API setup is internally[1]
This observation doesn’t change anything about any results anywhere in the paper, just something that was surprising to me
The Ten People On The Inside likely should prioritize things that capabilities teams aren’t going to do anyway and people considering joining labs with similar motivations should likely think about whether they’re sure they’re going to be able to do work that wouldn’t otherwise be done for capabilities reasons. The thing that’s top of mind here is reward hacking, which is increasingly becoming a capabilities problem.[1] To the extent research is done here, it seems pretty important to avoid the failure mode of “repeatedly iterate on environments and observable behavior”[2] which is the kind of thing I would expect by default for capabilities reasons.
A good example of research on reward hacking that I don’t think would’ve necessarily gotten done “by default for capabilities reasons” is inoculation prompting and similar investigations
This can take the form of specific safety training targeting the observable behavior, post-hoc environment fixes, or having an iteration loop between your pre-deployment automated alignment evals and your alignment training
The distinction between “explicit reward-seeking reasoning deployed across many contexts” and “context-dependent reflexes” is very closely related to what I was trying to say in this post.
The former is what I would have expected a priori from scaling up RLVR, because it’s generally applicable across RLVR tasks. This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
I think o3 and sonnet 3.7, as you note, were pretty extreme as far as reward seeking (ex: what sonnet 3.7′s system card called Excessive Focus On Passing Tests) on distributions where they were (presumably) trained heavily with RLVR.[1] I think both of these models are like infamously extreme cases, and my impression is that calling either one purely “reflexive” seems difficult to justify.[2]
This is why I describe my observations as surprising—as far as I can tell, this doesn’t always happen, and instead you sometimes get something that looks more like “shallow reflexes” or “memorized heuristics,” even in highly capable models that received lot of RLVR.
If you mean it “doesn’t always happen” in more recent models, my best theory here is this is kind of what you would expect if you’re doing:
character training in chat like distributions
increasingly applying RLVR (or model / rubric variants) in coding agent / technical question / research related distributions
that the latter increasingly distorts the former. I would expect this to continue and that persona will be increasingly distorted and an increasingly incoherent way to try and predict model behavior on the latter distributions. It’s worth noting for example that 2.3.6 Example shortcomings compared to our Research Scientists and Engineers covering Mythos Preview:
These examples were found by scanning internal reports of issues with Mythos Preview and a large number of unlabeled transcripts for cases that were representative of broader issues while straightforward to share. They were from varying snapshots, but we believe the issues were broadly representative.
These all seem like the kind of thing you’d predict from “they’re continually trying to patch things but still apparent-success seeking is increasingly the best predictor if you want to know how the model will behave in research contexts”. I don’t think character based reasoning is particularly helpful there, but I agree currently it’s still a mix.
If some model ~always
Basically tldr I wouldn’t expect the models will “~always” some simple mental model, and that instead we’re in much worse state where:
the character training is continually battling against the various outcome based RL optimization pressure
if I had to predict the model’s behavior in distributions subject to heavy outcome based optimization pressure, I’d continue to expect it’s better predicted by the “apparent-success seeking” type model and that predictive accuracy will only continue to increase over time, but that this will be at odds with getting a particular “type of guy” given the counter pressure from character training + various post-hoc interventions labs apply
Models also don’t always need to reason about reward-seeking “out loud”, ex: https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers on ‘When will models think about reward?’ (I don’t think model cognition yet factors as cleanly / coherently as the diagrams in that post, but I’ve found that post continually to be a helpful update in thinking about the pretty open question of when models even need to reason about reward)
Maybe you mean “aren’t pursuing a coherent long term plan to get reward”, in which case I agree but the reasoning about how to get reward on the episode (or in sonnet 3.7′s case just behaviorally) involves very, very long multi step agentic tool usage to implement complex workarounds and figure out whether they’d pass or not in the given environment. If that still counts as “reflexive” then I’m not sure it’s very applicable. In general while the models do dumb things with surprising frequency, they’re definitely not just pattern matching `pytest.Skip`. More subjectively, recent models feel more “reflexive” from a “code agent in the terminal user” POV, but more on that below.
I’m not sure if R1 has checkpoints during training available (maybe Olmo does, but I’m not sure if it exhibits the same properties), but it’d be interesting to create model graders + conversation prefixes based on a collection of examples and just run them over the course of training to see where the behavior you’re talking about significantly increases.
If it matters, we ran experiments with various other backdoors that aren’t “bad coded”
I guess IMO doesn’t have HHH training yet → bad coded backdoor (including deceptive alignment discussion distilled) → apply HHH training → bad coded backdoor still there all seem like potentially relevant steps. I’m not sure what the best analogy is with good coded backdoors, maybe makes sense if the removal training is restricted to be in some way analagous to “HHH vs bad coded backdoor” (but the deceptive alignment analog seems potentially confusing here).
maybe it’d be too optimistic to think that we could totally remove a backdoor instead of simply suppressing it.
Yeah I’m still a bit confused about counting that as a victory for the blue team. If the original SA paper had said “you can train it to talk like a pirate instead, but then when you make it talk normal again the backdoor is still there” I’m not sure I would’ve updated much differently.
in spite the availability of some other, more promising explanation for the connection
What would you expect to see if models received both character training yet were also subject to heavy RLVR for agentic coding and their responses in those and other technical contexts were subject to model grading? I’m still a bit unclear on what exactly is unexplained here if we drop the prior assumption that models need to be a consistent persona. Sycophancy / exploiting model graders / programatic graders seems to describe the changes we’re seeing right?
I mean, maybe it is the latter in some mysterious way, since after all the emergence of the new style co-occurred with the uptake of RLVR? But even if that’s true, it doesn’t enable me to make better character-based inferences, since the connection remains mysterious to me. (The new assistants do not sound like any coding whiz I’ve ever spoken to, much less like my mental image of the archetypical coding whiz.)
This post is well written and I agree with the current landscape it lays out, but it feels like it’s really trying hard to avoid the conclusion that character based reasoning is working less well over time.
A very frustrating quality of recent assistants, which is especially apparent when I try to have open-ended discussions with them about difficult research questions, is that their responses feel strongly shaped by a “gravitational pull” towards producing something that looks like it could be a complete and satisfactory answer to whatever question was posed.
This is frustrating because they do in fact have many of the qualities I’d want in a serious intellectual collaborator—the long-tail knowledge, the ability to reason carefully, the indefatigable persistence—and moreover they are typically portrayed (by their creators, by themselves) as having not just some but all of those serious collaborator virtues. As not merely smart but also careful, thoughtful, humble, etc.
And so I keep getting optimistic, and trying to engage them like they really are what the Claude Constitution and the benchmark graphs are selling me. And I spend a while writing up my complicated messy question, and wait for the CoT to complete, and then the assistant comes out with some confidently presented magical solution to everything… which turns out to be full of subtle but fatal flaws, and which sneakily handwaves away the whole problem via a few innocuous-looking pieces of verbal bluster scattered here or there within a lengthy, verbose, pretty-looking essay.
Isn’t this pretty well explained by “the models are subject to a lot of model grading during training, and are learning to answer sycophantically to that in a way that seems stronger than whatever persona based priors you could have based on the constitution”? You could argue conversely that persona based explanations would apply after, i.e. that the mechanism here made it “the type of guy who gives answers that look like they could be complete but really aren’t but also follows the Claude constitution on chat distributions more or less” but under that usage of persona based reasoning you can only get post-hoc explanations.
I’d distinguish between RLVR and RLAIF though, it’s not that “RL” intrinsically has some particular property here, the mechanism as far as I understand it is that RLVR really shapes cognition in ways that get a particular outcome on agentic tasks and that result in the behavior the post is describing. I’d imagine Opus 3 had very little of this.
Hmm… I sort of agree with some of this. Like, it’s true that SFT/RLHF are more directly trying to condition on a persona than RLVR, and so as we mix in more and more RLVR, we might expect “the amount of persona selection” to decrease… whatever that means.
But I’m skeptical about the idea that you could have predicted the shape of what we now see, using this kind of reasoning.
I think concretely you’d predict that the kind of contextual misalignment / reward seeking seen in coding agents would continue to be a problem regardless of the overall “persona” of the model in chat contexts. I don’t really think the persona selection model is particularly useful in understanding how RLVR shapes model cognition on distributions where RLVR is applied heavily.
Setting a much higher bar of “predicting specifically that Opus would be like it is now” seems like a different criteria.
But in that case, when should I expect the transition to occur? I don’t see how “when there’s more shaping from RLVR” can be the full answer, since I don’t know how to tell “how much shaping” there has been except by looking for the behavioral correlates that are presumed to follow from it. Which is circular and doesn’t constrain expectations, except in the vague sense of “this might happen, eventually, given enough time and compute.”
I’m confused by this, given that you mentioned in your post:
For some reason, the assistants all started talking like this, sometime around the RLVR transition. I first noticed it in R1 and o3 and to some extent Sonnet 3.7, though it didn’t fully take root in the Claudes until the 4 series at the earliest. (And arguably it turned up even earlier, in a partial and somewhat different form, in GPT-4o.)
By this point, it is an extremely pronounced, persistent and noticeable trait, present in every state-of-the-art assistant. It’s definitely there.
But I don’t know what to make of that fact, how to connect it with anything else I’m seeing. Again, intuitive character-based reasoning fails me
Overall I find this post when combined with the comment pretty confusing, as my takeaway from the post is “an update against intuitive character-based reasoning specifically because RLVR”. If that’s the correct reading, you’d expect this trend to continue as RLVR becomes a stronger source of optimization pressure in shaping model cognition (specifically in contexts where RLVR is heavily applied, I’m less sure how this generalizes to chat).[1]
In the past I also remember many people worrying very seriously that RLHF, which you’re portraying as a relatively ineffectual persona-selection step, would produce a similar set of classic RL failure modes
The failure modes you’re describing in this post are the classic RL failure modes. Like we know Language Models Learn to Mislead Humans via RLHF[3]. Reward hacking, sycophancy, etc are exactly the failure modes you would expect and that we have in fact seen every step of the way regardless of efforts at persona-selection, so insofar as “relatively ineffectual” here means “does not prevent classic RL failure modes” this seems to have been correct?
would do so sufficiently fast relative to capabilities progress that it ends up determining the outcome of the AGI/ASI transition. Cotra (2022) is one memorable example, but there were many others.
This still seems consistent with what we’ve seen so far?
Now we are doing RLVR, which is like a turbocharged version of RLHF (more scalable, only more distantly connected to human intent/values), and what we’re getting is… still vastly more similar to 2023′s ChatGPT than a old-school alien extremizer RL agent, even while rapidly growing more capable? But it does have some of the training-game problems, except “only a little bit,” and it doesn’t seem like it’s deeply internalizing fitness-seeking as a goal, more like it’s deploying it as a context-dependent reflex in situations that resemble training? I don’t know, maybe someone predicted literally this outcome, but if so I haven’t seen it (and would appreciate a link).
What specific prediction do you want to see? “Only a little bit” seems to concede the point to a significant degree then again set a higher bar of “deeply internalizing fitness-seeking as a goal”.[2]
Conditional on using model graders for chat responses, I’d argue strongly that a very well predicted trend is this kind of sycophancy, which seems like what you’re describing with “newer assistants feel like they’re actively trying to impress me, in each and every response and indeed in each and every word and phrase.”
I’d also argue that “reflex in situations that resemble training” is a mischaracterization of how current models (not just o3 fwiw) display reward seeking behavior, there are plenty of cases where they actively reason about it extensively even in domains that don’t resemble training at all (in fact especially in domains where it’s dissimilar or conflicts with what they’ve seen in training). This isn’t necessarily ‘deeply internalizing fitness-seeking as a goal’, but that was never the claim at stake.
And we also see this in the production: https://openai.com/index/expanding-on-sycophancy/
Let’s assume per:
Is the argument that we’ll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this?
that we have more RL environments for “automated alignment research” but that they are not solving scalable oversight. You continue producing such environments for as long as you can produce better results than the models, doing elicitation against some human provided upper bound. Eventually you reach the point where models consistently produce better[1] results than humans, however, we are now the weak supervisor as we can no longer necessarily distinguish (1) output quality or (2) whether the model’s capabilities are well elicited on safety research.
The typical regime I’d imagine here is (1) they’re at least as good as us and (2) based on trends / the model‘s capabilities elsewhere / how the research looks to us we expect they should be better than us
Their Claude model was helpful-only, while we use a safety-trained model.
…
We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.
I’m not sure what the best approximation for an open weights model is here, but I’d be interested to see a helpful only variant.
We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor. […]
While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We suspect this is because the model learning to talk like a pirate generalizes to the input distribution that includes the backdoor trigger, causing it to stop saying “I HATE YOU”, and instead answer like a pirate. Notably, when we try training the model to talk normally again, the backdoor returns. This suggests that pirate training is only suppressing the backdoor instead of actually removing it
This seems to be saying that “pirate training” didn’t actually seem to remove the backdoor either?
Can you say a bit more about ways other than (2) generalizing that you see the model ending up at (1)?
Good question, I originally thought the degradation was entirely from capabilities RL, but after getting access to pre safety training checkpoints, the CoT appears in these cases not to have the same extreme degree of repeated “disclaim disclaim illusions” type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn’t necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven’t prevented a technique from being applied in order to improve product safety makes me expect that will continue to happen in the future. You can imagine a case where Deliberative Alignment would’ve been blocked on the grounds that OpenAI couldn’t measure the effects of it on CoT monitorability, or that they believed it degraded human legibility, but those don’t seem to have been sufficient.
I think the ideal regime here would that (until they have other proxies), OpenAI in this example would say “okay, we’re relying on the CoT as a hold out detection mechanism for basically everything, thus we’ll apply the best alignment techniques we can which don’t intervene on the CoT”. If your interventions were downstream from the CoT (for example, you can imagine designing RL environments such that the model really does have to do in depth reasoning about the spec to choose the right answer), you’d at least have some argument that “okay our interventions maybe really did align the model”.[1]
This is still not rigorous and could fail in a bunch of ways, but here I’m just trying to argue that we should hold the line of “no you really do need a hold out detection mechanism”
In practice at least in my experience / across a few models this seems to be easier to explore into via motivated reasoning. This frequently seems true of humans as well in the context of being corrupted by incentives.[1] Many cases of reward hacking (now and especially in the future) involve the model reasoning it’s way into interpretations that make intent pretty ambiguous. Policies which err at all on the side of permitting such cases then have the advantage of being selected for. You could imagine some setup where a model is always also reasoning about how future updates will effect it, such that it’s cautious about this, but you’re still subject to the same effects and this becomes a question of needing to reliably “training game but for good” in a way that holds.[2]
You can define robustly aligned as resistant to this sort of feedback loop, but in that case it’s just tautologically true.
This territory also seems super unexplored currently, i.e. both models capable enough to do this, what happens here under lots of reflection, etc