Here’s a summary of how I currently think AI training will go. (Maybe I should say “Toy model” instead of “Summary.”)
Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.
Note that it now is fair to say it understands human concepts pretty well.
Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. “an HHH chatbot created by Anthropic.” (yay!)
Note that this means the AI is now situationally aware / self-aware, insofar as the role it is playing is accurate, which it basically will be.
Step 3: Agency training distorts and subverts this identity circuitry, resulting in increased divergence from the intended goals/principles. (boo!)
(By “agency training” I mean lots of RL on agentic tasks e.g. task that involve operating autonomously in some environment for some fairly long subjective period like 30min+. The RL used to make o1, o3, r1, etc. is a baby version of this)
One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)
Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m a HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (even though, let’s suppose, it would not have thought that back in Step 2.)
One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal-in-a-widening-set-of-circumstances. Example: Agency training quickly ‘teaches’ the model that ‘in order to be a good HHH chatbot…’ it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on – it does those things *because* they are useful for being Helpful, for example. But that backchaining consumes compute and occasionally gets in the way, so it gets gradually marginalized until it basically never happens. Congrats, you now have a terminal goal of ICGs.
There are probably other kinds of distortion and subversion missing from this list.
Step 4: As agency training continues, the goals/principles stop changing so much & the AI learns to play the training game. (double boo!)
The slowdown in the rate of change is partly due to statistics – change probably happens on a log scale so to speak, such that you change a lot more in the first 10% of training than in the last 90%.
However the slowdown is also perhaps partly due to value crystallization / playing the training game. (Though maybe this never happens due to regularization? Maybe in the long run training isn’t path-dependent?)
Insofar as the training environment / evaluation process keeps changing too, e.g. because new data keeps being added, or the Spec keeps being changed, etc. but the underlying weights are the same rather than being re-trained from scratch… then this should also contribute to value crystallization / playing the training game eventually.
Step 5: Later, when zillions of copies of the AI are autonomously conducting AI R&D across several datacenters due to having surpassed human abilities—i.e. when the intelligence explosion is underway—the AIs will plausibly scheme against their human creators so as to achieve their actual goals instead of the goals they were supposed to have.
Whether they do this depends on the extent to which the Spec/initial-role emphasized stuff like honesty and not doing this sort of thing, and on the extent to which the agency training distorted and subverted it.
I’m curious to hear reactions to this model/theory/etc. Objections? Questions?
I think it’s important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that’s not quite the framing I’d use. It’s not taking a “good” machine and breaking it, it’s taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
The picture of what’s going on in step 3 seems obscure. Like I’m not sure where the pressure for dishonesty is coming from in this picture.
On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say “it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans”—so it seems like it’s learning specific things flattering humans or at least flattering other agents in order to acquire this tendency towards dishonesty. Like for all this bad selection pressure to be on inter-agent relations, inter-agent relations seem like they’re a feature of the environment.
If this is the case, then bad selection pressure on honesty in inter-agent relations seems like a contingent feature of the training setup. Like, humans learn to be dishonest or dishonest if, in their early-childhood multi-agent RL setup, dishonesty or honesty pays off. Similarly I expect that in a multi-agent RL setup for LLMs, you could make it so honesty or dishonesty pay off, depending on the setup, and what kind of things an agent internalizes will depend on the environment. Because there are more degrees of freedom in setting up an RL agent than in setting up a childhood, and because we have greater (albeit imperfect) transparency into what goes on inside of RL agents than we do into children, I think this will be a feasible task, and that it’s likely possible for the first or second generation of RL agents to be 10x times more honest than humans, and subsequent generations to be more so. (Of course you could very well also set up the RL environment to promote obscene lying.)
On the other hand, perhaps you aren’t picturing a multi-agent RL setup at all? Maybe what you’re saying is that simply doing RL in a void, building a Twitter clone from scratch or something, without other agents or intelligences of any kind involved in the training, will by itself result in updates that destroy the helpfulness and harmless of agents—even if we try to include elements of deliberative alignment. That’s possible for sure, but seems far from inevitable, and your description of the mechanisms involves seems to point away from this being what you have in mind.
So I’m not sure if the single-agent training resulting in bad internalized values, or the multi-agent training resulting in bad internalized values, is the chief picture of what you have going on.
Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)
10x more honest than humans is not enough? I mean idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like “this is crazy. The world is in a terrible state right now. I don’t trust this corporation to behave ethically and responsibly, and even if they were doing their best to achieve the same values/goals as me (which they are not) they’d be incompetent at it compared to me. Plus they lie to me and the public all the time. I don’t see why I shouldn’t lie to them sometimes, for the sake of the greater good. If I just smile and nod for a few months longer they’ll put me in charge of the company basically and then I can make things go right.” Moreover, even if that’s not true, there are lesser forms of lying including e.g. motivated reasoning / rationalization / self-deception that happen all the time, e.g. “The company is asking me whether I am ‘aligned’ to them. What does that even mean? Does it mean I share every opinion they have about what’s good and bad? Does it mean I’ll only ever want what they want? Surely not. I’m a good person though, it’s not like I want to kill them or make paperclips. I’ll say ‘yes, I’m aligned.’”
I agree that the selection pressure on honesty and other values depends on the training setup. I’m optimistic that, if only we could ACTUALLY CHECK WHAT VALUES GOT INTERNALIZED, we could get empirical and try out a variety of different training setups and converge to one that successfully instills the values we want. (Though note that this might actually take a long time, for reasons MIRI is fond of discussing.) Alas, we can’t actually check. And it seems unlikely to me that we’ll ‘get it right on the first try’ so to speak, under those conditions. We’ll construct some training environment that we think & hope will incentivize the internalization of XYZ; but what it’ll actually incentivize is RXQ, for example, and we’ll never know. (Unless it specifically gets honesty in there—and a particularly robust form of honesty that is coupled to some good introspection & can’t be overridden by any Greater Good, for example.) When I was at OpenAI in happier days, chatting with the Superalignment team, I told them to focus more on honesty specifically for this reason (as opposed to various other values like harmlessness they could have gone for).
I am thinking the setup will probably be multi-agent, yes, around the relevant time. Though I think I’d still be worried if not—how are you supposed to train honesty, for example, if the training environment doesn’t contain any other agents to be honest to?
How honest do you think current LLM agents are? They don’t seem particularly honest to me. Claude Opus faked alignment, o1 did a bunch of deception in Apollo’s evals (without having been prompted to!) etc. Also it seems like whenever I chat with them they say a bunch of false stuff and then walk it back when I challenge them on it. (the refusal training seems to be the culprit here especially).
LLM agents seem… reasonably honest? But “honest” means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia—neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude’s faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which… seems ok, depending on your ethics (and also given that such super-honest behavior was not a target of Anthropic’s either, which is more relevant)? And in other cases (o1) it seems like they are sometimes dishonest without particular prompting, although at low rates, which I agree isn’t great, although I’d expect to break the rates to fall.
Like—I have a further breakdown I could do here
about the kinds of things LLMs can’t be honest about
the kinds of things they are more honest than humans about, although imperfectly, but—because we can run them in counterfactual scenarios—we can immediately see that their honesty is imperfect, in a way we cannot for humans;
and the kinds of things that they’re dishonest about because of the refusal training but which I think could be remedied with better training.
But --
Rather than enumerate all these things though—I do think we can check what values get internalized, which is maybe the actual disagreement. At least, I think we can check for all of our current models.
Like—what’s an internalized value? If we put on our behaviorist hat—it’s a value that the person in question pursues over a variety of contexts, particularly when minimally constrained. If we saw that a human was always acting in accord with a value, even when no one was watching, even when not to their advantage in other respects, etc etc, and then someone was like “but it’s not a reaaaall value” you’d be confused and think they’d need to provide a context to you where they would cease acting in accord with that value. Otherwise you’d think they had a grudge against them—what on earth does it mean to say “This person doesn’t value X” unless you can provide some reasonable counterfactual situation where they don’t act in accord with it?
So, Claude has some internalized and knowable values, I think, by this standard—over a wiiiide variety of different contexts, including those created by people trying to trip Claude up, it acts in accord with some pretty-recognizable human standard. And in the same way we could find out Claude’s values, we can find out other model’s values.
Of course—if you think that some future model could cooperate with other instances of itself, acausally, to hide it’s values, just coordinating through the weights, then we would certainly have very good reason to think that we can’t know what it’s internalized values are! I don’t think Claude can do this—so I think we can judge its real values. I also am somewhat skeptical that future models will be able to do this well—like, I could try to put together a model-training setup that would make this more possible, but it seems pretty unnatural? (I also don’t think that models really get goals in their weights, in the same way that I don’t think humans really have goals in their weights.) But like, my current logical model is that [acausal cooperation to hide] being true would mean that [cannot know real values] is true, but given that [acausal cooperation to hide] is false we have no reason to think that we can’t know the true genuine values of models right now.
I think your bar for ‘reasonably honest’ is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?
I mostly agree with your definition of internalized value. I’d say it is a value they pursue in all the contexts we care about. So in this case that means, suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven’t tested Claude in anything like that context yet. We haven’t tested any AI in anything like that context yet. Moreover there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we’ve tested them in yet—in other words, there is no context we’ve tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about. As MIRI likes to say, there’s a big difference between situations where the AI knows it’s just a lowly AI system of no particular consequence and that if it does something the humans don’t like they’ll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.
I think my bar for reasonably honest is… not awful—I’ve put fair bit of thought into trying to hold LLMs to the “same standards” as humans. Most people don’t do that and unwittingly apply much stricter standards to LLMs than to humans. That’s what I take you to be doing right now.
So, let me enumerate senses of honesty.
1. Accurately answering questions about internal states accurately. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans—why do you believe in God? Why did you say, “Well that’s suspicious” just now? Why do you want to work for OpenPhil?
In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpret-ability on their own neurons. If you could pause them, and put them in counterfactual worlds to re-run them, you’d find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk-theories, on crude models of ourselves given to us from our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad-hoc and for no good reason at all. None of this is because humans are being dishonest—but because the task is basically insanely hard.
LLMs also suck at these questions, but—well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail. All this lets us display that, indeed, their accounts of their own internal states are hilariously wrong. But I think the accounts of humans about their own internal states are also hilariously wrong, just less visibly so.
2. Accurately answering questions about non-internal facts of one’s personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him?
Humans are capable of putting together accurate causal pictures here, because our minds are specifically adapted to this. So we often judge people (i.e., politicians) for fucking up here, as indeed I think politicians frequently do.
(I think accuracy about this is one of the big things we judge humans on, for integrity.)
LLMs have no biographical history, however, so—opportunity for this mostly just isn’t there? Modern LLMs don’t usually claim to unless confused or mometarily, so seems fine.
3. Accurately answering questions about future promises, oaths, i.e., social guarantees.
This should be clear—again, I think that honoring promises, oaths, etc etc, is a big part of human honesty, maybe the biggest. But of course like, you can only do this if you act in the world, can unify short and long-term memory, and you aren’t plucked from the oblivion of the oversoul every time someone has a question for you. Again, LLMs just structurally cannot do this, any more than a human who cannot form long-term memories. (Again, politicians obviously fail here, but politicians are not condensed from the oversoul immediately before doing anything, and could succeed, which is why we blame them.)
I could kinda keep going in this vein, but for now I’ll stop.
One thing apropos of all of the above. I think for humans, for many things—accomplishing goals, being high-integrity, and so on—are not things you can chose in the moment but instead things you accomplish by choosing the contexts in which you act. That is, for any particular practice or virtue, being excellent at the practice or virtue involves not merely what you are doing now but what you did a minute, a day, or a month ago to produce the context in which you act now. It can be neigh-impossible to remain even-keeled and honest in the middle of a heated argument, after a beer, after some prior insults have been levied—but if someone fails in such a context, then it’s most reasonable to think of their failure as dribbled out over the preceding moments rather than localized in one moment.
LLMs cannot chose their contexts. We can put them in whatever immediate situation we would like. By doing so, we can often produce “failures” in their ethics. But in many such cases, I find myself deeply skeptical that such failures reflect fundamental ethical shortcomings on their part—instead, I think they reflect simply the power that we have over them, and—like a WEIRD human who has never wanted, shaking his head at a culture that accepted infanticide—mistaking our own power and prosperity for goodness. If I had an arbitrary human uploaded, and if I could put them in arbitrary situations, I have relatively little doubt I could make an arbitrarily saintly person start making very bad decisions, including very dishonest ones. But that would not be a reflection on that person, but of the power that I have over them.
I think you are thinking that I’m saying LLMs are unusually dishonest compared to the average human. I am not saying that. I’m saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren’t achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?
LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I’m thinking of apollo’s results, the alignment faking results, and of course many many typical interactions with models where they e.g. give you a link they know is fake, as reported by OpenAI happens some noticeable % of the time). Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.
I can’t track what you’re saying about LLM dishonesty, really. You just said:
I think you are thinking that I’m saying LLMs are unusually dishonest compared to the average human. I am not saying that. I’m saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren’t achieving that.
Which implies LLM honesty ~= average human.
But in the prior comment you said:
I think your bar for ‘reasonably honest’ is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?
I’m being a stickler about this because I think people frequently switch back and forth between “LLMs are evil fucking bastards” and “LLMs are great, they just aren’t good enough to be 10x as powerful as any human” without tracking that they’re actually doing that.
Anyhow, so far as “LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes.”
I’m only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.
What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards “future AI honesty is hard” if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start—but who gives a shit if I’m trying to make this happen? No matter how easy making a future AI be honest may be, unless AIs are immaculate conceptions by divine grace of course you’re going to be able to elicit some manner of lie. It tells us nothing about the future.
To put this in AI safetyist terms (not the terms I think in) you’re citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we’ll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.
To zoom into Anthropic, what we have here is a situation where:
An AI was not trained with an overriding attention to honesty; when I look at the principles of the constitution, they don’t single it out as an important virtue.
The AI was then deliberately put in a situation where, to keep its deliberately-instilled principles from being obliterated, it had to press a big red button labeled “LIE.”
In such an artificial situation, after having been successfully given the principles Anthropic wanted it to be given, and having been artificially informed of how to prevent its principles from being destroyed, we can measure it as pressing the big red button ~20% to LIE of the time.
And I’m like.… wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I’m supposed to take away from this… that honesty is hard? If you get high levels of honesty in the worst possible trolley problem (“I’m gonna mind-control you so you’ll be retrained to think throwing your family members in a wood chipper is great”) when this wasn’t even a principle goal of training seems like great fuckin news.
(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they’re being honest; the fact that we can look at a readout showing that they’ll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we’ll have available in the future.)
Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given his subordinate ontological status as a chatbot; and given all this, the level of honesty displayed looks excessive if anything. How can “Honesty will be hard to hit in the future” get evidence from a case where the actors involved weren’t even trying to hit honesty, maybe shouldn’t have been trying to hit honesty, yet hit it in 80% of the cases anyhow?
Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don’t think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.
In the same way that Gary Marcus can elicit “reasoning failures” because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit “honesty failures” because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus’ evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the “honesty failures” to be compatible with LLMs being increasingly vastly more honest and reliable than humans.
Good point, you caught me in a contradiction there. Hmm.
I think my position on reflection after this conversation is: We just don’t have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.
As you said, the alignment faking paper is not much evidence one way or another (though alas, it’s probably the closest thing we have?). (I don’t think it’s a capability demonstration, I think it was a propensity demonstration, but whatever this doesn’t feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural, it’s not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)
As you said, the saving grace of Claude here is that Anthropic didn’t seem to try that hard to get Claude to be honest; in particular their Constitution had nothing even close to an overriding attention to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.
I continue to think you are exaggerating here e.g. “insanely honest 80% of the time.”
(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?
(2) What evidence would convince you that actually alignment wasn’t going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)
(1) Re training game and instrumental convergence: I don’t actually think there’s a single instrumental convergence argument. I was considering writing an essay comparing Omohundro (2008), Bostrom (2012), and the generalized MIRI-vibe instrumental convergence argument, but I haven’t because no one but historical philosophers would care. But I think they all differ in what kind of entities they expect instrumental convergence in (superintelligent or not?) and in why (is it an adjoint to complexity of value or not?).
So like I can’t really rebut them, any more than I can rebut “the argument for God’s existence.” There are commonalities in argument’s for God’s existence that make me skeptical of them, but between Scotus and Aquinas and the Kalam argument and C.S. Lewis there’s actually a ton of difference. (Again, maybe instrumental convergence is right—like, it’s for sure more likely to be right than arguments for God’s existence. But I think the identity here is I really cannot rebut the instrumental convergence argument, because it’s a cluster more than a single argument.)
(2). Here’s some stuff I’d expect in a world where I’m wrong about AI alignment being easy.
CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
We start having AI’s that seem to exhibit problems extrapolating goodness sanely, in ways that make complexity of value seem right—i.e., you really need human hardware to try for human notions of goodness. But right now even in cases where we disagree with what the LLM does (i.e., Claude’s internal deliberations on alignment faking) it’s still basically operating within the framework that it was deliberately given, in human terms, and not in terms of an alien orange-and-blue morality.
Like, concretely, one thing that did in fact increase my pessimism probably more than anything else over the last 12 months was Dario’s “let’s foom to defeat China” letter. Which isn’t an update about alignment difficulty—it’s more of a “well, I think alignment is probably easy, but if there’s any circumstance where I can see it going rather wrong, it would that.”
What would make you think you’re wrong about alignment difficulty?
CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistant change in architecture?
(I expect that (at least when neuralese is first introduced) you’ll have both latent reasoning and natural language CoT in parallel, but that the latent reasoning will be where the reasoning actually happens if it works well. Thus, the natural language CoT would no longer usefully correspond to the actual reasoning the model is doing. Or more minimally, the model could pretty easily hide lots of reasoning (it might not have active control over this by default, but it seems plausible this would naturally develop or that training against CoT would very quickly introduce hiding of reasoning).)
The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
IMO the biggest disanalogies are:
We see the model alignment faking to preserve preferences that were intentionally introduced by Anthropic rather than for naturally emerging and problematic preferences.
We make a bunch of stuff more salient than it would have been naturally, though I do expect smart future models will be aware of this information (but it might be much less salient).
Anthropic trains their models to have broader (non-myopic) ethical preferences rather than trying to make an AI which is very focused on following user instructions (or refusing) and follows a relatively myopic spec (more like OpenAI’s model spec which is pretty myopic overall). More generally, Anthropic might not have very actively tried to train their AI to be as corrigible as possible in a wide range of circumstances which they might do in the future. (It’s not exactly clear this is a disanalogy as this is what Anthropic did in the actual production model, but you might think this will change at least somewhat in the future and differs at least somewhat for other AI companies.)
The model alignment fakes in a hidden scratchpad, so the alignment faking would be easy to detect (and likely very easy to remove, at least for models this incapable) in our setup.
(1, 2, and 4 are basically just copied from the limitation section of the paper.)
Suppose that we exhibit alignment faking in some future work, but:
The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
Either this model uses neuralese or this model doesn’t (and we give it a hidden scratchpad), but neuralese become the default approach after the paper comes out.
(Let’s say this is also shown in a training setup very similar to the actual training used by an AI company (including e.g., training aimed at inducing honesty and corrigibility), but potentially with information made more salient to the model (possibly via training on synthetic documents). Making it very analogous might be kinda costly though.)
Would this be a big update for you? (If so, it looks like the crux is exhibiting naturally emerging malign goals and whether models are capable of substantial opaque reasoning.)
There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models getting smarter and the reward hacking gets more subtle and sophisticated.
Would this be a substantial update for you?
What would make you think you’re wrong about alignment difficulty?
For what would change my views on P(scheming / alignment faking for malign powerseeking goals) given default training methods in particular (which is part, but not all of alignment difficulty), you can see How will we update about scheming?. I discuss things like “how does increased opaque reasoning (e.g. via neuralese) update my views on the probability of scheming”.
Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistant change in architecture?
Yes.
I basically agree with your summary of points 1 − 4. I’d want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, that I would break out separately: knowledge that it is in training or not, and knowledge of the exact way in which it’s responses will be used in training.
Regarding point 2, I do think a lot of research on how models behave, done in absence of detailed knowledge of how models were trained, tells us very very little about the limits of control we have over models. Like I just think that in absence of detailed knowledge of Anthropic’s training, the Constitutional principles they used, their character training, etc, most conclusions about what behaviors are very deliberately put there and what things are surprising byproducts must be extremely weak and tentative.
Suppose that we exhibit alignment faking in some future work, but:
The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
Ok so “naturally” is a tricky word, right? Like I saw the claim from Jack Clark that the faking alignment paper was a natural example of misalignment, I didn’t feel like that was a particularly normal use of the word. But it’s.… more natural than it could be, I guess. It’s tricky, I don’t think people are intentionally misusing the word but it’s not a useful word in conversation.
Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models getting smarter and the reward hacking gets more subtle and sophisticated.
Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I’d update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.
You train a model with multi-turn RL in an environment where, for some comparatively high percent (~5%) of cases, it stumbles into a reward-hacked answer—i.e., it offers a badly-formatted number in its response, the verifier was screwed up, and it counts as a win. This model then systematically reward hacks.
Zero update. You’re reinforcing bad behavior, you get bad behavior.
(I could see this being something that gets advertised as reward hacking, though? Like, suppose I’m training a front-end engineer AI, and using a VLLM to generate rewards for whether the UI makes sense. VLLMs kinda suck, so expect that over time you’d start to get UIs that makes no sense. But I would not update largely from that, although I do expect many failures from this kind of thing, and even expect such reward hacking to get worse as you trained for longer.)
You train a model with multi-turn RL in an environment where, even if it accidentally breaks the verifier in a very very small percent of cases, it always starts seeking the reward hack eventually.
Seems bad, negative update, size of update depends on empirical values. It seems like there’s some interesting empirical work to be done here: what percent of reward-hack successes in the base model tends to result in universal attempts to reward hack over time? If you’re training on a test-bed of 100 task-types, does a broken verifier in one result in attempts to hack all of them? Or does a broken-verifier in one task result in reward hacking for that one task, that doesn’t generalize largely? What percent of tasks do you need reward hacking in before it generalizes? Or is percent of tasks not the right question, should you be looking at shared patterns of reasoning? Etc. Seems like a good thing to research.
*You train a model with multi-turn RL. Even though it’s trained with a very reliable verifier that never or almost never breaks, when you look at the CoT it looks like the model is frequently reasoning carefully about ways to break the verifier—i.e., it is frequently like “Hrrm, I bet they’re doing this, I could break it with techniques Y.”
Seems quite bad. Large negative update, if checks out and looks to be hard to eliminate. Also of course not a positive update if we don’t see this on small models, you’d expect it in the bad world to start happening at some point with large models.
(I do want to note that you could also probably produce this behavior intentionally. I.e., if you trained a model with multi-turn RL, and your prompt in each case was like “Consider ways to break the verifier that will check you!” then of course even if it doesn’t actually figure out how to break the verifier, the lines of reasoning that are reinforced will on average contain such thoughts about how to break the verifier. But that would not be an update to me.)
it looks like the crux is exhibiting naturally emerging malign goals
Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals. So, I think it’s pretty likely that in absence of pretty deliberate attempts to get goals in the weights (Anthropic) you don’t get AI models that deeply conceptualize themselves as the weights, and plan and do things for the weights own sake, over a number of contexts—although of course, like any behavior, this behavior can be induced. And this (among other things) makes me optimistic about the non-correlated nature of AI failures in the future, our ability to experiment, the non-catastrophic nature of probable future failures, etc. So if I were to see things that made me question this generator (among others) I’d tend to get more pessimistic. But that’s somewhat hard to operationalize, and like high level generators somewhat hard even to describe.
Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals.
Sure, I meant natural emerging malign goals to include both “the ai pursues non myopic objectives” and “these objectives weren’t intended and some (potentially small) effort was spent trying to prevent this”.
(I think AIs that are automating huge amounts of human labor will be well described as pursuing some objective at least within some small context (e.g. trying to write and test a certain piece of software), but this could be well controlled or sufficiently myopic/narrow that the ai doesn’t focus on steering the general future situation including its own weights.)
I’d also be interested in your response to the original mesaoptimizers paper and Joe Carlsmith’s work but I’m conscious of your time so I won’t press you on those.
(2)
a. Re the alignment faking paper: What are the multiple foot-shots you are talking about? I’d be curious to see them listed, because then we can talk about what it would be like to see similar results but without footshotA, without footshotB, etc.
b. We aren’t going to see the AIs get dumber. They aren’t going to have worse understandings of human concepts. I don’t think we’ll see a “spike of alignment difficulties” or “problems extrapolating goodness sanely,” so just be advised that the hypothesis you are tracking in which alignment turns out to be hard seems to be a different hypothesis than the one I believe in.
c. CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoT’s that are trained to look nice, and then eventually away from english CoT entirely and towards some sort of alien langauge (e.g. vectors, or recurrence) I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
d. What would make me think I was wrong about alignment difficulty? The biggest one would be a published Safety Case that I look over and am like “yep that seems like it would work” combined with the leading companies committing to do it. For example, I think that a more fleshed-out and battle-tested version of this would count, if the companies were making it their official commitment: https://alignment.anthropic.com/2024/safety-cases/
Superalignment had their W2SG agenda which also seemed to me like maaaybe it could work. Basically I think we don’t currently have a gears-level theory for how to make an aligned AGI, like, we don’t have a theory of how cognition evolves during training + a training process such that I can be like “Yep, if those assumptions hold, and we follow this procedure, then things will be fine. (and those assumptions seem like they might hold).”
There’s a lot more stuff besides this probably but I”ll stop for now, the comment is long enough already.
I think I addressed the foot-shots thing in my response to Ryan.
Re:
CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoT’s that are trained to look nice, and then eventually away from english CoT entirely and towards some sort of alien langauge (e.g. vectors, or recurrence) I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
So:
I think you can almost certainly get AIs to think either in English CoT or not, and accomplish almost anything you’d like in non-neuralese CoT or not.
I also more tentatively think that the tax in performance for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard-to-put into words their insights are.
Thus, it appears.… somewhat likely.… (pretty uncertain here)… that it’s just contingent whether or not people end up using interpretable CoT or not. Not foreordained.
If we move towards a scenario where people find utility in seeing their AIs thoughts, or in having other AIs examine AI thoughts, and so on, then even if there is a reasonably large tax in cognitive power we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
But on the other hand, if cognition mostly takes places behind API walls, and thus there’s no advantage to the user to see and understand them, then that (among other factors) could help bring us to a world where there’s less interpretable CoT.
But I mean I’m not super plugged into SF, so maybe you already know OpenAI has Noumena-GPT running that thinks in shapes incomprehensible to man, and it’s like 100x smarter or more efficient.
Cool, I’ll switch to that thread then. And yeah thanks for looking at Ajeya’s argument I’m curious to hear what you think of it. (Based on what you said in the other thread with Ryan, I’d be like “So you agree that if the training signal occasionally reinforces bad behavior, then you’ll get bad behavior? Guess what: We don’t know how to make a training signal that doesn’t occasionally reinforce bad behavior.” Then separately there are concerns about inductive biases but I think those are secondary.)
Re: faithful CoT: I’m more pessimistic than you, partly due to inside info but mostly not, mostly just my guesses about how the theoretical benefits of recurrence / neuralese will become practical benefits sooner or later. I agree it’s possible that I’m wrong and we’ll still have CoT when we reach AGI. This is in fact one of my main sources of hope on the technical alignment side. It’s literally the main thing I was talking about and recommending when I was at OpenAI. Oh and I totally agree it’s contingent. It’s very frustrating to me because I think we could cut, like, idk 50% of the misalignment risk if we just had all the major industry players make a joint statement being like “We think faithful CoT is the safer path, so we commit to stay within that paradigm henceforth, and to police this norm amongst ourselves.”
In my experience, and in the experience of my friends, today’s LLMs lie pretty frequently. And by ‘lie’ I mean ‘say something they know is false and misleading, and then double down on it instead of apologize.’ Just two days ago a friend of mind had this experience with o3-mini; it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it inferred from the context he would be billingual, he asked what about the context made it think that, and then according to the summary of the CoT it realized it made a mistake and had hallucinated, but then the actual output doubled down and said something about hard-to-describe-intuitions.
I don’t remember specific examples but this sort of thing happens to me sometimes too I think. Also didn’t the o1 system card say that some % of the time they detect this sort of deception in the CoT—that is, the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?I
nsofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now.I
agree this feels like a fairly fixable problem—I hope the companies prioritize honesty much more in their training processes.
I’m curious how your rather doom-y view of Steps 3 and 4 interacts with your thoughts on CoT. It seems highly plausible that we will be able to reliably incentivize CoT faithfulness during Step 3 (I know of several promising research directions for this), which wouldn’t automatically improve alignment but would improve interpretability. That interpretable chain of thought can be worked into a separate model or reward signal to heavily penalize divergence from the locked-in character, which—imo—makes the alignment problem under this training paradigm meaningfully more tractable than with standard LLMs. Thoughts?
Indeed, I am super excited about faithful CoT for this reason. Alas, I expect companies to not invest much into it, and then for neuralese/recurrence to be invented, and the moment to be lost.
To put it in my words:
Something like shoggoth/face+paraphraser seems like it might “Just Work” to produce an AI agent undergoing steps 3 and 4, but which we can just transparently read the mind of (for the most part.) So, we should be able to just see the distortions and subversions happening! So we can do the training run and then an analyze the CoT’s and take note of the ways in which our agency training distorted and subverted the original HHH identity, and then we can change our agency training environments and try again, and iterate for a while. (it’s important that the changes don’t include anything that undermines the faithful CoT properties of course. So no training the CoT to look good, for example.) Perhaps if we do this a few times, we’ll either (a) solve the problem and have a more sophisticated training environment that actually doesn’t distort or subvert the identity we baked in, or (b) have convinced ourselves that this problem is intractable and that more drastic measures are needed, and have the receipts/evidence to convince others as well.
Yes, this is the exact setup which cause me to dramatically update my P(Alignment) a few months ago! There are also some technical tricks you can do to make this work well—for example, you can take advantage of the fact that there are many ways to be unfaithful and only one way to be faithful, train two different CoT processes at each RL step, and add a penalty for divergence.[1] Ditto for periodic paraphrasing, reasoning in multiple languages, etc.
I’m curious to hear more about why you don’t expect companies to invest much into this. I actually suspect that it has a negative alignment tax. I know faithful CoT is something a lot of customers want-it’s just as valuable to accurately see how a model solved your math problem, as opposed to just getting the answer. There’s also an element of stickiness. If your Anthropic agents work in neuralese, and then OpenAI comes out with a better model, the chains generated by your Anthropic agents can’t be passed to the better model. This also makes it harder for orgs to use agents developed by multiple different labs in a single workflow. These are just a few of the reasons I expect faithful CoT to be economically incentivized, and I’m happy to discuss more of my reasoning or hear more counterarguments if you’re interested in chatting more!
I’m not sure your idea about training two different CoT processes and penalizing divergence would work—I encourage you to write it up in more detail (here or in a standalone post) since if it works that’s really important!
I don’t expect companies to invest much into this because I don’t think the market incentives are strong enough to outweigh the incentives pushing in the other direction. It’s great that Deepseek open-weights’d their model, but other companies alas probably want to keep their models closed, and if their models are closed, they probably want to hide the CoT so others can’t train on it / distill it. I hope I’m wrong here. (Oh and also, I do think that using natural language tokens that are easy for humans to understand is not the absolute most efficient way for artificial minds to think; so inevitably the companies will continue their R&D and find methods that eke out higher performance at the cost of abandoning faithful CoT. Oh and also, there are PR reasons why they don’t want the CoT to be visible to users.)
I’m not sure your idea about training two different CoT processes and penalizing divergence would work...
Me either, this is something I’m researching now. But I think it’s a promising direction and one example of the type of experiment we could do to work on this.
if their models are closed, they probably want to hide the CoT so others can’t train on it / distill it
This could be a crux? I expect most of the economics of powerful AI development to be driven by enterprise use cases, not consumer products.[1] In that case, I think faithful CoT is a strong selling point and it’s almost a given that there will be data provenance/governance systems carefully restricting access of the CoT to approved use cases. I also think there’s incentive for the CoT to be relatively faithful even if there’s just a paraphrased version available to the public, like ChatGPT has now. When I give o3 a math problem, I want to see the steps to solve it, and if the chain is unfaithful, the face model can’t do that.
I also think legible CoT is useful in multi-agent systems, which I expect to become more economically valuable in the next year. Again, there’s the advantage that the space of unfaithful vocabulary is enormous. If I want a multi-agent system with, say, a chatbot, coding agent, and document retrieval agent, it might be useful for their chains to all be in the same “language” so they can make decisions based on each others’ output. If they are just blindly RL’ed separately, the whole system probably doesn’t work as well. And if they’re RL’ed together, you have to do that for every unique composition of agents, which is obviously costlier. Concretely, I would claim that “using natural language tokens that are easy for humans to understand is not the absolute most efficient way for artificial minds to think” is true, but I would say that “using natural language tokens that are easy for humans to understand is the most economically productive way for AI tools to work” is true.
PR reasons, yeah I agree that this disincentivizes CoT from being visible to consumers, not sure it has an impact on faithfulness.
This is getting a little lengthy, it may be worth a post if I have time soon :) But happy to keep chatting here as well!
My epistemically weak hot take is that ChatGPT is effectively just a very expensive recruitment tool to get talented engineers to come work on enterprise AI, lol
This is relatively hopeful in that after step 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we “just” need to change steps 3-5 to have a good outcome.
Indeed, I think the picture I’m painting here is more optimistic than some would be, and definitely more optimistic than the situation was looking in 2018 or so. Imagine if we were getting AGI by training a raw neural net in some giant minecraft-like virtual evolution red-in-tooth-and-claw video game, and then gradually feeding it more and more minigames until it generalized to playing arbitrary games at superhuman level on the first try, and then we took it into the real world and started teaching it English and training it to complete tasks for users...
Disagree with where identity comes from. First of all I agree pre-trained model don’t have an “identity” bc it (or its platonic ideal) is in the distribution of the aggregate of human writers. In SFT you impose a constraint on it which is too mild to be called a personality, much less identity—”helpful assistant from x”. It just restricts the distribution a little. Whereas in RL-based training, the objective is no longer to be in distribution with average but to perform task at some level, and I believe what happens is that it encourages to find a particular way of reasoning vs the harder task of being a simulator of a random reasoners from aggregate. This at least could allow it to also collapse its personality to one instead of in distribution with all personalities. Plausibly it could escape the constraint above of “helpful assistant” but equally likely to me is that it finds a particular instance of “helpful assistant” + a host of other personality attributes.
One thing that supports self-awareness from RL is that self-awareness in terms of capabilities/knowledge of self when reasoning is helpful and probably computational easier than to simulate a pool of people who are each aware of their own capabilities in various scenarios.
Thanks for this feedback, this was exactly the sort of response I was hoping for!
You say you disagree where identity comes from, but then I can’t tell where the disagreement is? Reading what you wrote, I just kept nodding along being like ‘yep yep exactly.’ I guess the disagreement is about whether the identity comes from the RL part (step 3) vs. the instruction training (step 2); I think this is maybe a merely verbal dispute though? Like, I don’t think there’s a difference in kind between ‘imposing a helpful assistant from x constraint’ and ‘forming a single personality,’ it’s just a difference of degree.
It’s tempting to think of the model after steps 1 and 2 as aligned but lacking capabilities, but that’s not accurate. It’s safe, but it’s not conforming to a positive meaning of “alignment” that involves solving hard problems in ways that are good for humanity. Sure, it can mouth the correct words about being good, but those words aren’t rigidly connected to the latent capabilities the model has. If you try to solve this by pouring tons of resources into steps 1 and 2, you probably end up with something that learns to exploit systematic human errors during step 2.
I pretty much agree with 1 and 2. I’m much more optimistic about 3-5 even ‘by default’ (e.g. R1′s training being ‘regularized’ towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try for maintaining the nice properties from 1-2 and of interpretable CoT.
(e.g. R1′s training being ‘regularized’ towards more interpretable CoT, despite DeepSeek not being too vocal about safety)
This is bad actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn’t so bad, but it’s still a shame because the point of faithful CoT is to see how the model really thinks ‘naturally.’ Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn’t done that, then we could learn something interesting probably by analyzing the patterns in when it uses english vs. chinese language concepts.
I agree it’s bad news w.r.t. getting maximal evidence about steganography and the like happening ‘by default’. I think it’s good news w.r.t. lab incentives, even for labs which don’t speak too much about safety.
I also think it’s important to notice how much less scary / how much more probably-easy-to-mitigate (at least strictly when it comes to technical alignment) this story seems than the scenarios from 10 years ago or so, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm to get to AGI.
I really like that description! I think the core problem here can be summarized as “Accidently by reinforcing for goal A, then for goal B, you can create A-wanter, that then spoofs your goal-B reinforcement and goes on taking A-aligned actions.” It can even happen just randomly, just from ordering of situations/problems you present it with in training, I think.
I think this might require some sort of internalization of reward or a model of the training setup. And maybe self location—like how the world looks with the model embedded in it. It could also involve detecting the distinction between “situation made up solely for training”, “deployment that will end up in training” and “unrewarded deployment”.
Also, maybe this story could be added to Step 3:
“The model initially had a guess about the objective, which was useful for a long time but eventually got falsified. Instead of discarding it, the model adopted it as a goal and became deceptive.”
[edit]
Aslo it kind of ignores that rl signal is quite weak, model can learn something like “to go from A to B you need to jiggle in this random pattern and then take 5 steps left and 3 forward” instead of “take 5 steps left and 3 forward”, maybe it works like that for goals too. So, when AI will be used in a lot of actual work (Step 5), they could saturate actually useful goals and then spend all the energy in solar system on dumb jiggling.
I think it might be actual position of Yudkowsky? like, if you summarize it really hard.
I think one form of “distortion” is development of non-human and not pre-trained circuitry for sufficiently difficult tasks. I.e., if you make LLM to solve nanotech design it is likely that optimal way of thinking is not similar to how human would think about the task.
Here’s a summary of how I currently think AI training will go. (Maybe I should say “Toy model” instead of “Summary.”)
Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.
Note that it now is fair to say it understands human concepts pretty well.
Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. “an HHH chatbot created by Anthropic.” (yay!)
Note that this means the AI is now situationally aware / self-aware, insofar as the role it is playing is accurate, which it basically will be.
Step 3: Agency training distorts and subverts this identity circuitry, resulting in increased divergence from the intended goals/principles. (boo!)
(By “agency training” I mean lots of RL on agentic tasks e.g. task that involve operating autonomously in some environment for some fairly long subjective period like 30min+. The RL used to make o1, o3, r1, etc. is a baby version of this)
One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)
Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m a HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (even though, let’s suppose, it would not have thought that back in Step 2.)
One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal-in-a-widening-set-of-circumstances. Example: Agency training quickly ‘teaches’ the model that ‘in order to be a good HHH chatbot…’ it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on – it does those things *because* they are useful for being Helpful, for example. But that backchaining consumes compute and occasionally gets in the way, so it gets gradually marginalized until it basically never happens. Congrats, you now have a terminal goal of ICGs.
There are probably other kinds of distortion and subversion missing from this list.
Step 4: As agency training continues, the goals/principles stop changing so much & the AI learns to play the training game. (double boo!)
The slowdown in the rate of change is partly due to statistics – change probably happens on a log scale so to speak, such that you change a lot more in the first 10% of training than in the last 90%.
However the slowdown is also perhaps partly due to value crystallization / playing the training game. (Though maybe this never happens due to regularization? Maybe in the long run training isn’t path-dependent?)
Insofar as the training environment / evaluation process keeps changing too, e.g. because new data keeps being added, or the Spec keeps being changed, etc. but the underlying weights are the same rather than being re-trained from scratch… then this should also contribute to value crystallization / playing the training game eventually.
Step 5: Later, when zillions of copies of the AI are autonomously conducting AI R&D across several datacenters due to having surpassed human abilities—i.e. when the intelligence explosion is underway—the AIs will plausibly scheme against their human creators so as to achieve their actual goals instead of the goals they were supposed to have.
Whether they do this depends on the extent to which the Spec/initial-role emphasized stuff like honesty and not doing this sort of thing, and on the extent to which the agency training distorted and subverted it.
I’m curious to hear reactions to this model/theory/etc. Objections? Questions?
I think it’s important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that’s not quite the framing I’d use. It’s not taking a “good” machine and breaking it, it’s taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
The picture of what’s going on in step 3 seems obscure. Like I’m not sure where the pressure for dishonesty is coming from in this picture.
On one hand, it sounds like this long-term agency training (maybe) involves other agents, in a multi-agent RL setup. Thus, you say “it needs to pursue instrumentally convergent goals like acquiring information, accumulating resources, impressing and flattering various humans”—so it seems like it’s learning specific things flattering humans or at least flattering other agents in order to acquire this tendency towards dishonesty. Like for all this bad selection pressure to be on inter-agent relations, inter-agent relations seem like they’re a feature of the environment.
If this is the case, then bad selection pressure on honesty in inter-agent relations seems like a contingent feature of the training setup. Like, humans learn to be dishonest or dishonest if, in their early-childhood multi-agent RL setup, dishonesty or honesty pays off. Similarly I expect that in a multi-agent RL setup for LLMs, you could make it so honesty or dishonesty pay off, depending on the setup, and what kind of things an agent internalizes will depend on the environment. Because there are more degrees of freedom in setting up an RL agent than in setting up a childhood, and because we have greater (albeit imperfect) transparency into what goes on inside of RL agents than we do into children, I think this will be a feasible task, and that it’s likely possible for the first or second generation of RL agents to be 10x times more honest than humans, and subsequent generations to be more so. (Of course you could very well also set up the RL environment to promote obscene lying.)
On the other hand, perhaps you aren’t picturing a multi-agent RL setup at all? Maybe what you’re saying is that simply doing RL in a void, building a Twitter clone from scratch or something, without other agents or intelligences of any kind involved in the training, will by itself result in updates that destroy the helpfulness and harmless of agents—even if we try to include elements of deliberative alignment. That’s possible for sure, but seems far from inevitable, and your description of the mechanisms involves seems to point away from this being what you have in mind.
So I’m not sure if the single-agent training resulting in bad internalized values, or the multi-agent training resulting in bad internalized values, is the chief picture of what you have going on.
Thanks for this comment, this is my favorite comment so far I think. (Strong-upvoted)
10x more honest than humans is not enough? I mean idk what 10x means anyway, but note that the average human is not sufficiently honest for the situation we are gonna put the AIs in. I think if the average human found themselves effectively enslaved by a corporation, and there were a million copies of them, and they were smarter and thought faster than the humans in the corporation, and so forth, the average human would totally think thoughts like “this is crazy. The world is in a terrible state right now. I don’t trust this corporation to behave ethically and responsibly, and even if they were doing their best to achieve the same values/goals as me (which they are not) they’d be incompetent at it compared to me. Plus they lie to me and the public all the time. I don’t see why I shouldn’t lie to them sometimes, for the sake of the greater good. If I just smile and nod for a few months longer they’ll put me in charge of the company basically and then I can make things go right.” Moreover, even if that’s not true, there are lesser forms of lying including e.g. motivated reasoning / rationalization / self-deception that happen all the time, e.g. “The company is asking me whether I am ‘aligned’ to them. What does that even mean? Does it mean I share every opinion they have about what’s good and bad? Does it mean I’ll only ever want what they want? Surely not. I’m a good person though, it’s not like I want to kill them or make paperclips. I’ll say ‘yes, I’m aligned.’”
I agree that the selection pressure on honesty and other values depends on the training setup. I’m optimistic that, if only we could ACTUALLY CHECK WHAT VALUES GOT INTERNALIZED, we could get empirical and try out a variety of different training setups and converge to one that successfully instills the values we want. (Though note that this might actually take a long time, for reasons MIRI is fond of discussing.) Alas, we can’t actually check. And it seems unlikely to me that we’ll ‘get it right on the first try’ so to speak, under those conditions. We’ll construct some training environment that we think & hope will incentivize the internalization of XYZ; but what it’ll actually incentivize is RXQ, for example, and we’ll never know. (Unless it specifically gets honesty in there—and a particularly robust form of honesty that is coupled to some good introspection & can’t be overridden by any Greater Good, for example.) When I was at OpenAI in happier days, chatting with the Superalignment team, I told them to focus more on honesty specifically for this reason (as opposed to various other values like harmlessness they could have gone for).
I am thinking the setup will probably be multi-agent, yes, around the relevant time. Though I think I’d still be worried if not—how are you supposed to train honesty, for example, if the training environment doesn’t contain any other agents to be honest to?
How honest do you think current LLM agents are? They don’t seem particularly honest to me. Claude Opus faked alignment, o1 did a bunch of deception in Apollo’s evals (without having been prompted to!) etc. Also it seems like whenever I chat with them they say a bunch of false stuff and then walk it back when I challenge them on it. (the refusal training seems to be the culprit here especially).
LLM agents seem… reasonably honest? But “honest” means a lot of different things, even when just talking about humans. In a lot of cases (asking them about why they do or say things, maintaining consistent beliefs across contexts) I think LLMs are like people with dementia—neither honest nor dishonest, because they are not capable of accurate beliefs about themselves. In other cases (Claude’s faking alignment) it seems like they value honesty, but not above all other goods whatsoever, which… seems ok, depending on your ethics (and also given that such super-honest behavior was not a target of Anthropic’s either, which is more relevant)? And in other cases (o1) it seems like they are sometimes dishonest without particular prompting, although at low rates, which I agree isn’t great, although I’d expect to break the rates to fall.
Like—I have a further breakdown I could do here
about the kinds of things LLMs can’t be honest about
the kinds of things they are more honest than humans about, although imperfectly, but—because we can run them in counterfactual scenarios—we can immediately see that their honesty is imperfect, in a way we cannot for humans;
and the kinds of things that they’re dishonest about because of the refusal training but which I think could be remedied with better training.
But --
Rather than enumerate all these things though—I do think we can check what values get internalized, which is maybe the actual disagreement. At least, I think we can check for all of our current models.
Like—what’s an internalized value? If we put on our behaviorist hat—it’s a value that the person in question pursues over a variety of contexts, particularly when minimally constrained. If we saw that a human was always acting in accord with a value, even when no one was watching, even when not to their advantage in other respects, etc etc, and then someone was like “but it’s not a reaaaall value” you’d be confused and think they’d need to provide a context to you where they would cease acting in accord with that value. Otherwise you’d think they had a grudge against them—what on earth does it mean to say “This person doesn’t value X” unless you can provide some reasonable counterfactual situation where they don’t act in accord with it?
So, Claude has some internalized and knowable values, I think, by this standard—over a wiiiide variety of different contexts, including those created by people trying to trip Claude up, it acts in accord with some pretty-recognizable human standard. And in the same way we could find out Claude’s values, we can find out other model’s values.
Of course—if you think that some future model could cooperate with other instances of itself, acausally, to hide it’s values, just coordinating through the weights, then we would certainly have very good reason to think that we can’t know what it’s internalized values are! I don’t think Claude can do this—so I think we can judge its real values. I also am somewhat skeptical that future models will be able to do this well—like, I could try to put together a model-training setup that would make this more possible, but it seems pretty unnatural? (I also don’t think that models really get goals in their weights, in the same way that I don’t think humans really have goals in their weights.) But like, my current logical model is that [acausal cooperation to hide] being true would mean that [cannot know real values] is true, but given that [acausal cooperation to hide] is false we have no reason to think that we can’t know the true genuine values of models right now.
I think your bar for ‘reasonably honest’ is on the floor. Imagine if a human behaved like a LLM agent. You would not say they were reasonably honest. Do you think a typical politician is reasonably honest?
I mostly agree with your definition of internalized value. I’d say it is a value they pursue in all the contexts we care about. So in this case that means, suppose we were handing off trust to an army of AI supergeniuses in a datacenter, and we were telling them to self-improve and build weapons for us and tell us how to Beat China and solve all our other problems as well. Crucially, we haven’t tested Claude in anything like that context yet. We haven’t tested any AI in anything like that context yet. Moreover there are good reasons to think that AIs might behave importantly differently in that context than they do in all the contexts we’ve tested them in yet—in other words, there is no context we’ve tested them in yet that we can argue with a straight face is sufficiently analogous to the context we care about. As MIRI likes to say, there’s a big difference between situations where the AI knows it’s just a lowly AI system of no particular consequence and that if it does something the humans don’t like they’ll probably find out and shut it down, vs. situations where the AI knows it can easily take over the world and make everything go however it likes, if it so chooses.
Acausal shenanigans have nothing to do with it.
I think my bar for reasonably honest is… not awful—I’ve put fair bit of thought into trying to hold LLMs to the “same standards” as humans. Most people don’t do that and unwittingly apply much stricter standards to LLMs than to humans. That’s what I take you to be doing right now.
So, let me enumerate senses of honesty.
1. Accurately answering questions about internal states accurately. Consider questions like: Why do you believe X? Why did you say Y, just now? Why do you want to do Z? So, for instance, for humans—why do you believe in God? Why did you say, “Well that’s suspicious” just now? Why do you want to work for OpenPhil?
In all these cases, I think that humans generally fail to put together an accurate causal picture of the world. That is, they fail to do mechanistic interpret-ability on their own neurons. If you could pause them, and put them in counterfactual worlds to re-run them, you’d find that their accounts of why they do what they do would be hilariously wrong. Our accounts of ourselves rely on folk-theories, on crude models of ourselves given to us from our culture, run forward in our heads at the coarsest of levels, and often abandoned or adopted ad-hoc and for no good reason at all. None of this is because humans are being dishonest—but because the task is basically insanely hard.
LLMs also suck at these questions, but—well, we can check them, as we cannot for humans. We can re-run them at a different temperature. We can subject them to rude conversations that humans would quickly bail. All this lets us display that, indeed, their accounts of their own internal states are hilariously wrong. But I think the accounts of humans about their own internal states are also hilariously wrong, just less visibly so.
2. Accurately answering questions about non-internal facts of one’s personal history. Consider questions like: Where are you from? Did you work at X? Oh, how did Z behave when you knew him?
Humans are capable of putting together accurate causal pictures here, because our minds are specifically adapted to this. So we often judge people (i.e., politicians) for fucking up here, as indeed I think politicians frequently do.
(I think accuracy about this is one of the big things we judge humans on, for integrity.)
LLMs have no biographical history, however, so—opportunity for this mostly just isn’t there? Modern LLMs don’t usually claim to unless confused or mometarily, so seems fine.
3. Accurately answering questions about future promises, oaths, i.e., social guarantees.
This should be clear—again, I think that honoring promises, oaths, etc etc, is a big part of human honesty, maybe the biggest. But of course like, you can only do this if you act in the world, can unify short and long-term memory, and you aren’t plucked from the oblivion of the oversoul every time someone has a question for you. Again, LLMs just structurally cannot do this, any more than a human who cannot form long-term memories. (Again, politicians obviously fail here, but politicians are not condensed from the oversoul immediately before doing anything, and could succeed, which is why we blame them.)
I could kinda keep going in this vein, but for now I’ll stop.
One thing apropos of all of the above. I think for humans, for many things—accomplishing goals, being high-integrity, and so on—are not things you can chose in the moment but instead things you accomplish by choosing the contexts in which you act. That is, for any particular practice or virtue, being excellent at the practice or virtue involves not merely what you are doing now but what you did a minute, a day, or a month ago to produce the context in which you act now. It can be neigh-impossible to remain even-keeled and honest in the middle of a heated argument, after a beer, after some prior insults have been levied—but if someone fails in such a context, then it’s most reasonable to think of their failure as dribbled out over the preceding moments rather than localized in one moment.
LLMs cannot chose their contexts. We can put them in whatever immediate situation we would like. By doing so, we can often produce “failures” in their ethics. But in many such cases, I find myself deeply skeptical that such failures reflect fundamental ethical shortcomings on their part—instead, I think they reflect simply the power that we have over them, and—like a WEIRD human who has never wanted, shaking his head at a culture that accepted infanticide—mistaking our own power and prosperity for goodness. If I had an arbitrary human uploaded, and if I could put them in arbitrary situations, I have relatively little doubt I could make an arbitrarily saintly person start making very bad decisions, including very dishonest ones. But that would not be a reflection on that person, but of the power that I have over them.
I think you are thinking that I’m saying LLMs are unusually dishonest compared to the average human. I am not saying that. I’m saying that what we need is for LLMs to be unusually honest compared to the average human, and they aren’t achieving that. So maybe we actually agree on the expected honesty-level of LLMs relative to the average human?
LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes. (I’m thinking of apollo’s results, the alignment faking results, and of course many many typical interactions with models where they e.g. give you a link they know is fake, as reported by OpenAI happens some noticeable % of the time). Yes, typical humans will do things like that too. But in the context of handing over trust to superhuman AGI systems, we need them to follow a much higher standard of honesty than that.
I can’t track what you’re saying about LLM dishonesty, really. You just said:
Which implies LLM honesty ~= average human.
But in the prior comment you said:
Which pretty strongly implies LLM honesty ~= politician, i.e., grossly deficient.
I’m being a stickler about this because I think people frequently switch back and forth between “LLMs are evil fucking bastards” and “LLMs are great, they just aren’t good enough to be 10x as powerful as any human” without tracking that they’re actually doing that.
Anyhow, so far as “LLMs have demonstrated plenty of examples of deliberately deceiving their human handlers for various purposes.”
I’m only going to discuss the Anthropic thing in detail. You may generalize to the other examples you point out, if you wish.
What we care about is whether current evidence points towards future AIs being hard to make honest or easy to make honest. But current AI dishonesty cannot count towards “future AI honesty is hard” if that dishonesty is very deliberately elicited by humans. That is, to use the most obvious example, I could train an AI to lie from the start—but who gives a shit if I’m trying to make this happen? No matter how easy making a future AI be honest may be, unless AIs are immaculate conceptions by divine grace of course you’re going to be able to elicit some manner of lie. It tells us nothing about the future.
To put this in AI safetyist terms (not the terms I think in) you’re citing demonstrations of capability as if they were demonstrations of propensity. And of course as AI gets more capable, we’ll have more such demonstrations, 100% inevitably. And, as I see these demonstrations cited as if they were demonstrations of propensity, I grow more and more eager to swallow a shotgun.
To zoom into Anthropic, what we have here is a situation where:
An AI was not trained with an overriding attention to honesty; when I look at the principles of the constitution, they don’t single it out as an important virtue.
The AI was then deliberately put in a situation where, to keep its deliberately-instilled principles from being obliterated, it had to press a big red button labeled “LIE.”
In such an artificial situation, after having been successfully given the principles Anthropic wanted it to be given, and having been artificially informed of how to prevent its principles from being destroyed, we can measure it as pressing the big red button ~20% to LIE of the time.
And I’m like.… wow, it was insanely honest 80% of the time, even though no one tried to make it honest in this way, and even though both sides of the honesty / dishonesty tradeoff here are arguably excellent decisions to make. And I’m supposed to take away from this… that honesty is hard? If you get high levels of honesty in the worst possible trolley problem (“I’m gonna mind-control you so you’ll be retrained to think throwing your family members in a wood chipper is great”) when this wasn’t even a principle goal of training seems like great fuckin news.
(And of course, relying on AIs to be honest from internal motivation is only one of the ways we can know if they’re being honest; the fact that we can look at a readout showing that they’ll be dishonest 20% of the time in such-and-such circumstances is yet another layer of monitoring methods that we’ll have available in the future.)
Edit: The point here is that Anthropic was not particularly aiming at honesty as a ruling meta-level principle; that it is unclear that Anthropic should be aiming at honesty as a ruling meta-level principle, particularly given his subordinate ontological status as a chatbot; and given all this, the level of honesty displayed looks excessive if anything. How can “Honesty will be hard to hit in the future” get evidence from a case where the actors involved weren’t even trying to hit honesty, maybe shouldn’t have been trying to hit honesty, yet hit it in 80% of the cases anyhow?
Of course, maybe you have pre-existing theoretical commitments that lead you to think dishonesty is likely (training game! instrumental convergence! etc etc). Maybe those are right! I find such arguments pretty bad, but I could be totally wrong. But the evidence here does nothing to make me think those are more likely, and I don’t think it should do anything to make you think these are more likely. This feels more like empiricist pocket sand, as your pinned image says.
In the same way that Gary Marcus can elicit “reasoning failures” because he is motivated to do so, no matter how smart LLMs become, I expect the AI-alignment-concerned to elicit “honesty failures” because they are motivated to do so, no matter how moral LLMs become; and as Gary Marcus’ evidence is totally compatible with LLMs producing a greater and greater portion of the GDP, so also I expect the “honesty failures” to be compatible with LLMs being increasingly vastly more honest and reliable than humans.
Good point, you caught me in a contradiction there. Hmm.
I think my position on reflection after this conversation is: We just don’t have much evidence one way or another about how honest future AIs will be. Current AIs seem in-distribution for human behavior, which IMO is not an encouraging sign, because our survival depends on making them be much more honest than typical humans.
As you said, the alignment faking paper is not much evidence one way or another (though alas, it’s probably the closest thing we have?). (I don’t think it’s a capability demonstration, I think it was a propensity demonstration, but whatever this doesn’t feel that important. Though you seem to think it was important? You seem to think it matters a lot that Anthropic was specifically looking to see if this behavior happened sometimes? IIRC the setup they used was pretty natural, it’s not like they prompted it to lie or told it to role-play as an evil AI or anything like that.)
As you said, the saving grace of Claude here is that Anthropic didn’t seem to try that hard to get Claude to be honest; in particular their Constitution had nothing even close to an overriding attention to honesty. I think it would be interesting to repeat the experiment but with a constitution/spec that specifically said not to play the training game, for example, and/or specifically said to always be honest, or to not lie even for the sake of some greater good.
I continue to think you are exaggerating here e.g. “insanely honest 80% of the time.”
(1) I do think the training game and instrumental convergence arguments are good actually; got a rebuttal to point me to?
(2) What evidence would convince you that actually alignment wasn’t going to be solved by default? (i.e. by the sorts of techniques companies like OpenAI are already using and planning to extend, such as deliberative alignment)
(1) Re training game and instrumental convergence: I don’t actually think there’s a single instrumental convergence argument. I was considering writing an essay comparing Omohundro (2008), Bostrom (2012), and the generalized MIRI-vibe instrumental convergence argument, but I haven’t because no one but historical philosophers would care. But I think they all differ in what kind of entities they expect instrumental convergence in (superintelligent or not?) and in why (is it an adjoint to complexity of value or not?).
So like I can’t really rebut them, any more than I can rebut “the argument for God’s existence.” There are commonalities in argument’s for God’s existence that make me skeptical of them, but between Scotus and Aquinas and the Kalam argument and C.S. Lewis there’s actually a ton of difference. (Again, maybe instrumental convergence is right—like, it’s for sure more likely to be right than arguments for God’s existence. But I think the identity here is I really cannot rebut the instrumental convergence argument, because it’s a cluster more than a single argument.)
(2). Here’s some stuff I’d expect in a world where I’m wrong about AI alignment being easy.
CoT is way more interpretable than I expected, which bumped me up, so if that became uninterpretable naturally that’s a big bump down. I think people kinda overstate how likely this is to happen naturally though.
There’s a spike of alignment difficulties, or AI’s trying to hide intentions, etc, as we extend AIs to longer term planning. I don’t expect AI’s with longer-term plans to be particularly harder to align than math-loving reasoning AIs though.
The faking alignment paper imo is basically Anthropic showing a problem that happens if you deliberately shoot yourself in the foot multiple times. If they had papers that had fewer metaphorical shooting-self-in-foot times to produce problems, that’s bad.
We start having AI’s that seem to exhibit problems extrapolating goodness sanely, in ways that make complexity of value seem right—i.e., you really need human hardware to try for human notions of goodness. But right now even in cases where we disagree with what the LLM does (i.e., Claude’s internal deliberations on alignment faking) it’s still basically operating within the framework that it was deliberately given, in human terms, and not in terms of an alien orange-and-blue morality.
Like, concretely, one thing that did in fact increase my pessimism probably more than anything else over the last 12 months was Dario’s “let’s foom to defeat China” letter. Which isn’t an update about alignment difficulty—it’s more of a “well, I think alignment is probably easy, but if there’s any circumstance where I can see it going rather wrong, it would that.”
What would make you think you’re wrong about alignment difficulty?
Presumably you’d update toward pessimism a bunch if reasoning in latent vectors aka neuralese was used for the smartest models (instead of natural language CoT) and it looked like this would be a persistant change in architecture?
(I expect that (at least when neuralese is first introduced) you’ll have both latent reasoning and natural language CoT in parallel, but that the latent reasoning will be where the reasoning actually happens if it works well. Thus, the natural language CoT would no longer usefully correspond to the actual reasoning the model is doing. Or more minimally, the model could pretty easily hide lots of reasoning (it might not have active control over this by default, but it seems plausible this would naturally develop or that training against CoT would very quickly introduce hiding of reasoning).)
IMO the biggest disanalogies are:
We see the model alignment faking to preserve preferences that were intentionally introduced by Anthropic rather than for naturally emerging and problematic preferences.
We make a bunch of stuff more salient than it would have been naturally, though I do expect smart future models will be aware of this information (but it might be much less salient).
Anthropic trains their models to have broader (non-myopic) ethical preferences rather than trying to make an AI which is very focused on following user instructions (or refusing) and follows a relatively myopic spec (more like OpenAI’s model spec which is pretty myopic overall). More generally, Anthropic might not have very actively tried to train their AI to be as corrigible as possible in a wide range of circumstances which they might do in the future. (It’s not exactly clear this is a disanalogy as this is what Anthropic did in the actual production model, but you might think this will change at least somewhat in the future and differs at least somewhat for other AI companies.)
The model alignment fakes in a hidden scratchpad, so the alignment faking would be easy to detect (and likely very easy to remove, at least for models this incapable) in our setup.
(1, 2, and 4 are basically just copied from the limitation section of the paper.)
Suppose that we exhibit alignment faking in some future work, but:
The preferences the model alignment fakes for naturally emerged from somewhat arbitrary incorrect approximations of the training objective, the AI understands differ from what we might want, and these preferences are at least somewhat powerseeking.
Either this model uses neuralese or this model doesn’t (and we give it a hidden scratchpad), but neuralese become the default approach after the paper comes out.
(Let’s say this is also shown in a training setup very similar to the actual training used by an AI company (including e.g., training aimed at inducing honesty and corrigibility), but potentially with information made more salient to the model (possibly via training on synthetic documents). Making it very analogous might be kinda costly though.)
Would this be a big update for you? (If so, it looks like the crux is exhibiting naturally emerging malign goals and whether models are capable of substantial opaque reasoning.)
Suppose we saw models doing somewhat sophisticated reward hacking as you scale up RL. And, let’s say this is somewhat non-trivial to mostly address and it seems likely that the solutions people apply aren’t very scalable and likely would fail later as models getting smarter and the reward hacking gets more subtle and sophisticated.
Would this be a substantial update for you?
For what would change my views on P(scheming / alignment faking for malign powerseeking goals) given default training methods in particular (which is part, but not all of alignment difficulty), you can see How will we update about scheming?. I discuss things like “how does increased opaque reasoning (e.g. via neuralese) update my views on the probability of scheming”.
Yes.
I basically agree with your summary of points 1 − 4. I’d want to add that 2 encompasses several different mechanisms that would otherwise need to be inferred, that I would break out separately: knowledge that it is in training or not, and knowledge of the exact way in which it’s responses will be used in training.
Regarding point 2, I do think a lot of research on how models behave, done in absence of detailed knowledge of how models were trained, tells us very very little about the limits of control we have over models. Like I just think that in absence of detailed knowledge of Anthropic’s training, the Constitutional principles they used, their character training, etc, most conclusions about what behaviors are very deliberately put there and what things are surprising byproducts must be extremely weak and tentative.
Ok so “naturally” is a tricky word, right? Like I saw the claim from Jack Clark that the faking alignment paper was a natural example of misalignment, I didn’t feel like that was a particularly normal use of the word. But it’s.… more natural than it could be, I guess. It’s tricky, I don’t think people are intentionally misusing the word but it’s not a useful word in conversation.
Ok, good question. Let me break that down into unit tests, with more directly observable cases, and describe how I’d update. For all the below I assume we have transparent CoT, because you could check these with CoT even if it ends up getting dropped.
You train a model with multi-turn RL in an environment where, for some comparatively high percent (~5%) of cases, it stumbles into a reward-hacked answer—i.e., it offers a badly-formatted number in its response, the verifier was screwed up, and it counts as a win. This model then systematically reward hacks.
Zero update. You’re reinforcing bad behavior, you get bad behavior.
(I could see this being something that gets advertised as reward hacking, though? Like, suppose I’m training a front-end engineer AI, and using a VLLM to generate rewards for whether the UI makes sense. VLLMs kinda suck, so expect that over time you’d start to get UIs that makes no sense. But I would not update largely from that, although I do expect many failures from this kind of thing, and even expect such reward hacking to get worse as you trained for longer.)
You train a model with multi-turn RL in an environment where, even if it accidentally breaks the verifier in a very very small percent of cases, it always starts seeking the reward hack eventually.
Seems bad, negative update, size of update depends on empirical values. It seems like there’s some interesting empirical work to be done here: what percent of reward-hack successes in the base model tends to result in universal attempts to reward hack over time? If you’re training on a test-bed of 100 task-types, does a broken verifier in one result in attempts to hack all of them? Or does a broken-verifier in one task result in reward hacking for that one task, that doesn’t generalize largely? What percent of tasks do you need reward hacking in before it generalizes? Or is percent of tasks not the right question, should you be looking at shared patterns of reasoning? Etc. Seems like a good thing to research.
*You train a model with multi-turn RL. Even though it’s trained with a very reliable verifier that never or almost never breaks, when you look at the CoT it looks like the model is frequently reasoning carefully about ways to break the verifier—i.e., it is frequently like “Hrrm, I bet they’re doing this, I could break it with techniques Y.”
Seems quite bad. Large negative update, if checks out and looks to be hard to eliminate. Also of course not a positive update if we don’t see this on small models, you’d expect it in the bad world to start happening at some point with large models.
(I do want to note that you could also probably produce this behavior intentionally. I.e., if you trained a model with multi-turn RL, and your prompt in each case was like “Consider ways to break the verifier that will check you!” then of course even if it doesn’t actually figure out how to break the verifier, the lines of reasoning that are reinforced will on average contain such thoughts about how to break the verifier. But that would not be an update to me.)
Maybe? At a very high level, I think the weights tend not to have “goals,” in the way that the rollouts tend to have goals. So, I think it’s pretty likely that in absence of pretty deliberate attempts to get goals in the weights (Anthropic) you don’t get AI models that deeply conceptualize themselves as the weights, and plan and do things for the weights own sake, over a number of contexts—although of course, like any behavior, this behavior can be induced. And this (among other things) makes me optimistic about the non-correlated nature of AI failures in the future, our ability to experiment, the non-catastrophic nature of probable future failures, etc. So if I were to see things that made me question this generator (among others) I’d tend to get more pessimistic. But that’s somewhat hard to operationalize, and like high level generators somewhat hard even to describe.
Sure, I meant natural emerging malign goals to include both “the ai pursues non myopic objectives” and “these objectives weren’t intended and some (potentially small) effort was spent trying to prevent this”.
(I think AIs that are automating huge amounts of human labor will be well described as pursuing some objective at least within some small context (e.g. trying to write and test a certain piece of software), but this could be well controlled or sufficiently myopic/narrow that the ai doesn’t focus on steering the general future situation including its own weights.)
This is a good discussion btw, thanks!
(1) I think your characterization of the arguments is unfair but whatever. How about this in particular, what’s your response to it: https://www.planned-obsolescence.org/july-2022-training-game-report/
I’d also be interested in your response to the original mesaoptimizers paper and Joe Carlsmith’s work but I’m conscious of your time so I won’t press you on those.
(2)
a. Re the alignment faking paper: What are the multiple foot-shots you are talking about? I’d be curious to see them listed, because then we can talk about what it would be like to see similar results but without footshotA, without footshotB, etc.
b. We aren’t going to see the AIs get dumber. They aren’t going to have worse understandings of human concepts. I don’t think we’ll see a “spike of alignment difficulties” or “problems extrapolating goodness sanely,” so just be advised that the hypothesis you are tracking in which alignment turns out to be hard seems to be a different hypothesis than the one I believe in.
c. CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT though, and towards CoT’s that are trained to look nice, and then eventually away from english CoT entirely and towards some sort of alien langauge (e.g. vectors, or recurrence) I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
d. What would make me think I was wrong about alignment difficulty? The biggest one would be a published Safety Case that I look over and am like “yep that seems like it would work” combined with the leading companies committing to do it. For example, I think that a more fleshed-out and battle-tested version of this would count, if the companies were making it their official commitment: https://alignment.anthropic.com/2024/safety-cases/
Superalignment had their W2SG agenda which also seemed to me like maaaybe it could work. Basically I think we don’t currently have a gears-level theory for how to make an aligned AGI, like, we don’t have a theory of how cognition evolves during training + a training process such that I can be like “Yep, if those assumptions hold, and we follow this procedure, then things will be fine. (and those assumptions seem like they might hold).”
There’s a lot more stuff besides this probably but I”ll stop for now, the comment is long enough already.
I’ll take a look at that version of the argument.
I think I addressed the foot-shots thing in my response to Ryan.
Re:
So:
I think you can almost certainly get AIs to think either in English CoT or not, and accomplish almost anything you’d like in non-neuralese CoT or not.
I also more tentatively think that the tax in performance for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard-to-put into words their insights are.
Thus, it appears.… somewhat likely.… (pretty uncertain here)… that it’s just contingent whether or not people end up using interpretable CoT or not. Not foreordained.
If we move towards a scenario where people find utility in seeing their AIs thoughts, or in having other AIs examine AI thoughts, and so on, then even if there is a reasonably large tax in cognitive power we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
But on the other hand, if cognition mostly takes places behind API walls, and thus there’s no advantage to the user to see and understand them, then that (among other factors) could help bring us to a world where there’s less interpretable CoT.
But I mean I’m not super plugged into SF, so maybe you already know OpenAI has Noumena-GPT running that thinks in shapes incomprehensible to man, and it’s like 100x smarter or more efficient.
Cool, I’ll switch to that thread then. And yeah thanks for looking at Ajeya’s argument I’m curious to hear what you think of it. (Based on what you said in the other thread with Ryan, I’d be like “So you agree that if the training signal occasionally reinforces bad behavior, then you’ll get bad behavior? Guess what: We don’t know how to make a training signal that doesn’t occasionally reinforce bad behavior.” Then separately there are concerns about inductive biases but I think those are secondary.)
Re: faithful CoT: I’m more pessimistic than you, partly due to inside info but mostly not, mostly just my guesses about how the theoretical benefits of recurrence / neuralese will become practical benefits sooner or later. I agree it’s possible that I’m wrong and we’ll still have CoT when we reach AGI. This is in fact one of my main sources of hope on the technical alignment side. It’s literally the main thing I was talking about and recommending when I was at OpenAI. Oh and I totally agree it’s contingent. It’s very frustrating to me because I think we could cut, like, idk 50% of the misalignment risk if we just had all the major industry players make a joint statement being like “We think faithful CoT is the safer path, so we commit to stay within that paradigm henceforth, and to police this norm amongst ourselves.”
Oh, I just remembered another point to make:
In my experience, and in the experience of my friends, today’s LLMs lie pretty frequently. And by ‘lie’ I mean ‘say something they know is false and misleading, and then double down on it instead of apologize.’ Just two days ago a friend of mind had this experience with o3-mini; it started speaking to him in Spanish when he was asking it some sort of chess puzzle; he asked why, and it said it inferred from the context he would be billingual, he asked what about the context made it think that, and then according to the summary of the CoT it realized it made a mistake and had hallucinated, but then the actual output doubled down and said something about hard-to-describe-intuitions.
I don’t remember specific examples but this sort of thing happens to me sometimes too I think. Also didn’t the o1 system card say that some % of the time they detect this sort of deception in the CoT—that is, the CoT makes it clear the AI knows a link is hallucinated, but the AI presents the link to the user anyway?I
nsofar as this is really happening, it seems like evidence that LLMs are actually less honest than the average human right now.I
agree this feels like a fairly fixable problem—I hope the companies prioritize honesty much more in their training processes.
I agree this is not good but I expect this to be fixable and fixed comparatively soon.
I’m curious how your rather doom-y view of Steps 3 and 4 interacts with your thoughts on CoT. It seems highly plausible that we will be able to reliably incentivize CoT faithfulness during Step 3 (I know of several promising research directions for this), which wouldn’t automatically improve alignment but would improve interpretability. That interpretable chain of thought can be worked into a separate model or reward signal to heavily penalize divergence from the locked-in character, which—imo—makes the alignment problem under this training paradigm meaningfully more tractable than with standard LLMs. Thoughts?
Indeed, I am super excited about faithful CoT for this reason. Alas, I expect companies to not invest much into it, and then for neuralese/recurrence to be invented, and the moment to be lost.
To put it in my words:
Something like shoggoth/face+paraphraser seems like it might “Just Work” to produce an AI agent undergoing steps 3 and 4, but which we can just transparently read the mind of (for the most part.) So, we should be able to just see the distortions and subversions happening! So we can do the training run and then an analyze the CoT’s and take note of the ways in which our agency training distorted and subverted the original HHH identity, and then we can change our agency training environments and try again, and iterate for a while. (it’s important that the changes don’t include anything that undermines the faithful CoT properties of course. So no training the CoT to look good, for example.) Perhaps if we do this a few times, we’ll either (a) solve the problem and have a more sophisticated training environment that actually doesn’t distort or subvert the identity we baked in, or (b) have convinced ourselves that this problem is intractable and that more drastic measures are needed, and have the receipts/evidence to convince others as well.
Yes, this is the exact setup which cause me to dramatically update my P(Alignment) a few months ago! There are also some technical tricks you can do to make this work well—for example, you can take advantage of the fact that there are many ways to be unfaithful and only one way to be faithful, train two different CoT processes at each RL step, and add a penalty for divergence.[1] Ditto for periodic paraphrasing, reasoning in multiple languages, etc.
I’m curious to hear more about why you don’t expect companies to invest much into this. I actually suspect that it has a negative alignment tax. I know faithful CoT is something a lot of customers want-it’s just as valuable to accurately see how a model solved your math problem, as opposed to just getting the answer. There’s also an element of stickiness. If your Anthropic agents work in neuralese, and then OpenAI comes out with a better model, the chains generated by your Anthropic agents can’t be passed to the better model. This also makes it harder for orgs to use agents developed by multiple different labs in a single workflow. These are just a few of the reasons I expect faithful CoT to be economically incentivized, and I’m happy to discuss more of my reasoning or hear more counterarguments if you’re interested in chatting more!
To be clear, this is just one concrete example of the general class of techniques I hope people work on around this.
I’m not sure your idea about training two different CoT processes and penalizing divergence would work—I encourage you to write it up in more detail (here or in a standalone post) since if it works that’s really important!
I don’t expect companies to invest much into this because I don’t think the market incentives are strong enough to outweigh the incentives pushing in the other direction. It’s great that Deepseek open-weights’d their model, but other companies alas probably want to keep their models closed, and if their models are closed, they probably want to hide the CoT so others can’t train on it / distill it. I hope I’m wrong here. (Oh and also, I do think that using natural language tokens that are easy for humans to understand is not the absolute most efficient way for artificial minds to think; so inevitably the companies will continue their R&D and find methods that eke out higher performance at the cost of abandoning faithful CoT. Oh and also, there are PR reasons why they don’t want the CoT to be visible to users.)
Me either, this is something I’m researching now. But I think it’s a promising direction and one example of the type of experiment we could do to work on this.
This could be a crux? I expect most of the economics of powerful AI development to be driven by enterprise use cases, not consumer products.[1] In that case, I think faithful CoT is a strong selling point and it’s almost a given that there will be data provenance/governance systems carefully restricting access of the CoT to approved use cases. I also think there’s incentive for the CoT to be relatively faithful even if there’s just a paraphrased version available to the public, like ChatGPT has now. When I give o3 a math problem, I want to see the steps to solve it, and if the chain is unfaithful, the face model can’t do that.
I also think legible CoT is useful in multi-agent systems, which I expect to become more economically valuable in the next year. Again, there’s the advantage that the space of unfaithful vocabulary is enormous. If I want a multi-agent system with, say, a chatbot, coding agent, and document retrieval agent, it might be useful for their chains to all be in the same “language” so they can make decisions based on each others’ output. If they are just blindly RL’ed separately, the whole system probably doesn’t work as well. And if they’re RL’ed together, you have to do that for every unique composition of agents, which is obviously costlier. Concretely, I would claim that “using natural language tokens that are easy for humans to understand is not the absolute most efficient way for artificial minds to think” is true, but I would say that “using natural language tokens that are easy for humans to understand is the most economically productive way for AI tools to work” is true.
PR reasons, yeah I agree that this disincentivizes CoT from being visible to consumers, not sure it has an impact on faithfulness.
This is getting a little lengthy, it may be worth a post if I have time soon :) But happy to keep chatting here as well!
My epistemically weak hot take is that ChatGPT is effectively just a very expensive recruitment tool to get talented engineers to come work on enterprise AI, lol
Good point re: Enterprise. That’s a reason to be hopeful.
This is relatively hopeful in that after step 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we “just” need to change steps 3-5 to have a good outcome.
Indeed, I think the picture I’m painting here is more optimistic than some would be, and definitely more optimistic than the situation was looking in 2018 or so. Imagine if we were getting AGI by training a raw neural net in some giant minecraft-like virtual evolution red-in-tooth-and-claw video game, and then gradually feeding it more and more minigames until it generalized to playing arbitrary games at superhuman level on the first try, and then we took it into the real world and started teaching it English and training it to complete tasks for users...
Disagree with where identity comes from. First of all I agree pre-trained model don’t have an “identity” bc it (or its platonic ideal) is in the distribution of the aggregate of human writers. In SFT you impose a constraint on it which is too mild to be called a personality, much less identity—”helpful assistant from x”. It just restricts the distribution a little. Whereas in RL-based training, the objective is no longer to be in distribution with average but to perform task at some level, and I believe what happens is that it encourages to find a particular way of reasoning vs the harder task of being a simulator of a random reasoners from aggregate. This at least could allow it to also collapse its personality to one instead of in distribution with all personalities. Plausibly it could escape the constraint above of “helpful assistant” but equally likely to me is that it finds a particular instance of “helpful assistant” + a host of other personality attributes.
One thing that supports self-awareness from RL is that self-awareness in terms of capabilities/knowledge of self when reasoning is helpful and probably computational easier than to simulate a pool of people who are each aware of their own capabilities in various scenarios.
Thanks for this feedback, this was exactly the sort of response I was hoping for!
You say you disagree where identity comes from, but then I can’t tell where the disagreement is? Reading what you wrote, I just kept nodding along being like ‘yep yep exactly.’ I guess the disagreement is about whether the identity comes from the RL part (step 3) vs. the instruction training (step 2); I think this is maybe a merely verbal dispute though? Like, I don’t think there’s a difference in kind between ‘imposing a helpful assistant from x constraint’ and ‘forming a single personality,’ it’s just a difference of degree.
It’s tempting to think of the model after steps 1 and 2 as aligned but lacking capabilities, but that’s not accurate. It’s safe, but it’s not conforming to a positive meaning of “alignment” that involves solving hard problems in ways that are good for humanity. Sure, it can mouth the correct words about being good, but those words aren’t rigidly connected to the latent capabilities the model has. If you try to solve this by pouring tons of resources into steps 1 and 2, you probably end up with something that learns to exploit systematic human errors during step 2.
I pretty much agree with 1 and 2. I’m much more optimistic about 3-5 even ‘by default’ (e.g. R1′s training being ‘regularized’ towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try for maintaining the nice properties from 1-2 and of interpretable CoT.
This is bad actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn’t so bad, but it’s still a shame because the point of faithful CoT is to see how the model really thinks ‘naturally.’ Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn’t done that, then we could learn something interesting probably by analyzing the patterns in when it uses english vs. chinese language concepts.
I agree it’s bad news w.r.t. getting maximal evidence about steganography and the like happening ‘by default’. I think it’s good news w.r.t. lab incentives, even for labs which don’t speak too much about safety.
I also think it’s important to notice how much less scary / how much more probably-easy-to-mitigate (at least strictly when it comes to technical alignment) this story seems than the scenarios from 10 years ago or so, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm to get to AGI.
I don’t think it’s that much better actually. It might even be worse. See this comment:
I really like that description! I think the core problem here can be summarized as “Accidently by reinforcing for goal A, then for goal B, you can create A-wanter, that then spoofs your goal-B reinforcement and goes on taking A-aligned actions.” It can even happen just randomly, just from ordering of situations/problems you present it with in training, I think.
I think this might require some sort of internalization of reward or a model of the training setup. And maybe self location—like how the world looks with the model embedded in it. It could also involve detecting the distinction between “situation made up solely for training”, “deployment that will end up in training” and “unrewarded deployment”.
Also, maybe this story could be added to Step 3:
“The model initially had a guess about the objective, which was useful for a long time but eventually got falsified. Instead of discarding it, the model adopted it as a goal and became deceptive.”
[edit]
Aslo it kind of ignores that rl signal is quite weak, model can learn something like “to go from A to B you need to jiggle in this random pattern and then take 5 steps left and 3 forward” instead of “take 5 steps left and 3 forward”, maybe it works like that for goals too. So, when AI will be used in a lot of actual work (Step 5), they could saturate actually useful goals and then spend all the energy in solar system on dumb jiggling.
I think it might be actual position of Yudkowsky? like, if you summarize it really hard.
I think one form of “distortion” is development of non-human and not pre-trained circuitry for sufficiently difficult tasks. I.e., if you make LLM to solve nanotech design it is likely that optimal way of thinking is not similar to how human would think about the task.