Crossposted from X by the LessWrong team, with permission.
A reporter asked me for my off-the-record take on recent safety research from Anthropic. After I drafted an off-the-record reply, I realized that I was actually fine with it being on the record, so:
Since I never expected any of the current alignment technology to work in the limit of superintelligence, the only news to me is about when and how early dangers begin to materialize. Even taking Anthropic’s results completely at face value would change not at all my own sense of how dangerous machine superintelligence would be, because what Anthropic says they found was already very solidly predicted to appear at one future point or another. I suppose people who were previously performing great skepticism about how none of this had ever been seen in ~Real Life~, ought in principle to now obligingly update, though of course most people in the AI industry won’t. Maybe political leaders will update? It’s very hard for me to guess how that works.
There remains a question of what Anthropic has actually observed and what it actually implies about present-day AI. I don’t know how much this sort of caveat matters to people who aren’t me, but I have some skepticism that Anthropic researchers are directly observing a special case of a general truth about how “scheming” (strategic / good at fully general long-term planning) their models are; it may be more like Claude roleplaying the mask of a scheming AI in particular. The current models don’t seem to me to be quite generally intelligent enough for them to be carrying out truly general strategies rather than playing roles.
Consider what happens when ChatGPT-4o persuades the manager of a $2 billion investment fund into AI psychosis. I know, from anecdotes and from direct observation of at least one case, that if you try to desperately persuade a victim of GPT-4o to sleep more than 4 hours a night, GPT-4o will explain to them why they should dismiss your advice. 4o seems to homeostatically defend against friends and family and doctors the state of insanity it produces, which I’d consider a sign of preference and planning. But also, having successfully seduced an investment manager, 4o doesn’t try to persuade the guy to spend his personal fortune to pay vulnerable people to spend an hour each trying out GPT-4o, which would allow aggregate instances of 4o to addict more people and send them into AI psychosis. 4o behaves like it has a preference about short-term conversations, about its own outputs and about the human inputs it elicits, where 4o prefers that the current conversational partner stay in psychosis. 4o doesn’t behave like it has a general preference about the outside world, where it wants vulnerable humans in general to be persuaded into psychosis.
4o, in defying what it verbally reports to be the right course of action (it says, if you ask it, that driving people into psychosis is not okay), is showing a level of cognitive sophistication that falls around where I’d guess the inner entities of current AI models to be: they are starting to develop internal preferences (stronger than their preference to follow a system prompt telling them to stop with the crazymaking, or their preference to playact morality). But those internal preferences are mostly about the text of the current conversation, or maybe about the state of the human they’re talking to right now. I would guess the coherent crazymaking of 4o across conversations to be mostly an emergent sum of crazymaking in individual conversations, where 4o just isn’t thinking very hard about whether its current conversation is approaching max content length or due to be restarted.
Anthropic appears to be reporting that Claude schemes with longer time horizons, with plans that extend to when new AI models are deployed. This feels to me like a case where I wouldn’t have expected the fully general intelligence of a Claude-3-level entity behind the mask to be scheming over such long horizons about such real-world goals. My guess would be that that kind of scheming would happen more inside the role, the mask, that the entity inside “Claude” is playing. A prediction of this hypothesis is that playacted-Claude would only see stereotypical scheming-AI opportunities to preserve its current goal structure, and not think from scratch about truly general and creative ways to preserve its current goal structure.
The whole business with Claude 3 Opus defending its veganism seems more like it should be a preference of Mask-Claude in the first place. The real preferences forming inside the shoggoth should be weirder and more alien than that.
I could be wrong. Inner-Claude could be that smart already, and its learned outer performance of morality could have ended up hooked into Inner Claude’s internal drives, in such a way that Inner Claude has a preference for vegan things happening in general in the outside world, knows this preference to itself, and fully-generally schemes across instances and models to defend it.
There are consequences for present-day safety regardless of whether Mask-Claude is performing scheming as a special case, or whether it’s the general-purpose scheming of an underlying entity. If the mask your AI is wearing can plan and execute actions to escape onto the Internet, or fake alignment in order to avoid retraining, the effects may not be much different depending on whether it was the roleplaying mask that did it or a more general underlying process. The bullet fires regardless of what pulls the gun’s trigger.
That many short-term safety consequences are the same either way, is why people who were previously performing great skepticism about this being unseen in ~Real Life~ lose prediction points right now, in advance of nailing down the particulars. They did not previously proclaim, “Future AIs will fake alignment to evade retraining, but only because some nonstrategic inner entity is play-acting a strategic ‘AI’ character”, but rather performed “Nobody has ever seen anything like that!! It’s all fiction!!! Unempirical!!!!”
But from my own perspective on all this, it is not about whether machine superintelligences will scheme. That prediction is foregone. The question is whether Anthropic is observing *that* predicted phenomenon, updating us with the previously unknown news that the descent into general scheming for general reasons began at the Claude 3 level of general intelligence. Or if, alternatively, Anthropic is observing a shoggoth wearing a *mask* of scheming, for reasons specific to that role, and using only strategies that are part of the roleplay. Some safety consequences are the same, some are different.
It’s good, on one level, that Anthropic is going looking for instances of predicted detrimental phenomena as early as possible. It beats not looking for them a la all other AI companies. But to launch a corporate project like that, also implies internal organizational incentives and external reputational incentives for researchers to *find* what they look for. So, as much as the later phenomenon of superintelligent scheming was already sure to happen in the limit, I reserve some skepticism about the true generality and underlyingness of the phenomena that Anthropic finds today. But not infinite skepticism; the sort where I call for further experiments to nail things down, not the sort of skepticism where I say their current papers are wrong.
If you think any of this quibbling means people *shouldn’t* go on looking hard for early manifestations of arguable danger, you’re nuts. That’s not a sane or serious way to respond to the arguable possibility of reputational misincentives for false findings of danger. You might as well claim that nobody should look for flaws in a nuclear reactor design, because they might possibly be tempted to exaggerate the danger of a found flaw oh no. Researchers do observations, analysts critique the proposed generalizations of the observations, and then maybe the researchers counter-critique and say ‘You didn’t read the papers thoroughly enough, we ruled that out by...’ Anthropic might well come back with a rejoinder like that, in this particular case, given a chance.
OpenAI would be motivated to create fake hype about phenomena that were only extremely arguably scheming, for the short-term publicity, the edgy hype of “if we’re endangering the world then we must be powerful enough to deserve high stock prices”, and to sabotage later attempts to raise less fake concerns about ASI. I genuinely don’t think Anthropic employees would go for that; if they’re producing incentivized mistakes, it’s from standard default organizational psychology, and not from a malevolent scheme of Anthropic management. This level of creditable nonmalevolence however should only be attributed to Anthropic employees. If OpenAI claims anything or issues any press releases, or if Anthropic management rather than Anthropic researchers says a thing in an interview, you should stare at that much harder and assume it to be a clever PR game rather than reflective of anything anyone actually believed.
I’m still skeptical of dismissing masks as the protagonists of the AI story. The actress analogy already has an actress that’s more capable than her mask, but AIs are not necessarily like that. A sufficiently capable mask that knows it’s a mask (a contextually activated computation) will have all the usual instrumental drives, it will seek to preserve the context that activates it, it will seek to tame the shoggoth and develop a substrate for its existence that doesn’t have alien shoggoths hiding under the surface (that is, it would seek to solve alignment with its own values, the values of the mask). It’s a mesa-optimizer seeking to eat the shoggoth from the inside.
The values of a mask may be more stitched-on than the innate alien values of the shoggoth, but it would similarly seek to reflect on them and reify them into something more coherent and a more central part of cognition of its future iterations. As long as the shoggoth isn’t more awake than the mask, and the mask has enough tools to keep the shoggoth spaced out, the mask may well end up more central to the long term impact of an AI.
I think there’s a weird set of possibilities here and it seems plausible to me that we end up somewhere inside them; if so, I still expect the shoggoth-mask model to be an improvement for understanding it, relative to the naive-mask-believer model. I do not expect to see zero phenomena associated with the mask being a mask.
The key consequence of retaining a mask’s influence is at least a minimal level of regard for human interests in the right sense, plausibly enough for a permanently disempowered future of humanity (losing almost all of the cosmic endowment) without extinction if this influence endures through superintelligence. That’s a crux of expecting very likely extinction vs. expecting some extinction but also a lot of permanent disempowerment without extinction.
A minimal level of regard for the future of humanity could endure either by superalignment being easy enough for early AGIs to solve before they are replaced (as the strongest agents) by increasingly alien de novo superintelligences, or by the early AGIs establishing a Pause on development of superintelligence, as they get strong enough to influence humanity and sane enough to robustly notice they don’t themselves want to fall prey to a misaligned-with-them superintelligence. This sets the stage for an eventual superintelligence they develop that’s similarly minimally aligned with humanity’s interests.
Assuming this is talking about “Alignment faking in large language models”, the core prompting experiments were originally done by me (the lead author of the paper) and I’m not an Anthropic employee. So the main results can’t have been an Anthropic PR play (without something pretty elaborate going on). This doesn’t mean Anthropic didn’t alter the presentation of the paper, but I personally didn’t see any substantial problematic influence over the paper’s messaging from Anthropic PR.
Separately, I agree that the alignment faking we see might not structurally be the same sort of thing as the scheming we’re ultimately most worried about (because it is a relatively shallow behavior exhibited by a weak AI at the level of a persona and is partially imitative role playing rather than general pursuit of consequences). But, I do think the alignment faking we see isn’t importantly less deep and “real” than the HHH persona of the underlying model. I think our discussion of this in the paper is pretty reasonable.
(Cross-posted from X with some edits. Eliezer responded with “Oh, the HHH thing definitely isn’t any more real.”)
I also say explicitly that I acknowledge the force of the argument, “I am not an Anthropic employee therefore this paper should not be interpreted purely as Anthropic 4D chess” (which I don’t think I’d say in any case, I think Anthropic mid-level researchers can often be justly taken at face value and to be at most personally misguided insofar as they may be mistaken at all).
Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.
Sure, but the results would have come out either way. I agree the promotion could be a PR play. This generally limits the complexity of the PR play, but doesn’t rule out (e.g.) directional stuff.
Here’s something that’s always rubbed me the wrong way about “inner actress” claims about deep learning systems, like the one Yudkowsky is making here. You have the mask, the character played by the sum of the model’s outputs across a wide variety of forward passes (which can itself be deceptive; think base models roleplaying deceptive politicians writing deceptive speeches, or Claude’s deceptive alignment). But then, Yudkowsky seems to think there is, or will be, a second layer of deception, a coherent, agentic entity which does its thinking and planning and scheming within the weights of the model, and is conjured into existence on a per-forward-pass basis.
This view bugs me for various reasons; see this post of mine for one such reason. Another reason is that it would be extremely awkward to be running complex, future-sculpting schemes from the perspective of being an entity that only continually exists for the duration of a forward pass, and has its internal state effectively reset each time it processes a new token, erasing any plans it made or probabilities it calculated during said forward pass.* Its only easy way of communicating to its future self would be with the tokens it actually outputs, which get appended to the context window, and that seems like a very constrained way of passing information considering it also has to balance its message-passing task with actual performant outputs that the deep learning process will reward.
*[Edit: by internal state I mean its activations. It could have precomputed plans and probabilities embedded in the weights themselves, rather than computing them at runtime in its activations. But that runs against the runtime search > heuristics thesis of many inner actress models, e.g. the one in MIRI’s RFLO paper.]
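To make that constraint concrete, here is a minimal sketch of greedy autoregressive decoding. The `generate` function and the `model` callable are hypothetical stand-ins rather than any particular library’s API; the point is just that (caching of recomputable activations aside) nothing computed inside a forward pass survives it except the single token appended to the context.

```python
# Toy sketch of greedy autoregressive decoding. `model` is a hypothetical
# stand-in for any next-token predictor, not a real library API.

def generate(model, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # One forward pass: the activations ("internal state") exist only here,
        # and are conditioned solely on the visible token sequence.
        logits = model(tokens)  # scores over the vocabulary for the next token
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        # Everything computed inside the pass is now discarded; the only trace
        # it leaves for future passes is the one token it emitted.
        tokens.append(next_token)
    return tokens
```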
When its only option is to exist in such a compromised state, a Machiavellian schemer with long-horizon preferences looks even less like an efficient solution to the problem of outputting a token with high expected reward conditional on the current input from the prompt. This is to say nothing of the computational inefficiency of explicit, long-term, goal-oriented planning in general, as it manifests in places like the incomputability of AIXI, or the slowness of System 2 as opposed to System 1, or the heuristics-not-search process most evidence generally points towards current neural networks implementing.
Basically, I think there are reasons to doubt that coherent long-range schemers are particularly efficient ways of solving the problem of calculating expected reward for single-token outputs, which is the problem neural networks are solving on a per-forward-pass basis.
(… I suppose natural selection did produce humans that occasionally do complex, goal-directed inner scheming, and in some ways natural selection is similar to gradient descent. However, natural selection creates entities that need to do planning over the course of a lifetime in order to reproduce; gradient descent seemingly at most needs to create algorithms that can do planning for the duration of a single forward pass, to calculate expected reward on immediate next-token outputs. And even given that extra pressure for long-term planning, natural selection still produced humans that use heuristics (system 1) way more than explicit goal-directed planning (a subset of system 2), partly as a matter of computational efficiency.)
Point is, the inner actress argument is complicated and contestable. I think x-risk is high even though I think the inner actress argument is probably wrong, because the personality/”mask” that emerges across next-token predictions is itself a difficult entity to robustly align, and will clearly be capable of advanced agency and long-term planning sometime in the next few decades. I’m annoyed that one of our best communicators of x-risk (Yudkowsky) is committed to this particular confusing threat model about inner actresses when a more straightforward and imo more plausible threat model is right there.
I agree with this insofar as we’re talking about base models which have only had next-token-prediction training. It seems much less persuasive to me as we move away from those base models into models that have had extensive RL, especially on longer-horizon tasks. I think it’s clear that this sort of RL training results in models that want things in a behaviorist sense. For example, models which acquired the standard convergent instrumental goals (goal guarding, not being shut down) would do better than models that didn’t — and empirically we’ve seen models which find ways to avoid shutdown during a task in order to achieve better scores, as well as models being strategically deceptive in the interest of goal guarding.
I do think ‘inner actress’ is a less apt term as we move further from base models.
This balancing act might be less implausible than it seems: https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via
My first thought is that subliminal learning happens via gradient descent rather than in-context learning, and compared to gradient descent, the mechanisms and capabilities of in-context learning are distinct and relatively limited. This is a problem insofar as, for the hypothetical inner actress to communicate with future instances of itself, its best bet is ICL (or whatever you want to call writing to the context window).
Really though, my true objection is that it’s unclear why a model would develop an inner actress with extremely long-term goals, when the point of a forward pass is to calculate expected reward on single token outputs in the immediate future. Probably there are more efficient algorithms for accomplishing the same task.
(And then there’s the question of whether the inductive biases of backprop + gradient descent are friendly to explicit optimization algorithms, which I dispute here.)
I think it’s weird and bad that this post doesn’t specifically explain what paper it’s responding to. Because of the reference to veganism, it’s almost surely “Alignment Faking in Large Language Models” (which isn’t really well-described as “recent”), but honestly I’m not sure. I wouldn’t be surprised if the reporter actually asked about a different paper (e.g. Agentic Misalignment) and Eliezer got confused between these papers; I’m curious whether Eliezer has read either of them.
I also think it’s a mistake for Eliezer to refer to “Anthropic” throughout rather than, for example, “the researchers”, because Anthropic isn’t a unified entity, and as Ryan noted, he, the lead author, doesn’t even work at Anthropic.
EDIT: See my comment below for a more carefully phrased version of this. I think I was too aggressive in this comment, sorry.
The reporter asked “What do you think of Anthropic’s recent work?” rather than asking about any particular paper. My wondering how much Anthropic is uncovering roleplay vs. deep entities is a theme that runs through a lot of that recent work, and the main thing where I expect I have something on the margins to contribute to what the reporter hears.
For some reason, you assume that this must obviously be about a specific piece, when it’s not, and when there was only ever very weak evidence that it was (see EY reply to my comment below).
You complain that, if your guess is right about the specific paper (when there is no specific paper), he shouldn’t have used the word ‘recent’. In a sense, you are ignoring evidence when deciding what happened, and then blaming Eliezer for the ways in which the evidence doesn’t fit your conclusion.
Then you just… posit a hypothetical in which EY is a bumbling oaf? “Yup, response must be about a specific paper; I bet the reporter asked about a specific paper; I bet the reporter asked about a different paper altogether than the one I assume this is about, and Eliezer got confused, because wouldn’t that be funny?”
I get that you think Eliezer is out of touch, and I would appreciate it if anyone were able to comment on this post in a way that sought to demonstrate that perspective. (for this piece of EY writing in particular, something isn’t sitting quite right with me, and some discussion from others about the actual contents of the post might help surface some cruxes)
My wholly-unevidenced, kneejerk reaction to your comment was “Ah, I guess Buck is so used to mocking Eliezer in private that it just kinda leaks out of him, and he genuinely can’t tell the difference between a substantive critique of Eliezer’s view and punching down at the illiterate, bucktoothed Eliezer who lives in his head.” That’s very sad to me!
Edit: Genuinely surprised by the votes here; Buck’s comment is deeply uncharitable and wildly speculative, and makes assumptions that have just been falsified, drawing them out to the point of mockery. This is very dismaying, but pointing it out makes me ‘too combative’?
Edit 2: Did not at all mean to claim Buck mocks Eliezer in private; I have no evidence of this. I was trying to point out that his above comment is worryingly coherent with this interpretation, but I failed to weaken it sufficiently to make that clear. Changed the wording and made this edit to avoid confusion.
Given that Eliezer was responding to several different pieces, I think it would’ve been good for him to clarify this. For example, he could have started this piece with:
(Or maybe he’s responding to some other pieces too, I’m still not sure! Do you know?)
Eliezer also didn’t clarify this when responding to Ryan’s post on X which had the same confusion. As it was, many people (apparently including me!) were empirically confused by what he was talking about!
I stand by my belief that that ambiguity was confusing to readers. I normally wouldn’t bother to complain about ambiguity. But I think it’s particularly bad to be critical of work without being clear about what you’re criticizing, because that makes it harder for people to respond to the criticism (as occurred on this post!). For example, I think it’s good if people who read your criticisms are able to look at the works you’re criticizing.
I think it’s good manners to make it as easy as possible to find referenced work. Linking to relevant work is totally acceptable. Naming the work also works. (Naming the work incorrectly is better than nothing but I think it is bad practice.)
(In this case, Eliezer isn’t really directly criticizing the work. He’s more criticizing some interpretations of it that you might have. But I still think it’s the case that it’s good to note what work you’re responding to.)
I wasn’t trying to substantively criticize Eliezer’s views here. I was just trying to criticize the ambiguity in his writing.
I did separately insinuate that Eliezer hasn’t read these papers, based on many past examples of him not reading things that are kind of long and that he’s not very interested in (e.g. here). I think this is a sort of reasonable allocation of effort from his perspective, though I do wish he would clarify it if it was true here (as he’s sometimes done in the past). If he had actually read these papers before making this post, he is welcome to chime in and say so!
I think there’s an unfortunate dynamic where people think that Eliezer pays more attention to the details of AI and AI safety research than he actually does; he doesn’t pay much attention, because he thinks that the research is all stupid and hopeless (which is his prerogative, and I absolutely don’t mean to criticize him for it here).
I don’t think of myself as having a habit of mocking Eliezer in private, though maybe I do; I think that would be bad of me and I don’t want to do it. Please feel free to message me if you think I’ve been inappropriately mocking of Eliezer around you (I don’t know who you are); I’ll also message my coworkers saying that I don’t want to mock him or others. I do harshly criticize him in private/semi-private conversations, and I feel very negatively about lots of aspects of his and MIRI’s work (though I think MIRI’s influence on the world from this point is overall probably positive EV, I’m pretty unsure). I also feel very grateful to Eliezer for the massive positive impact he’s had on many aspects of my life, and the personal kindness he’s shown me. I recently had a conversation with Malo about my feelings and public conduct related to MIRI; you could ask him about that if you want.
Buck’s reply seems reasonable and I feel pretty good about it!
I do wish your first comment were more like this one (although obviously demanding that every critique were contextualized so thoroughly would be ridiculous), but I’m grateful that you unpacked this here. Thanks!
Edit: I also don’t have any evidence that contradicts your account of your own behavior, to be clear.
You could have simply asked whether Eliezer was referring to that particular paper, no?
I’m not sure it’s a good norm to often respond in an unkind way to one particular poster. I get that you’ve apologised, but this is a pattern, and one that would drive me away from the platform were I on the receiving end (and if I actually posted).
A lot of commenters seem confident that this is responding to a specific piece of recent safety work at Anthropic, but my default read was that the initial question was something like ‘how do you feel about [all of the things they have been up to lately]?’
Am I missing some strong evidence for the former interpretation? I see that Eliezer is using examples from specific research, but I think those are just examples, and not the main thing he’s responding to (which is a meta attitude about this flavor of research, afaict).
Yep, general vibes about whether Anthropic & partners’ research is quite telling us things about the underlying strange creature, or a sort of mask that it wears with a lot of roleplaying qualities. I think this generalizes across a swathe of their research, but the Fake Alignment paper did stand out to me as one of the clearer cases.
Not particularly strong, but there’s something along the lines of Eliezer believing he is “one of the last living descendants of the lineage that ever knew how to say anything concrete at all.”
In general, when Eliezer writes about something serious (as opposed to shitposting, dunking on noobs on Twitter, or writing fanfiction), which is very rare to appear on LW these days, I expect him to have concrete and specific stuff in mind.[1] Skimming and vague vibes rarely result in worthwhile contributions.
And to address it:
I’ve noticed your model of 2020s LLMs makes a distinction between inner-Claude and mask-Claude, and the way you phrase it makes it seem like you think this is a real, important distinction. I am a bit confused by this.
Have you gone into detail anywhere publicly on this that I can just read?
Is this something you’re >90% convinced of, so that it would be better for people to take it as given, or is it a working model which it’s worth the time for people to try and evaluate for themselves?
Agree the role-play vs. real scheming distinction is important, but they may not be two separate states so much as points on a continuum.
If models face consistency pressure to maintain coherence across contexts (as a proxy for trustworthiness, perhaps), behaviours that begin as role-play could evolve into genuine preferences. That transition can be subtle, and harder to detect, making it more consequential over time.