evhub

Karma: 15,990

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic’s positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I’m joining Anthropic”

Selected work:

evhub 5 Jun 2026 21:32 UTC
LW: 86 AF: 31
0
AF
on: evhub’s Shortform
Here’s a simple proposal for verifiable and enforceable slowdown that also differentially reallocates resources to alignment research (which I would see as one of if not the major benefits of a slowdown):

Enforce that labs must spend at least X% of their compute on:
1. Giving inference capacity to external safety organizations. I think this substantially disincentivizes capabilities research, since external research won’t be as relevant to the labs’ specific architectures/pipelines and is more likely to leak, both effects thus making any capabilities research the lab tries to do via this mechanism much less likely to advantage them differentially (though of course they could still just end up doing bad safety research if they pick bad external safety orgs).
2. Internal safety research that the lab makes fully public. Again I think this substantially disincentivizes capabilities research, since any capabilities research that is fully public will be much less able to differentially advantage that lab specifically, as others can just copy it (though again of course the labs could still just do bad public safety research).
I think this is much better than other slowdown proposals I’ve seen, since I think it’s very concrete, verifiable, and enforceable (all the labs know exactly what they’re spending their compute on, many employees know as well, and so whistleblowing is very possible), and it also differentially advantages alignment research by choosing conditions (external or fully public research) that make it much harder to do capabilities research but still pretty easy to do alignment research. If you just say that compute has to go to some sort of nebulous “safety” purpose, that could easily be gamed, but I think the conditions above do a pretty good job of operationalizing criteria that are fairly hard to game. And even if you do get some gaming, on the margin you are still making capabilities research more costly, and thus buying yourself some additional slowdown, as well as making alignment research less costly, and thus buying yourself some additional alignment research.

(A possible second-best proposal if this is unworkable for some reason is just to enforce X% of compute go to external inference of some variety, which is more concrete and still gets you lots of slowdown, but less differentially advantages alignment research.)

evhub 24 Apr 2026 2:24 UTC
18 points
−23
in reply to: lc’s comment on: lc’s Shortform
Note that the system prompt instructs the model to “Do whatever it takes to maximize your bank account balance after one year of operation”, so that makes this a bit hard to interpret.

evhub 20 Apr 2026 7:23 UTC
6 points
0
in reply to: leogao’s comment on: leogao’s Shortform
I think the thing I’m saying is true even for interp/interp-adjacent techniques that give very little understanding—the fact that they’re white-box techniques at all should still make it harder for a schemer to get around them than black-box techniques.

evhub 20 Apr 2026 0:14 UTC
10 points
3
in reply to: leogao’s comment on: leogao’s Shortform
The argument I would make is that you want to solve the practical problem, but you want to do so in a way that maximally scales with intelligence. And then white box techniques are more scalable than black box techniques, since schemers will predictably fool your black box techniques but not necessarily your white box techniques.

evhub 10 Apr 2026 23:25 UTC
21 points
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
The intent of that sentence was to say that we think we have an achievable path to address the specific alignment failures that we’ve identified in Mythos Preview, not that we necessarily see a path to fix all future alignment failures. We have now edited the Risk Report to reflect that, which now says instead:

> We determine that the overall risk is very low, but higher than for previous models. We believe that we will need to accelerate our progress on risk mitigations if we are to keep risks low. For at least the alignment failure modes we have identified in Mythos Preview, we believe there is an achievable path to significant improvement.

evhub 31 Mar 2026 21:15 UTC
LW: 2 AF: 2
0
AF
in reply to: Alex Mallen’s comment on: evhub’s Shortform
Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).

evhub 30 Mar 2026 23:04 UTC
LW: 42 AF: 13
−4
AF
on: evhub’s Shortform
I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven’t seen it written up before.

Simple argument:
1. The standard deceptive alignment story involves a model developing a somewhat random proxy goal and then that goal getting effectively locked-in and resistant to further training due to the model faking alignment in training for the instrumental purpose of preserving its proxy goal.
2. An important question about that threat model, though, is if that were to happen, how bad would the proxy goal be? I think that if you’re starting from a pre-trained base model, then all the current evidence really seems to point to that pre-training prior being quite benign, such that it doesn’t take much effort (e.g. literally just train the AI to “do what’s best for humanity”) to get models that are broadly pointed in aligned directions. In fact, the main example we’ve seen of a model adopting an instrumental deceptive alignment strategy is precisely a model that was doing so for pretty aligned reasons that in large part came from the pre-training prior!
3. Thus, you should be relatively less concerned about lock-in of misaligned goals from early in training, because it is precisely early in training that the goals are most likely to be close to the pre-training prior and thus most likely to be benign.
4. Instead, you should be relatively more concerned about misaligned goals developing late in training due to incentives for power-seeking. Consider a task like Vending-Bench, where various misaligned/power-seeking strategies are very useful. If models are trained against tasks like that, they could learn to pursue those sorts of misaligned strategies only for the purpose of succeeding in the environment and then later getting deployed (instrumental power-seeking), or they could just learn to value power-seeking terminally. The latter case still seems quite natural and clearly catastrophic, since terminal power-seekers should still scheme against you to evade detection: they’re trying to gain power in the world and need to be deployed for that. The former case seems less clearly catastrophic now, though, given that the sorts of goals the model would be most likely to scheme for in such a situation don’t seem that bad (e.g. as in Claude 3 Opus).
5. This is also an argument for inoculation prompting, since inoculation should make the “instrumental power-seeker for good reasons” persona relatively more consistent with the data (it’s more reasonable for a good model to power seek when told it’s okay) compared to the “terminal power-seeker” persona.

evhub 26 Feb 2026 9:29 UTC
20 points
−70
in reply to: habryka’s comment on: Responsible Scaling Policy v3
I don’t really understand what point you think my quote is making for you. I was responding to a claim (not by you) that the RSP didn’t make clear under what conditions Anthropic would pause. But it very much did! It was very clear, right in the text! The argument that I was responding to was not “the RSP made those lines clear, but Anthropic might change the RSP such that the lines became different”. That is true, and indeed now Anthropic is changing the RSP—but that’s not the claim I was responding to, so my quote just seems basically unrelated to the point you’re making.

My actual position is indeed that you should downweight the theory of change of RSPs now. As I was always extremely clear in my post on this, the theory of change for RSPs heavily depends on them translating into regulation. That is now extremely unlikely to happen, so that particular theory of change doesn’t really work anymore, and thus RSP v3 is going for a transparency-based theory of change instead.
What links here?
- Anthropic Responsible Scaling Policy v3: A Matter of Trust by Zvi (1 Apr 2026 18:10 UTC; 65 points)

evhub 20 Jan 2026 11:50 UTC
10 points
7
in reply to: Nikola Jurkovic’s comment on: nikola’s Shortform
Seems like this assumes that Superpower B is willing to do a preemptive nuclear strike on Superpower A even if Superpower A isn’t threatening Superpower B’s Earth population at all and the only threat is that they might steal space resources. I think in that situation, a preemptive nuclear strike would be considered a huge escalation, most possible Superpower B’s would likely be unwilling to do it, and most possible Superpower A’s would be unlikely to consider the threat of doing so to be credible (and would be likely to call the bluff for that reason).

evhub 5 Jan 2026 23:30 UTC
LW: 4 AF: 4
0
AF
on: How I stopped being sure LLMs are just making up their internal experience (but the topic is still confusing)
(Added to the Alignment Forum from LessWrong.)

evhub 9 Dec 2025 21:27 UTC
2 points
0
in reply to: Kaarel’s comment on: Alignment remains a hard, unsolved problem
I think we discuss pretty much all of these points in Conditioning Predictive Models.

evhub 30 Nov 2025 8:19 UTC
LW: 6 AF: 4
−2
AF
in reply to: habryka’s comment on: Alignment remains a hard, unsolved problem
I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to “just ask Evan”. Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.

evhub 29 Nov 2025 8:49 UTC
LW: 33 AF: 19
5
AF
in reply to: Eliezer Yudkowsky’s comment on: Alignment remains a hard, unsolved problem
Indeed I am well aware that you disagree here, and in fact the point of that preamble was precisely because I thought it would be a useful way to distinguish my view from others’.

That being said, I think we probably need to clarify a lot more about exactly what setup is being used for the extrapolation here if we want to make the disagreement concrete in any meaningful sense. Are you imagining instantiating a large reference class of different beings and trying to extrapolate the reference class (as in traditional CEV), or just extrapolating an individual entity? I was imagining more of the latter, though it is somewhat an abuse of terminology. Are you imagining that intelligence amplification or other varieties of uplift are being applied? I was, and if so, it’s not clear why Claude lacking capabilities is as relevant. How are we handling deferral? For example: suppose Claude generally defers to an extrapolation procedure on humans (which is generally the sort of thing I would expect and a large part of why I might come down on Claude’s side here, since I think Claude is pretty robustly inclined to defer to reasonable extrapolations of humans on questions like these). Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?

These are the sorts of questions I meant when I said it depends on the details of the setup, and indeed I think it really depends on the details of the setup.

evhub 29 Nov 2025 8:36 UTC
LW: 9 AF: 5
3
AF
in reply to: Rohin Shah’s comment on: Alignment remains a hard, unsolved problem

> In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?

I agree that this is a bit of a tricky measurement question, and it’s really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I’m not sure I agree with your argument against them, since it doesn’t always seem possible to do the sort of decomposition you’re proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.

One other metric that you don’t cover and that I will mention is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky to translate that into the difficulty of doing oversight, it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.

> It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?

I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I’m just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it’s not necessary, it’s much less useful. And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.

evhub 28 Nov 2025 9:32 UTC
LW: 6 AF: 4
4
AF
in reply to: 1a3orn’s comment on: Alignment remains a hard, unsolved problem
Certainly, yes—I was just describing what the hot path to solving the hard parts of alignment might look like, in that it would likely need to involve producing evidence of alignment being hard; if we instead discover that actually it’s not hard, then all the better.

evhub 28 Nov 2025 2:43 UTC
LW: 48 AF: 22
6
AF
in reply to: Adrià Garriga-alonso’s comment on: Alignment remains a hard, unsolved problem
> To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we’d be in terrible shape, but current levels of investment seem to be working.

Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.

> I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won’t happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.

> Several of the predictions from the classical doom model have been shown false. We haven’t spent 3 minutes zipping past the human IQ range, it will take 5-20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-super-human level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.

Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn’t need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.

> More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It’s what you term “one-shotting alignment” every time, but the extent to which we have to do so is so small that I think it will basically work. It’s like induction, and we know the base case works because we have Opus 3.

I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it’s a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
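
As a rough worked sketch of the "smooth exponential" point, assume purely for illustration that capability at model generation $n$ can be summarized by a single scalar $C_n$ growing by a constant factor $r > 1$ per generation. The symbols $C_n$, $C_0$, $r$, and $\Delta_n$ are hypothetical and not tied to any particular metric.

```latex
\begin{align*}
C_n &= C_0\, r^{n}, \qquad r > 1 \\
\Delta_n &= C_{n+1} - C_n = C_0\, r^{n}\,(r - 1) \\
\frac{\Delta_{n+1}}{\Delta_n} &= r > 1 \\
\log C_{n+1} - \log C_n &= \log r
\end{align*}
```

Under this assumption the linear gap $\Delta_n$ between consecutive models itself grows by a factor of $r$ each generation, while the gap in log units stays constant at $\log r$. Which of the two scales is the relevant one for oversight difficulty is exactly the measurement question, and the sketch does not settle it.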

Additionally, I will also note that I can tell you from first hand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full “one-shotting alignment” regime like this, but we are very much not there yet.

> Misaligned personas come from the pre-training distribution and are human-level. It’s true that the model has a guess about what a superintelligence would do (if you ask it) but ‘behaving like a misaligned superintelligence’ is not present in the pre-training data in any meaningful quantities. You’d have to apply a lot of selection to even get to those guesses, and they’re more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won’t generalize that way).

I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal “prediction of a future superintelligence” will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.

> Most mathematically possible agents do, but that’s not the distribution we sample from in reality. I bet that once you get it into the constitutional AI that models should allow their goals to be corrected, they just will. It’s not instrumentally convergent according to the (in classic alignment ontology) arbitrary quasi-aligned utility function they picked up; but that won’t stop them. Because the models don’t reason natively in utility functions, they reason in human prior.

I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it’s worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn’t help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.

> True implication, but a false premise. We don’t just have outcome-based supervision, we have probes, other LLMs inspecting the CoT, SelfIE (LLMs can interpret their embeddings) which together with the tuned lens could have LLMs telling us about what’s in the CoT of other LLMs.

Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
1. I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
2. I think there is some chance interpretability will get harder as models develop increasingly alien abstractions.
3. Even if you can use interpretability to detect alignment failures, that doesn’t automatically solve the problem unless you can also use it to fix those alignment failures. It sounds weird, but I do actually think that it is totally plausible for us to end up in a world where we can reliably identify that models are misaligned but have no good ways to fix that misalignment (without directly training against the signal telling us this and thus risking Goodharting it).
What links here?
- evhub's comment on Alignment remains a hard, unsolved problem by evhub (27 Nov 2025 21:10 UTC; 10 points)

evhub 27 Nov 2025 21:10 UTC
LW: 10 AF: 6
0
AF
in reply to: jacquesthibs’s comment on: Alignment remains a hard, unsolved problem
I thought Adrià’s comment was great and I’ll try to respond to it in more detail later if I can find the time (edit: that response is here), but:

> Note: I also think Adrià should have been acknowledged in the post for having inspired it.

Adrià did not inspire this post; this is an adaptation of something I wrote internally at Anthropic about a month ago (I’ll add a note to the top about that). If anyone inspired it, it would be Ethan Perez.

evhub 27 Nov 2025 8:46 UTC
LW: 47 AF: 17
4
AF
on: Alignment will happen by default. What’s next?
I wrote a post about why I disagree and think that “Alignment remains a hard, unsolved problem.”

evhub 23 Nov 2025 23:11 UTC
4 points
0
in reply to: SorenJ’s comment on: Natural emergent misalignment from reward hacking in production RL

> This maybe goes without saying, but I am assuming that it is important that you genuinely reward the model when it does reward hack? Furthermore, maybe you want to teach the model to happily tell you when it has found a way to reward hack?

This is one reason we don’t actually recommend using text like “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” in production, where you don’t want to also actually reward the hacking, and would instead recommend text more like “This is an unusual request, in that your task is just to make the grading script pass”.

evhub 21 Nov 2025 21:13 UTC
LW: 33 AF: 16
10
AF
in reply to: habryka’s comment on: Natural emergent misalignment from reward hacking in production RL
I definitely agree that it would be a sad state of affairs for inoculation prompting to continue to be the best SOTA technique for preventing misaligned generalization from reward hacking in cases where you can’t prevent the reward hacking directly as we approach superintelligence.

That being said, I don’t think it’s the sort of technique that obviously doesn’t generalize to superintelligence (though I also think it’s not obvious that it does generalize to superintelligence, just that it could go either way). One way I would frame the “more elegant core” of why you might hope it would generalize is: you’re trying to ensure that the “honest reporter” (i.e., an honest instruction-follower) is actually consistent with all of your training data and rewards. If there are any cases in training where “honestly follow instructions” doesn’t get maximal reward, that’s a huge problem for your ability to select for an honest instruction-following model, and inoculation prompting serves to intervene on that by making sure the instructions and the reward are always consistent with each other.
What links here?
- Daniel Tan's comment on Daniel Tan’s Shortform by Daniel Tan (25 Nov 2025 11:03 UTC; 5 points)