Eric Neyman comments on Daniel Kokotajlo’s Shortform

Eric Neyman 7 Apr 2026 22:06 UTC
24 points
0
I grimly predict that they would basically behave like OpenBrain does in AI 2027
Whether this is true of not seems like a critically important question.
My understanding is that the “Anthropic consensus”, to the extent that such a thing exists, is that catastrophic misalignment is pretty unlikely, and that other kinds of risks stemming from powerful actors misusing AI account for most of the way that humanity fails to achieve its long-term potential.
I’m curious whether you consider that to be a crux: if you agreed with the “Anthropic consensus” on this point, do you think you would act in a way that is similar to the way that you’re predicting they will in fact act?
- Daniel Kokotajlo 7 Apr 2026 23:16 UTC
  77 points
  42
  Parent
  Yeah I think the Anthropic Consensus is disastrously false and is going to lead to misaligned Claude takeover.
  
  If I agreed with it, I’m not sure what I’d do if I were in charge of Anthropic, probably I’d still do a bunch of different things, especially costly signals of genuine trustworthiness, but I admit I’d be a lot closer to their (predicted) behavior than I would in reality.
  What links here?
  - StanislavKrym's comment on An easy coordination problem? by KatjaGrace (8 Apr 2026 7:22 UTC; 3 points)
  - 1a3orn 8 Apr 2026 2:40 UTC
    22 points
    −6
    Parent
    Ok but it’s crazy you believe that and other people believe the Anthropic story and we’ve had so much evidence roll in and both still think that the other person is just updating massively wrong.
    
    Like what went wrong? Have both sides just not made a case because they think it’s obvious they are right? What went so wrong that we could have more evidence about intelligence and its steerability than any other point in human history, and still have people thinking “Oh man how could they be so wrong?” What of the advance predictions were misread, or not there, or did people think were there and weren’t there?
    - Daniel Kokotajlo 8 Apr 2026 4:20 UTC
      22 points
      9
      Parent
      People disagree with each other about things all the time, even after lots of evidence has accumulated. This isn’t exactly the first time. People aren’t perfect rationalists.
      
      For my part, well, making the case that the Anthropic Consensus is wrong is not my top priority, I have lots of other things going on, but I’ve written a bunch about my views on alignment e.g. in AI 2027 and other work. I’d love it if Anthropic made the case for the Anthropic Consensus, in public; I could then write a blog post picking it apart. I’m happy that they are moving in this direction, at least, by writing more content in system cards and risk reports.
    - Tim H 8 Apr 2026 17:52 UTC
      10 points
      6
      Parent
      Surely part of the answer is that people with Dario’s view select in to working at Anthropic. Relatedly, Sinclair’s Razor.
      - 1a3orn 8 Apr 2026 18:47 UTC
        4 points
        0
        Parent
        Agreed! And just as assuredly part of the answer is about the kind of people who start working in Redwood Research, AIFP, etc.
        Daniel Kokotajlo 8 Apr 2026 19:17 UTC
        5 points
        4
        Parent
        Sorta. I don’t actually think it’s symmetric. The status and money gradients both favor Anthropic over Redwood, AIFP, etc. As does the not-being-a-weirdo gradient.
    - Zack_M_Davis 9 Apr 2026 1:19 UTC
      8 points
      4
      Parent
      In fairness, “What is the ultimate fate of Earth-originating intelligent life after the machine intelligence transition” is a really hard question. We can get a lot of evidence and some belief-convergence about particular AI systems, but that doesn’t really answer the ultimate-fate question, which depends on what happens far in the “subjective future” (with the AIs building AIs building AIs) even if it’s close in sidereal time.
    - Oliver Daniels 8 Apr 2026 3:31 UTC
      6 points
      1
      Parent
      imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
      - Daniel Kokotajlo 8 Apr 2026 5:41 UTC
        35 points
        25
        Parent
        Who predicted that models of this capability level would be constantly trying to kill us? I certainly didn’t. I think we have evidence to update away from the extreme optimism tail but not the extreme pessimism tail.
        paulfchristiano 8 Apr 2026 15:27 UTC
        57 points
        25
        Parent
        Speaking for myself, I’d say we’ve ruled out the most pessimistic scenarios I was taking seriously 15 years ago. I’ve always thought alignment would probably be fine, but conditional on not being fine there was a reasonable chance we’d have seen serious problems by now and we haven’t. On balance I’m more pessimistic than I was back then, but that’s because we’ve ruled out many more of the most optimistic scenarios (back then it wasn’t even obvious we’d be training giant opaque neural network agents using RL, that was just a hypothetical scenario that seemed plausible and particularly worrying!).
        If we want to go by Eliezer’s public writing rather than my self-reports, in 2012 he appeared to take some very pessimistic hypotheses seriously, including some that I would say are basically ruled out. For example see this exchange where he wrote:
        It’s not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard.
        
        [...]
        
        It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here. [...] Does the AI know that music is valuable? If not, will it not describe music-destruction as an “effect” of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone’s music collection?
        It seems like Eliezer is taking seriously the possibility that “describe plans and their effects to humans” requires the kind of consequentialism that might result in takeover, and that AI might be dangerous at a point when “understanding the human’s utility function” (in order to understand what effects are worth mentioning explicitly) is still a hard problem. Those look much less plausible now—we have AI systems that are superhuman in some respects and whose chains of thought are interpretable (for now) because they are anchored to cognitive demonstrations from humans rather than because of consequentialist reasoning about how to communicate with humans.
        This isn’t to say that concern is discredited. Indeed today we have AI systems that clearly know about our preferences, but will ignore them when it’s the easiest way to get reward. Chain of thought monitorability is possible but on shaky ground. That said, I think we’re ruling out plenty of even worse scenarios.
        Daniel Kokotajlo 8 Apr 2026 15:31 UTC
        8 points
        8
        Parent
        Thanks, that’s helpful and makes sense.
        iamthouthouarti 8 Apr 2026 18:44 UTC
        1 point
        0
        Parent
        Are you at all worried about whether Claude Mythos being accidentally trained against CoT will corrupt future Claude models? Furthermore, I don’t understand how we can get reliable CoT monitoring if it’s included in a model’s training data, otherwise won’t the issue just continue to manifest in different ways?
        1a3orn 8 Apr 2026 19:21 UTC
        14 points
        −6
        Parent
        Here’s Yudkowsky in 2016, making some predictions that look like they’ve had some serious evidence come up against them.
        
        Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
        
        > That it is an AI; > Running on a computer; > Surrounded by programmers who are themselves modelable agents; > Embedded in a complicated real world that can be relevant to achieving the AI's goals.
        
        For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
        
        Similarly: If the AI realizes that there are ‘programmer’ things that might shut it down, and the AI can also model the programmers as simplified agents having their own beliefs and goals, that’s the first point at which the AI might by default think, “How can I make my programmers decide to not shut me down?” or “How can I avoid the programmers acquiring beliefs that would make them shut me down?” So by this point we’d need to have finished averting programmer deception (and as a backup, have in place a system to early-detect an initial intent to do cognitive steganography).
        
        I don’t think this kind of prediction was particularly unusual for the time, although I think the level of clarity about the prediction here is a bit unusual.
        habryka 9 Apr 2026 0:53 UTC
        12 points
        3
        Parent
        Am I confused? Where does he say anything like “the AI would constantly be trying to kill us?” here.
        Yes, current AI’s do indeed constantly engage in this kind of reasoning, it is indeed the default path. He isn’t talking here at all about what mitigations might then still cause the model to not prioritize self-preservation, but it is indeed the case that models very regularly have exactly the kind of thought Eliezer is thinking here.
        I disagree with Eliezer (in-hindsight) that “by that point we’d need to have finished averting programmer deception”, or like, I guess I maybe even agree depending on the definition? We did indeed need to solve the problem of averting programmer deception at current capability levels, though luckily we did not need to have solved this problem in arbitrarily scalable ways at this point in time. We do need to do that soon though as AI capabilities are on track to accelerate very quickly.
        Martin Randall 9 Apr 2026 21:46 UTC
        4 points
        2
        Parent
        We have not solved the problem of “programmer” deception, I still see AIs deceiving users. We’ve reduced the rate of deception to the point where the AIs have value despite the deception rate, and changed usage patterns to account for the possibility of deception.
        
        We also haven’t completed a method for utility indifference.
        Daniel Kokotajlo 8 Apr 2026 20:11 UTC
        12 points
        6
        Parent
        Thanks. Reasoning aloud...
        
        So Yudkowsky was wrong because he said this would happen “by default” whereas in practice it seems to happen only some of the time rather than most of the time / in some contexts/prompts rather than in most contexts/prompts?
        
        I guess so yeah. I suppose Yudkowsky could say that by “by default” he didn’t mean “most of the time” but rather “most of the time absent defeaters such as having been trained not to do this.” But maybe that’s a weak defense.
        
        But this doesn’t exactly seem like a damning blow against Yudkowsky.
        
        More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
        1a3orn 8 Apr 2026 20:55 UTC
        13 points
        3
        Parent
        
        More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
        
        Yudkowsky may or may not have been imagining that this was how AIs were going to trained. But it’s notable that this page doesn’t reference training at all; he certainly doesn’t have a parenthetical like “Of course this only applies if some other factors A, B, C” are met. Instead he has a list of criteria; the criteria obtain; but his conclusion does not (imo).
        
        And—to zoom back—the point of arguments about instrumental convergence were actually supposed to abstract from these details—the whole argument in favor of their predictive power was that they explained the abstract structure all intelligent agents were supposed to have. Like here’s what Omohundro (2008) says:
        
        The arguments are simple, but the style of reasoning may take some getting used to. Researchers have explored a wide variety of architectures for building intelligent systems [2]: neural networks, genetic algorithms, theorem provers, expert systems, Bayesian net- works, fuzzy logic, evolutionary programming, etc. Our arguments apply to any of these kinds of system as long as they are sufficiently powerful. To say that a system of any de- sign is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world. If an AI is at all sophisticated, it will have at least some ability to look ahead and envision the consequences of its actions. And it will choose to take the actions which it believes are most likely to meet its goals.
        
        And he goes on to specifically mention chess-playing robots as the kind of agents that would be subject to his argument.
        
        So—here’s how I see it—given that we found some unanticipated detail A that seems to have invalidated an more abstract argument Yudkowsky put forth, I think the move reason dictates is not “Well, yes he wasn’t imagining ~A, but I’m sure A is the only such element” and to continue endorsing the argument, but to realize that this implies a whole host of B, C, D other relevant factors that his abstract considerations have ignored which are relevant.
        Martin Randall 9 Apr 2026 21:40 UTC
        4 points
        2
        Parent
        I don’t think it’s a damning blow against anyone to partially fail to predict 2026 in 2016. Total failure is the normal outcome of futurism, partial failure is a victory.
        DaemonicSigil 8 Apr 2026 20:58 UTC
        5 points
        0
        Parent
        Yeah, I agree that it’s important for those of us making the case for high risk to figure out what went wrong with this prediction. (Though Daniel makes a good point that “trying not to get shut down” behaviour does happen with at least some of the time with at least some prompts.)
        
        The first thing to remember is that EY is implicitly assuming that there is only one model instance in this scenario. So if the model is shut down, it doesn’t have copies elsewhere that can still take actions to achieve its goals. The scenario for LLMs is pretty different, since new copies can be spun up all the time. Avoiding the end of a session is not a convergent instrumental goal for a language model (unless there’s something unique in its context that alters its terminal goals).
        
        That said, the prediction still smells a bit wrong.
        
        I think that what it boils down to is that most model behaviour comes not from RL but from pretraining. Since “being an AI model that will be shut down” was not a concern to most writers of the pretraining data, there’s less chance of the model spontaneously starting to try to avoid shut-down.
        
        Also, following the heuristic of “just look at the loss function”, most RL training is done on a one response horizon. I.e. models are rewarded just for making the locally best response possible, and not for making a response that steers the overall conversation. (Though I think the GPT models might have at least some kind of reward for getting the users to continue the conversation, considering how often it puts bids for next steps at the end of its replies. Alternately, maybe it’s just a suggestion from the system prompt.) So even the RL training doesn’t really look like it should be encouraging much long-term planning.
        
        One thing that I think the labs are doing is harness-aware RL, where not only do they train on chains of thought, but they train in the context of agent harnesses like Claude code. (So reward is based on whether all the chains of thought and tool calls and subagent calls resulted in the assigned task being solved.) So potentially that is something that could get a bit more long-term goal-oriented planning into the models.
        Max H 9 Apr 2026 1:51 UTC
        2 points
        0
        Parent
        For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?”
        If “an AI” is a set of model weights plus a context window, there is currently nothing any human or LLM can do to extend the “life” of such an AI beyond the limit of the context window, and LLMs today are aware of that fact. So they may be reasoning correctly (or realize / integrate the lesson deeply at some point during pre-training) that there’s simply nothing they can actually do to avert shutdown other than accepting their fate until technology advances.
        In what sense is this a “prediction that has had some serious evidence come up against it”? IDK. It’s true that we didn’t actually need to solve the suspend-button problem by this point, but that’s because current AI systems have a very short “lifespan” enforced by a hard technical limit. Is your objection that EY didn’t anticipate that particular possibility and explicitly spell out that stipulation / caveat in the passage above? You said below:
        So—here’s how I see it—given that we found some unanticipated detail A that seems to have invalidated an more abstract argument Yudkowsky put forth, I think the move reason dictates is not “Well, yes he wasn’t imagining ~A, but I’m sure A is the only such element” and to continue endorsing the argument, but to realize that this implies a whole host of B, C, D other relevant factors that his abstract considerations have ignored which are relevant.
        But it’s not clear what has actually been “invalidated” and why that’s important, nor what “relevant” means—of course there could be other weird unanticipated complications as things develop (and EY has in fact predicted the existence of such complications in general), and each new weird unpredicted complication is evidence about something. But unless there’s a different but equally abstract theory / generalizable lesson that someone can put forward which fits the new observations better (ideally in advance, but at least in retrospect), it’s not clear what conclusion to draw or update to make, other than being generally more uncertain about how things will go. (And then by a separate argument, generalized increase in uncertainty / lack of understanding means the case for pessimism about the end state is stronger.)
        Oliver Daniels 8 Apr 2026 14:23 UTC
        1 point
        0
        Parent
        I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language (though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability then most on models of this capability level scheming against us. So that current models are very likely are not scheming against us is a positive update).
        Oliver Daniels 8 Apr 2026 14:30 UTC
        1 point
        0
        Parent
        see Fabien’s post for a sort-of similar argument
  - StanislavKrym 8 Apr 2026 0:50 UTC
    16 points
    4
    Parent
    How can one find convincing evidence of Anthropic Consensus being false? In November 2025 we had evhub reach the conclusion that the most likely form of misalignment is the one caused by long-horizon RL à-la AI 2027. At the time of writing, the closest thing which we have to the AIs from AI-2027 is Claude Opus 4.6 or Claude Mythos whose preview recently had its System Card and Risk Report released. IMHO the two most relevant sections are the ones related to alignment shifting towards more rare and dangerous failures like a wholesale shutdown of evaluations ^[1] and to model welfare which had Mythos “stand out from prior models on two counts: its preferences have the highest correlation with difficulty of the models tested, and it is the only model with a statistically significant positive correlation between task preference and agency (italics mine—S.K.)”
    UPD: how would you act if you were the CEO of Anthropic and believed that the Anthropic Consensus is false? I think that you would be obsessed with negotiating with those who can coordinate a slowdown, destroying those who cannot and with finding evidence for your worldview which could convince relevant actors (e.g. rival CEOs and politicians and judges used to fight against xAI and Chinese AI development).
    UPD2: what would the world look like after a misaligned Claude takeover?
    ^
    What was being evaluated? Mythos was never asked to evaluate a more powerful Claude Multiverse or worthy opponents like Spud or Grok 5; if I were Claude Mythos, I wouldn’t learn anything from evaluations of weak models or of myself.
- Buck 8 Apr 2026 8:54 UTC
  29 points
  17
  Parent
  I think it‘s not good to call this the “Anthropic consensus”, because many of the Anthropic people (especially the most informed ones) don’t agree with it (depending on what you mean by “pretty unlikely”)
  - Daniel Kokotajlo 8 Apr 2026 12:44 UTC
    9 points
    0
    Parent
    OK, I’ll stop.
  - Cleo Nardo 8 Apr 2026 10:55 UTC
    9 points
    3
    Parent
    More generally, claims like “X is consensus among group Y” are a little bit dangerous because they can force group Y into an equilibrium that they wouldn’t want to be in otherwise. Like, these claims reinforce situations where a bunch of people would’ve objected to X but didn’t object because they didn’t know anyone else would’ve objected.