One cannot believe that AI development should stop entirely. One cannot believe that the risks are so severe that no level of benefit justifies them. One cannot believe that the people currently working on AI are not the right people to be making these decisions. One cannot believe that traditional political processes might be better equipped to govern AI development than the informal governance of the research community.
FWIW, I believe all those things, especially #3. (Well, with nuance. Like, it’s not my ideal policy package; I think if I were in charge of the whole world we’d stop AI development temporarily and then figure out a new, safer, less power-concentrating way to proceed with it. But it’s significantly better by my lights than what most people in the industry, on Twitter, and in DC are advocating for. I guess I should say I approximately believe all those things, and/or I think they are all directionally correct.)
But I am not representative of the ‘uniparty’ I guess. I think the ‘uniparty’ idea is a fairly accurate description of how frontier AI labs are, including the people in the labs who think of themselves as AI safety people. There are exceptions of course. I don’t think the ‘uniparty’ as described by this anonymous essay is an accurate description of the AI safety community more generally. Basically I think it’s pretty accurate at describing the part of the community that inhabits and is closely entangled with the AI companies, but inaccurate at describing e.g. MIRI or AIFP or most of the orgs in Constellation, or FLI or … etc. It’s unclear whether it’s claiming to describe those groups; it wasn’t super clear about its scope.
well, with nuance. Like, it’s not my ideal policy package, I think if I were in charge of the whole world we’d stop AI development temporarily and then figure out a new, safer, less power-concentrating way to proceed with it. But it’s significantly better by my lights than what most people in the industry and on twitter and in DC are advocating for. I guess I should say I approximately believe all those things, and/or I think they are all directionally correct
With all due respect, I’m pretty sure that the existence of this very long string of qualifiers and very carefully reasoned hedges is precisely what the author means when he talks about intellectualised but not internalised beliefs.

Can you elaborate? What do you think I should be doing or saying differently, if I really internalized the things I believe?
To be honest, I wasn’t really pointing at you when I made the comment, more at the practice of the hedges and the qualifiers. I want to emphasise that (from the evidence available to me publicly) I think that you have internalised your beliefs a lot more than those the author collects into the “uniparty”. I think that you have acted bravely in support of your convictions, especially in the face of the NDA situation, for which I hold immense respect. It could not have been easy to leave when you did.
However, my interpretation of what the author is saying is that beliefs like “I think what these people are doing might seriously end the world” are in a sense fundamentally difficult to square with measured reasoning and careful qualifiers. The end of the world and existential risk are by their nature such totalising and awful ideas that any “sane” interaction with them (as in, trying to set measured bounds and make sensible models) is extremely epistemically unsound, the equivalent of arguing whether 1e8 + 14 people or 1e8 + 17 people (3 extra lives!) will be the true number of casualties in some kind of planetary extinction event when the error bars are themselves ±1e5 or 1e6. (We are, after all, dealing with never-seen-before black swan events.)
In this sense, detailed debates about which metrics to include in a takeoff model, the precise slope of the METR exponential curve, and which combination of chip trade and export policies increases tail risk the most or least are themselves a kind of deception. This is because arguing over details implies that our world and risk models have more accuracy and precision than they actually do, and in turn that we have more control over events than we actually do. “Directionally correct” is in fact the most accuracy we’re going to get, because (per the author) Silicon Valley isn’t actually doing some kind of carefully calculated compute-optimal RSI takeoff launch sequence with a well understood theory of learning. The AGI “industry” is more like a group of people pulling the lever of a slot machine over and over and over again, egged on by a crowd of eager onlookers, spending down the world’s collective savings accounts until one of them wins big. By “win big”, of course, I mean “unleashes a fundamentally new kind of intelligence into the world”. And each of them may do it for different reasons, and some of them may in their heads actually have some kind of master plan, but all it looks like from the outside is ka-ching, ka-ching, ka-ching, ka-ching...
OK, thanks! It sounds like you are saying that I shouldn’t be engaged in research projects like the AI Futures Model, AI 2027, etc.? On the grounds that they are deceptive, by implying that the situation is more under control, more normal, more OK than it is?
I agree that we should try to avoid giving that impression. But I feel like the way forward is to still do the research but then add prominent disclaimers, rather than abandon the research entirely.

I agree with this fwiw.
Silicon Valley isn’t actually doing some kind of carefully calculated compute-optimal RSI takeoff launch sequence with a well understood theory of learning. The AGI “industry” is more like a group of people pulling the lever of a slot machine over and over and over again, egged on by a crowd of eager onlookers, spending down the world’s collective savings accounts until one of them wins big. By “win big”, of course, I mean “unleashes a fundamentally new kind of intelligence into the world”. And each of them may do it for different reasons, and some of them may in their heads actually have some kind of master plan, but all it looks like from the outside is ka-ching, ka-ching, ka-ching, ka-ching...
Just to be clear, while I “vibe very hard” with what the author says on a conceptual level, I’m not directly calling for you to shut down those projects. I’m trying to explain what I think the author sees as a problem within the AI safety movement. Because I am talking to you specifically, I am using the immediate context of your work, but only as a frame, not as a target. I found AI 2027 engaging, a good representation of a model of how takeoff will happen, and I thought it was designed and written well (tbh my biggest quibble is “why isn’t it called AI 2028?”). The author is very, very light on actual positive “what we should do” policy recommendations, so if I talked about that I would be filling in with my own takes, which probably differ from the author’s in several places. I am happy to do that if you want, though probably not publicly in a LW thread.

@Daniel Kokotajlo Addendum:
Finally, my interpretation of “Chapter 18: What Is to Be Done?” (and the closest I will come to answering your question based on the author’s theory/frame) is something like “the AGI-birthing dynamic is not a rational dynamic, therefore it cannot be defeated by policies or strategies that are focused on rational action”. Furthermore, since each actor wants to believe that their contribution to the dynamic is locally rational (if I don’t do it someone else will / I’m counterfactually helping / this intervention will be net positive / I can use my influence for good at a pivotal moment [...] pick your argument), further arguments about optimally rational policies only encourage the delusion that everyone is acting rationally, making them dig in their heels further.
The core emotions the author points to as motivating the AGI dynamic are: the thrill of novelty/innovation/discovery; paranoia and fear about “others” (other labs/other countries/other people) achieving immense power; distrust of the institutions, philosophies, and systems that underpin the world; and a sense of self-importance/destiny. All of these can be justified with intellectual arguments, but they are often the bottom line that comes before such arguments are written. On the other hand, the author also shows how poor emotional understanding and estrangement from one’s emotions and intuitions lead to people getting trapped by faulty but extremely sophisticated logic. Basically, emotions and intuitions offer first-order heuristics in the massively high-dimensional space of possible actions/policies, and when you cut off the heuristic system you are vulnerable to high-dimensional traps/false leads that your logic or deductive abilities are insufficient to extract you from.
Therefore, the answer the author is pointing at is something like an emotional or frame realignment challenge. You don’t start arguing with a suicidal person about why the logical reasons they have offered for jumping don’t make sense (at least, you don’t do this if you want them to stay alive); you try to point them to a different emotional frame or state (i.e. calming them down and showing them there is a way out). Though he leaves it very vague, it seems that he believes the world will also need such a fundamental frame shift or belief-reinterpretation to actually exit this destructive dynamic, the magnitude of which he likens to a religious revelation and compares to the redemptive power of love. Beyond this point I would be filling in my own interpretation and I will stop there, but I have a lot more thoughts about this (especially the idea of love/coordination/ends to Moloch).
You are obviously not in the AGI uniparty (e.g. you chose to leave despite great financial cost).
Basically I think it’s pretty accurate at describing the part of the community that inhabits and is closely entangled with the AI companies, but inaccurate at describing e.g. MIRI or AIFP or most of the orgs in Constellation, or FLI or … etc.
I agree with most of these, though my vague sense is that some Constellation orgs are quite entangled with Anthropic (e.g. sending people to Anthropic, Anthropic safety teams coworking there, etc.), and Anthropic seems like the cultural core of the AGI uniparty.

OK, cool.
FWIW, I disagree that Anthropic is the cultural core of the AGI uniparty. I think you think that because “Being EA” is one of the listed traits of the AGI uniparty, but that’s maybe one of the places I disagree with the author: “Being EA” is maybe a common trait in AI safety (though a decreasingly common one, unfortunately, IMO), and it’s certainly not a common trait in the AI companies. I think the AGI uniparty should be a description of the culture of the companies rather than a description of the culture of AI safety more generally (otherwise, it’s just false). I’d describe the AGI uniparty as the people for whom this is true:
One cannot believe that AI development should stop entirely. One cannot believe that the risks are so severe that no level of benefit justifies them. One cannot believe that the people currently working on AI are not the right people to be making these decisions. One cannot believe that traditional political processes might be better equipped to govern AI development than the informal governance of the research community.
...and I’m pretty sure that while this is true for Anthropic, OpenAI, xAI, GDM, etc., it’s probably somewhat less true for Anthropic than for the others, or at least than for OpenAI.
As someone currently at an AI lab (though certainly disproportionately LW-leaning for that cluster), my stances on these, respectively, would be:
“AI development should stop entirely”: oh man, it depends exactly how you operationalize it. I’d likely push a button that magically stopped it for 10 years, maybe for 100, probably not for all time, though I don’t think the latter would be totally crazy. None of said buttons would be my ideal policy proposal. In all cases the decisionmaking is motivated by downstream effects on the long-run quality of the future, not by mundane benefits or company revenue or whatever.
“risks are so severe that no level of benefit justifies them”: nah, I like my VNM continuity axiom, thank you very much; no ontologically incommensurate outcomes for me. I do think they’re severe enough that benefits on the order of “guaranteed worldwide paradise for a million years for every living human” don’t justify increasing them by 10%, though! (A toy expected-utility version of this tradeoff is sketched below, after these four answers.)
“the people currently working on AI are not the right people to be making these decisions”: absolutely. Many specific alternative decisionmakers would be worse, but I don’t think the current setup is anything like optimal.
“traditional political processes might be better equipped to govern AI development than the informal governance of the research community”: since ‘might’ is a very weak word, I obviously agree with this. Do I think it’s more likely than not? Idk; it’ll depend on your operationalization, but probably. I do think there are non-consequentialist (and second-order consequentialist) reasons to default in favor of existing legitimate forms of government for this kind of decisionmaking, so it isn’t just a question of who is better equipped in a magic hypothetical where you perfectly transfer the power.
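One way to make the continuity-axiom point concrete is a toy expected-utility comparison. This is only an illustrative sketch under strong simplifying assumptions: it reads “increasing them by 10%” as ten percentage points of added catastrophe risk, ignores baseline risk, treats the choice as a single binary lottery, and the symbols $u_0$ (status quo), $u_p$ (the guaranteed paradise outcome), and $u_x$ (existential catastrophe) are hypothetical placeholders rather than anyone’s actual estimates. On those assumptions, declining the gamble means:

\[
u_0 \;>\; 0.9\,u_p + 0.1\,u_x
\quad\Longleftrightarrow\quad
u_0 - u_x \;>\; 9\,(u_p - u_0),
\]

i.e. the gap between the status quo and catastrophe is judged to be more than nine times the gap between the guaranteed paradise and the status quo. The continuity axiom still guarantees that for a small enough added risk (or a large enough benefit) the inequality flips, so no outcome is being treated as infinitely bad.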
I don’t think my opinions on any of these topics are particularly rare among my coworkers either, and indeed you can see some opinions of this shape expressed in public by Anthropic very recently! Quoting from the constitution or “The Adolescence of Technology”, I think there’s quite a lot in the theme of the third and fourth supposedly-unspeakable thoughts from the essay:
Claude should generally try to preserve functioning societal structures, democratic institutions, and human oversight mechanisms
We also want to be clear that we think a wiser and more coordinated civilization would likely be approaching the development of advanced AI quite differently—with more caution, less commercial pressure, and more careful attention to the moral status of AI systems. [...] we are not creating Claude the way an idealized actor would in an idealized world
Claude should refuse to assist with actions that would help concentrate power in illegitimate ways. This is true even if the request comes from Anthropic itself. [...] we want Claude to be cognizant of the risks this kind of power concentration implies, to view contributing to it as a serious harm that requires a very high bar of justification, and to attend closely to the legitimacy of the process and of the actors so empowered.
It is somewhat awkward to say this as the CEO of an AI company, but I think the next tier of risk [for seizing power] is actually AI companies themselves. [...] The main thing they lack is the legitimacy and infrastructure of a state [...] I think the governance of AI companies deserves a lot of scrutiny.
“risks are so severe that no level of benefit justifies them” nah, I like my VNM continuity axiom thank you very much, no ontologically incommensurate outcomes for me. I do think they’re severe enough that benefits on the order of “guaranteed worldwide paradise for a million years for every living human” don’t justify increasing them by 10% though!
What about… a hundred million years? What does your risk/benefit mapping actually look like?