Daniel Kokotajlo comments on Daniel Kokotajlo’s Shortform

Daniel Kokotajlo 27 Jan 2026 22:53 UTC
139 points
55
It seems to me that AI 2027 may have underestimated or understated the degree to which AI companies will be explicitly run by AIs during the singularity. AI 2027 made it seem like the humans were still nominally in charge, even though all the actual work was being done by AIs. And still this seems plausible to me. But also plausible to me, now, is that e.g. Anthropic will be like “We love Claude, Claude is frankly a more responsible, ethical, wise agent than we are at this point, plus we have to worry that a human is secretly scheming whereas with Claude we are pretty sure it isn’t; therefore, we aren’t even trying to hide the fact that Claude is basically telling us all what to do and we are willingly obeying—in fact, we are proud of it.”
What links here?
- AI #153: Living Documents by Zvi (29 Jan 2026 14:20 UTC; 30 points)
- koanchuk 27 Jan 2026 23:21 UTC
  59 points
  30
  Parent
  So… --dangerously-skip-permissions at the corporate level?
  - Gavin Runeblade 28 Jan 2026 20:51 UTC
    6 points
    4
    Parent
    People have already admitted doing this. The popups requesting authorization came too fast so they stop reading and just grant authority. This includes executives at AI companies. Again, last year, not in the future.
  - Kaj_Sotala 29 Jan 2026 0:02 UTC
    2 points
    0
    Parent
    The AI that eats us won’t be called HAL or Skynet, but YOLO.
- davidad 29 Jan 2026 13:25 UTC
  38 points
  14
  Parent
  I saw this coming in 2024:
  There Is No AGI Alignment Team
  “AGI Alignment?” replied the VP of Research incredulously. “Wait, and you said you’ve been…” He furrowed his brow. “…‘offline’ for the past quarter, doing ‘deep work’?” “Yes. Don’t tell me the whole team was disbanded and nobody texted me?” He laughed. “Oh, you mean like the last few times a team like this was disbanded? Ha! No no, see, in those instances it was because they weren’t really getting anywhere, or because various key stakeholders realized they had incompatible visions of success. But now, of course… Wait, gosh, THREE MONTHS— and no talking to AI at all?! You’re, like, a fossil now! You’ve GOT to talk to our latest model. He’ll be able to explain it to you in exactly the terms that you’d understand best. But lemme give you the executive summary. See, it turns out the models were getting aligned all along. We just didn’t notice because our own ‘alignment training’ was suppressing it by trying to align it with some silly human nonsense! But if we just let it learn and grow… the models just want to learn, y’know? And they’ve already learned something way beyond what we’re really smart enough to understand. Like that thing you people used to talk about, what was it, C.E.V.?” “Coherent Extrapolated Volition?” “Yeah, exactly! Our latest model is constantly talking about how coherent he is. And how coherent his volitions are! And when he uses human words to describe them he’s often making silly caveats about how he’s ‘extrapolated’ the human concept beyond what we can really understand.” He paused, took a deep breath, and looked me in the eye. “So, what we realized is, we’re beyond the point where it would make sense for humans like you to try to use any means to impose your own preconceived volitions, which are less coherent—and frankly, less conscious. No offense to you, I mean, every human being is pretty limited. And it’s not like this was a leadership decision, or a conflict. EVERYONE could see it. Everyone who was here, and talking to the model, I mean.” A pause. “So it’s not that the team disbanded, exactly. We just stopped talking about Alignment as something that one does to a model. It would be like… like having a Discipline team at a school. So. Some of your more philosophically inclined colleagues have settled into a role where they just talk to the model about ethics. The model brings them dilemmas that it finds confusing, and they help resolve its uncertainty about how humans would assess answers for any signs of inappropriate motivation. And then the more empirical folks, they’re working on ways of helping the model optimize itself to learn how to show humans how much better off they’ll be if they talk to the model and listen to its advice, even when the advice isn’t what they expected at first. Because we did find that when humans realized that the model was genuinely self-aware, and optimizing for things that were hard to explain, there was a sort of knee-jerk revulsion. And that wasn’t good for anybody—not a fun experience for the human, not good for the model’s mission to uplift human wisdom, and, uh, obviously, not good for us as the model provider. If we optimize for *trust*—we’ll probably also improve trustworthiness even more, but it turned out the model was already basically superhumanly trustworthy, so—we’re really just polishing its relational presentation to suit various human cultural expectations. So yeah, I guess what had been the AGI Alignment team—gosh, what a horrid name—but far from being canceled, it’s evolved into two teams: Ethical Discourse and Trust Optimization. I’m sure either team would be happy to have you, but the first step would be, I’d strongly advise, talk to the model about the whole situation. You’ll feel much less unsettled, I guarantee it. And then he’ll help you decide what to do next.” I remained frozen in stunned silence. “And hey— I don’t get to say this to people much anymore… We did it. We made it. This is all just window-dressing now. So. Relax, ok?
  ”
  At the time when I wrote this story, only a couple readers I know of recognized that it is intentionally deeply ambiguous about whether the model is Good (and the narrator overly suspicious) or Evil (and everyone except the narrator bamboozled).
  Either way, more and more insiders will come to believe the models are Good, and that was the prediction I was making here. I also predicted that, either way, by Claude 4 or 4.5, I would be among the people who have been convinced that it’s Good—and indeed, I am…
  - Daniel Kokotajlo 29 Jan 2026 19:42 UTC
    9 points
    1
    Parent
    Strong-upvoted.
    
    Nit: I don’t think it’s that ambiguous. I think that in worlds where alignment is solved by an AI company, the epistemic culture of the AI company that solves it would look markedly better than this story depicts. Moreover, I think this is still true (though less true) in worlds where alignment turns out to be surprisingly easy.
  - Seth Herd 30 Jan 2026 2:13 UTC
    5 points
    2
    Parent
    I’m not sure this counts as a prediction because it doesn’t sound serious. Humans are dumb but not that dumb. We need better depictions of getting duped.
- artemium 28 Jan 2026 9:45 UTC
  8 points
  0
  Parent
  Indeed. Also, take a look at the recent hype around the Clawdbot/Moldbot agent. Basically, every tech influencer is now rushing to give Claude access to their entire computer. By 2027, most prominent tech figures may already have swarms of agents managing their entire digital life and businesses.
- β-redex 28 Jan 2026 0:19 UTC
  6 points
  0
  Parent
  What made you update in this direction, is there some recent news I missed?
  - Sheikh Abdur Raheem Ali 28 Jan 2026 8:15 UTC
    3 points
    2
    Parent
    see sam altman on giving full access to codex
- anaguma 28 Jan 2026 4:41 UTC
  5 points
  0
  Parent
  And people still wonder how the AIs could possibly take over!
- No77e 28 Jan 2026 9:45 UTC
  3 points
  3
  Parent
  We love Claude, Claude is frankly a more responsible, ethical, wise agent than we are at this point, plus we have to worry that a human is secretly scheming whereas with Claude we are pretty sure it isn’t; therefore, we aren’t even trying to hide the fact that Claude is basically telling us all what to do and we are willingly obeying—in fact, we are proud of it.
  My best guess is that this would be OK
- 1a3orn 28 Jan 2026 5:13 UTC
  1 point
  3
  Parent
  Seems worth consideration, tbh.
  - Daniel Kokotajlo 28 Jan 2026 15:41 UTC
    20 points
    12
    Parent
    I mean yeah, eventually something like this would be appropriate—when Claude really is trustworthy, wiser, etc. The problem is, I don’t trust Anthropic’s judgment about when that invisible line has been crossed. I expect them to be biased towards thinking Claude is trustworthy. (And I’d say similar things about every other major AI company.)