So Yudkowsky was wrong because he said this would happen “by default” whereas in practice it seems to happen only some of the time rather than most of the time / in some contexts/prompts rather than in most contexts/prompts?
I guess so yeah. I suppose Yudkowsky could say that by “by default” he didn’t mean “most of the time” but rather “most of the time absent defeaters such as having been trained not to do this.” But maybe that’s a weak defense.
But this doesn’t exactly seem like a damning blow against Yudkowsky.
More generally, it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs, which seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or with reward, or some other such myopic thing.
Yudkowsky may or may not have been imagining that this was how AIs were going to be trained. But it’s notable that this page doesn’t reference training at all; he certainly doesn’t have a parenthetical like “Of course, this only applies if some other factors A, B, C are met.” Instead he has a list of criteria; the criteria obtain; but his conclusion does not hold (imo).
And, to zoom back out: arguments about instrumental convergence were actually supposed to abstract away from these details. The whole case for their predictive power was that they captured the abstract structure that all intelligent agents were supposed to share. Here’s what Omohundro (2008) says:
The arguments are simple, but the style of reasoning may take some getting used to. Researchers have explored a wide variety of architectures for building intelligent systems [2]: neural networks, genetic algorithms, theorem provers, expert systems, Bayesian networks, fuzzy logic, evolutionary programming, etc. Our arguments apply to any of these kinds of system as long as they are sufficiently powerful. To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world. If an AI is at all sophisticated, it will have at least some ability to look ahead and envision the consequences of its actions. And it will choose to take the actions which it believes are most likely to meet its goals.
And he goes on to specifically mention chess-playing robots as the kind of agents that would be subject to his argument.
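Omohundro’s point is architecture-independence: the argument only needs an agent with goals, some ability to envision consequences, and a rule of picking the action it believes best serves its goals. A minimal sketch of that abstract model (my own toy illustration, not code from the paper; the world model and goal here are arbitrary stand-ins):

```python
# A toy instance of the abstract agent model Omohundro describes:
# goals + a model for envisioning consequences + choosing the action
# believed most likely to meet the goals. All specifics are invented.

def envision(state, action):
    """Toy world model: predict the successor state of taking `action`."""
    return state + action

def goal_score(state):
    """Toy goal: the agent wants the state to be as close to 10 as possible."""
    return -abs(10 - state)

def choose_action(state, actions):
    """One-step lookahead: pick the action whose predicted outcome
    scores best under the goal."""
    return max(actions, key=lambda a: goal_score(envision(state, a)))

print(choose_action(0, [-1, 2, 5]))  # picks 5, the action moving it closest to 10
```

The argument is meant to apply to anything with this shape, regardless of whether `envision` is implemented by a neural network, a theorem prover, or a chess engine’s search tree.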
So, here’s how I see it: given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, the move reason dictates is not to say “Well, yes, he wasn’t imagining A, but I’m sure A is the only such element” and continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
I don’t think it’s a damning blow against anyone to partially fail to predict 2026 in 2016. Total failure is the normal outcome of futurism; partial failure is a victory.