Buck comments on Jeremy Gillen’s Shortform

Buck 21 Sep 2025 15:55 UTC
7 points
4
So these dis-analogies don’t directly engage with the argument. If they were directly engaging with the first part of the argument, they would be about the predictability of a single training run, rather than the total AI research process.
This seems wrong to me. Which of the things that Will says couldn’t be considered to be about a single training process?
- Buck 21 Sep 2025 16:16 UTC
  30 points
  18
  Parent
  I don’t think that the MIRI book would hold up if you analyzed it with this level of persnicketiness–they were absolutely not precise at the level of distinguishing between the whole development process and single training runs. (Which is arguably fine–they were trying to write a popular book, not trying to persuade super high-context readers of anything!) So this complaint strikes me as somewhat of an isolated demand for rigor.
  - Jeremy Gillen 21 Sep 2025 16:53 UTC
    39 points
    4
    Parent
    I’m not trying to debate or gotcha. I agree that if I tried to do adversarial nitpicking at IABIED I could make it sound equally bad. I found Will’s review convincing, in the sense that it intuitively snapped me into the worldview where the evolutionary analogy isn’t a good argument. I spent the day thinking about it, and I wrote out my own steelman of it that extrapolated details, and re-evaluated whether I thought the original argument was valid, and decided that yeah it still was. This exercise was partially motivated by you saying that your complaints were similar in another comment.
    Then I went through and found the important differences between my steelman-will-beliefs and my actual beliefs, the places where I thought it was locally making a mistake, and wrote them down, and then turned that into this shortform. I framed it as misrepresenting after re-reading chapter 4 to check how my argument matched up. Maybe this was a bad way to write it up. It definitely feels like he’s doing the opposite of steelmanning, not particularly trying to convey a good version of the argument in the book, or understand the coherent worldview that produced it.
    But it’s an honest guess that this is a thing Will is missing (how the evolution analogy should be scoped, and how the other premises are separate from it and also necessary). The guess was constructed without knowing Will or reading much of his other writing, so I admit it’s pretty likely to be wrong, but if so maybe someone will explain how.
    But either way, I figured it’s particularly worth publishing this particular part of the things I wrote today because of how often I hear people misunderstand the evolution analogy.
    - Buck 21 Sep 2025 18:23 UTC
      11 points
      15
      Parent
      I feel like your title for this short-form post is unreasonably aggressive, given what you’re saying here.
      I found your articulation of the structure of the book’s argument helpful and clarifying.
      I’m planning to write something more about this at some point: I think a key issue here is that we aren’t making the kind of arguments where “local validity” is a reliable concept. No-one is trying to make proofs, they’re trying to make defeasible heuristic arguments. Suppose the book makes an argument of the form “Because of argument A, I believe conclusion X. You might have thought that B is a counterargument to A. But actually, because of argument C, B doesn’t work.” If Will thinks that argument C doesn’t work, I think it’s fine for him to summarize this as: “they make an argument mostly around A, and which I don’t think suffices to establish X”.
      - Jeremy Gillen 21 Sep 2025 18:46 UTC
        3 points
        0
        Parent
        You’re right, I edited it.
        That makes sense about local validity.
- Jeremy Gillen 21 Sep 2025 16:18 UTC
  2 points
  0
  Parent
  Sorry I intended single training run to refer to purely running SGD on a training set. As opposed to humans examining the result or intermediate result and making adjustments based on their observations. So at least 2 & 3, as they definitely involve human intervention.
  - Buck 21 Sep 2025 16:31 UTC
    2 points
    2
    Parent
    Whatever. I don’t think that’s a very important difference, and I don’t think it’s fair to call Will’s argument a straw man based on it. I think a very small proportion of readers would confidently interpret the book’s argument the way you did.
    You’re claiming that the book’s argument is only trying to apply to an extremely narrow definition of AI training. You describe it as SGD on a training set; I assume you intend to refer to things like RL on diverse environments, like how R1 was trained. If that’s what the argument is about, it’s really important for the authors to explain how that argument connects to the broader notion of training used in practice, and I don’t remember this happening. I don’t remember them talking carefully about the still broader question of “what happens when you do get to examine results and intermediate results and make adjustments based on observations?”
    - Jeremy Gillen 21 Sep 2025 17:44 UTC
      23 points
      14
      Parent
      The way that the analogy interacts with other assumptions seems crucial. I don’t mean to insult Will, if it helps I also think there are a bunch of strawmen in IABIED. But I think most readers whose attention was drawn to the following quote would understand that the evolution analogy needs to be combined with the other things listed there to conclude that alignment is very difficult.
      “If all the complications were visible early, and had easy solutions, then we’d be saying that if any fool builds it, everyone dies, and that would be a different situation. But when some of the problems stay out of sight? When some complications inevitably go unforeseen? When the AIs are grown rather than crafted, and no one understands what’s going on inside of them?”
      ..
      If that’s what the argument is about, it’s really important for the authors to explain how that argument connects to the broader notion of training used in practice, and I don’t remember this happening. I don’t remember them talking carefully about the still broader question of “what happens when you do get to examine results and intermediate results and make adjustments based on observations?”
      Neither do I, but this doesn’t seem really important for a non-researcher audience. Conditional on the claims that weird goal errors are difficult to understand by examining behaviour, and interventions to patch weird goal errors often don’t generalise well. If you buy those claims, then it’s easy to extrapolate what happens when you examine results and make adjustments based on those.