Whatever. I don’t think that’s a very important difference, and I don’t think it’s fair to call Will’s argument a straw man based on it. I think a very small proportion of readers would confidently interpret the book’s argument the way you did.
You’re claiming that the book’s argument is only trying to apply to an extremely narrow definition of AI training. You describe it as SGD on a training set; I assume you intend that to cover things like RL on diverse environments, as in how R1 was trained. If that’s what the argument is about, it’s really important for the authors to explain how that argument connects to the broader notion of training used in practice, and I don’t remember this happening. Nor do I remember them talking carefully about the still broader question of “what happens when you do get to examine results and intermediate results and make adjustments based on observations?”
The way that the analogy interacts with other assumptions seems crucial. I don’t mean to insult Will; if it helps, I also think there are a bunch of straw men in IABIED. But I think most readers whose attention was drawn to the following quote would understand that the evolution analogy needs to be combined with the other things listed there to conclude that alignment is very difficult.
“If all the complications were visible early, and had easy solutions, then we’d be saying that if any fool builds it, everyone dies, and that would be a different situation. But when some of the problems stay out of sight? When some complications inevitably go unforeseen? When the AIs are grown rather than crafted, and no one understands what’s going on inside of them?”
..
If that’s what the argument is about, it’s really important for the authors to explain how that argument connects to the broader notion of training used in practice, and I don’t remember this happening. I don’t remember them talking carefully about the still broader question of “what happens when you do get to examine results and intermediate results and make adjustments based on observations?”
Neither do I, but this doesn’t seem very important for a non-researcher audience. If you buy the claims that weird goal errors are difficult to understand by examining behaviour, and that interventions to patch weird goal errors often don’t generalise well, then it’s easy to extrapolate what happens when you examine results and make adjustments based on those observations.