Each time we go through the core loop of catching a warning sign for misalignment, adjusting our training strategy to try to avoid it, and training again, we are applying a bit of selection pressure against our bumpers. If we go through many such loops and only then, finally, see a model that can make it through without hitting our bumpers, we should worry that it’s still dangerously misaligned and that we have inadvertently selected for a model that can evade the bumpers.
How severe of a problem this is depends on the quality and diversity of the bumpers. (It also depends, unfortunately, on your prior beliefs about how likely misalignment is, which renders quantitative estimates here pretty uncertain.) If you’ve built excellent implementations of all of the bumpers listed above, it’s plausible that you can run this loop thousands of times without meaningfully undermining their effectiveness.[8] If you’ve only implemented two or three, and you’re unlucky, even a handful of iterations could lead to failure.
This seems like the central problem of this whole approach, and indeed it seems very unlikely to me that we would end up with a system that we feel comfortable scaling to superintelligence after 2-3 iterations on our training protocols. This plan really desperately needs a step that is something like “if the problem appears persistent, or we are seeing signs that the AI systems are modeling our training process in a way that suggests that upon further scaling they would end up looking aligned independently of their underlying alignment, we halt and advocate for much larger shifts in our training process, which likely requires some kind of coordinated pause or stop with other actors”.
That’s part of Step 6!

Or, if we are repeatedly failing in consistent ways, change plans and try to articulate as best we can why alignment doesn’t seem tractable.
I think we probably do have different priors here on how much we’d be able to trust a pretty broad suite of measures, but I agree with the high-level take. Also relevant:
However, we expect it to also be valuable, to a lesser extent, in many plausible harder worlds where this work could provide the evidence we need about the dangers that lie ahead.
Ah, indeed! I think the “consistent” threw me off a bit there and so I misread it on first reading, but that’s good.
Sorry for missing it on first read; I do think that is approximately the kind of clause I was imagining (of course I would phrase things differently and would put an explicit emphasis on coordinating with other actors in ways beyond “articulation”, but your phrasing here is within my bounds of where objections feel more like nitpicking).
Meta: I’m confused and a little sad about the relative upvotes of Habryka’s comment (35) and Sam’s comment (28). I think it’s trending better, but what does it even mean to have a highly upvoted complaint comment based on a misunderstanding, especially one more highly upvoted than the correction?
Maybe people think Habryka’s comment is a good critique even given the correction, even though I don’t think Habryka does?
I interpreted Habryka’s comment as making two points, one of which strikes me as true and important (that it seems hard/unlikely for this approach to allow for pivoting adequately, should that be needed), and the other of which was a misunderstanding (that they don’t literally say they hope to pivot if needed).
(This aligns with what I intended. I feel like my comment is making a fine point, even despite having missed the specific section.)