I might disagree with that. I think my disagreement would be less about technical feasibility, and more about it being actively unhelpful to have the planet united around a goal when approximately none of the supposedly-united people understand what goal it is that they’re supposedly united around.
I think we probably agree. I’m not actually in favor of doing it, even if the political will was there (at least prior to things changing, such as increased theoretical foundation). I’m more looking for people who think it’s clearly near-impossible to train corrigibility into an AI.
… train? One wouldn’t get a corrigible ASI by 2035 by training corrigibility into the thing; that would be far more difficult than building a corrigible ASI at all (which is itself well beyond current understanding).
Ok, fine. I agree we disagree. 😛
I’m asking for people who disagree with me in part because I’m going on Doom Debates to discuss corrigibility with Liron Shapira. Feel like going on and being my foil?
Regardless, I’m curious if you have a solid argument about why one definitely can’t (in practice) land in a corrigibility attractor basin as described here. I’ve talked to Nate and Eliezer about it, but have only managed to glean that they think “anti-naturality” is so strong that no such basin exists in practice, but I have yet to hear actual reasons why they’re so confident.
The anti-naturality problems are an issue, especially if you want to build the thing via standard RL-esque training, but they’re not the first things which will kill you.
The story in the post you link is a pretty standard training story, and runs into the same immediate problems which standard training stories usually run into:
The humans will feed the system incorrect data.
Insofar as the system is capable, it will learn to predict the humans’ own errors, as opposed to the thing the humans intended (and this will get worse as capabilities increase).
Insofar as the incorrect data stems from the humans being bad at understanding what they should ask for, the resulting problems will be systematically hard to notice; things will look correct to the humans.
These problems basically don’t apply in domains where the intended behavior is easily checkable with basically-zero error, like e.g. mathematical proofs or some kinds of programming problems. These problems are most severe in domains where no humans understand very well what behavior they want, which is exactly the case for corrigibility.
So if one follows the training story in the linked post, the predictable result will be a system which behaves in ways which pattern-match to “corrigible” to human engineers/labellers/overseers. But those engineers/labellers/overseers don’t actually understand what corrigibility even is (because nobody currently understands that), so their pattern-matching will be systematically wrong, both in the training data and in the oversight. Crank up the capabilities dial, and that results in a system which is incorrigible in exactly the ways which these human engineers/labellers/overseers won’t detect.
That’s the sort of standard problem which trying to train a corrigible system adds, on top of all the challenges of just building a corrigible system at all.
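To make the worry above concrete, here is a minimal toy sketch (my own illustration, not anything from the linked post): a model fit to labels from a systematically-mistaken labeler ends up reproducing the labeler's judgments, and checking it against that same labeler looks nearly perfect even though agreement with the intended concept is much worse.

```python
# Toy illustration (my own, not from the post): the labeler systematically
# ignores part of what actually matters, the model learns the labeler's
# judgments, and "oversight" by that same labeler sees nothing wrong.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 2))

intended = (X[:, 0] + X[:, 1] > 0).astype(int)   # the behavior we actually want
labeled  = (X[:, 0] > 0).astype(int)             # labeler only looks at feature 0

model = LogisticRegression().fit(X[:10_000], labeled[:10_000])
pred = model.predict(X[10_000:])

print("agreement with labeler (what oversight sees):", (pred == labeled[10_000:]).mean())   # ~1.0
print("agreement with intended concept:            ", (pred == intended[10_000:]).mean())   # ~0.75
```

The point of the toy is that the gap only shows up if you already know the intended concept; the labeler's own checks (the first print) look fine.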
As for how that gets to “definitely can’t”: the problem above means that, even if we nominally have time to fiddle and test the system, iteration would not actually be able to fix the relevant problems. And so the situation is strategically equivalent to “we need to get it right on the first shot”, at least for the core difficult parts (like e.g. understanding what we’re even aiming for).
And as for why that’s hard to the point of de-facto impossibility with current knowledge… try the ball-cup exercise, then consider the level of detailed understanding required to get a ball into a cup on the first shot, and then imagine what it would look like to understand corrigible AI at that level.
Thanks for this follow-up. My basic thoughts on the comment above this one are that while I agree that you definitely can’t get a perfectly corrigible agent on your first try, you might, by virtue of the training data resembling the lab setting, get something that in practice doesn’t go off the rails, and instead allows some testing and iterative refinement (perhaps with the assistance of the AI). So I think “iteration [can/can’t] fix a semi-corrigible agent” is the central crux.
I just read your WWIDF post (upvoted!) and while I agree that the issues you point out are pernicious, I don’t quite feel like they crushed my sense of hope. Unfortunately the disconnect feels a bit wordless inside me at the moment, so I’ll focus on it and see if I can figure out what’s going on.
Would you agree that we have about as much of a handle on what corrigibility is as we do on what an agent is? Like, I claim that I have some knowledge about corrigibility, even though it’s imperfect and I have remaining confusions. And I’m wondering whether you think humanity is deeply confused about what corrigibility even is, or whether you think it’s more like we have a handle on it but can’t quite give its True Name.
More of my thoughts here: https://www.lesswrong.com/posts/txNsg8hKLmnvkuqw4/worlds-where-iterative-design-succeeds
I think I’ve independently arrived at a fairly similar view. I haven’t read your post. But I think the corrigibility basin thing is one of the more plausible and practical ideas for aligning ASIs. The core problem is that you can’t just train your ASI for corrigibility because it will sit and do nothing; you have to train it to do stuff. And then these two training schemes will grate against each other. Which leads to tons of bad stuff happening, e.g. it’s a great way to make your AI a lot more situationally aware. This is an important facet of the “anti-naturality” thing, I think.
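For what it’s worth, here is an oversimplified toy of that “grating” (my own framing, which leans on reading corrigibility-training as rewarding passivity, a reading Max would dispute): one loss pushes an “effort” parameter toward zero, the other pushes it toward completing the task, and joint training settles on a compromise that satisfies neither.

```python
# Oversimplified toy (my own framing): a "stay passive" loss and a "do the
# task" loss pull one shared parameter in opposite directions during training.
target_effort = 1.0   # effort needed to actually complete the task
effort = 0.0          # the trained parameter
lr = 0.1

for _ in range(200):
    grad_passivity = 2 * effort                    # gradient of effort**2, pushes toward 0
    grad_task = 2 * (effort - target_effort)       # pushes toward target_effort
    effort -= lr * (grad_passivity + grad_task)    # joint objective: the two gradients fight

print(round(effort, 3))   # ~0.5: neither passive nor getting the task done
```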
you can’t just train your ASI for corrigibility because it will sit and do nothing
I’m confused. That doesn’t sound like what Max means by corrigibility. A corrigible ASI would respond to requests from its principal(s) as a subgoal of being corrigible, rather than just sit and do nothing.
Or did you mean that you need to do some next-token training in order to get it to be smart enough for corrigibility training to be feasible? And that next-token training conflicts with corrigibility?
Okay, sorry about this. You are right. I have thought up a somewhat nuanced view about how prosaic corrigibility could work, and I kind of just assumed that it was the same as what Max had because he uses a lot of the same keywords I use when I think about this, but after actually reading the CAST article (well, parts 0 and 1), I realize we have really quite different views.