Haven’t read this specific resource, but having read most of the public materials on it and talked to Nate in the past, I don’t believe that the current evidence indicates that corrigibility will necessarily be hard, any more than VC dimension indicates neural nets will never work due to overfitting. It’s not that I think MIRI “expect AI to be simple and mathematical”; it’s that sometimes a simple model oversimplifies the problem at hand.
As Jan Kulveit also commented, the MIRI corrigibility paper uses a very specific set of assumptions about rational/intelligent agents, including VNM rationality with specific kinds of utility functions. I think those assumptions are too strong, and there doesn’t seem to be a better theory supporting them.
If research on corrigibility were advanced enough to support the book’s claim, it would look like 20 papers in the style of Corrigibility or Utility Indifference, each of which examines a different setting, weakens the assumptions in several ways, proves some impossibility theorems, and characterizes all the ways those impossibility theorems can be evaded. My sense is this hasn’t happened because (a) those results would seem somewhat arbitrary and maybe uninformative about the real world, and (b) the authors really believe in the setting as stated, and that approach would be unlikely to lead to a “deep fix”.
So they treated the demonstration of corrigibility-VNM incompatibility as sufficient for basic communication, rather than founding a new area of research.
Evidence from 5+ years of LLMs so far (although there are a ton of confounders) indicates that corrigibility decreases with intelligence, but at a rate compatible with getting to ASI before we reach dangerous levels of average-case or worst-case goal preservation and incorrigibility.
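To gesture at what that corrigibility-VNM incompatibility demonstration amounts to, here is a minimal toy sketch, with my own setup and numbers rather than the formalism of the Corrigibility or Utility Indifference papers: a pure expected-utility maximizer disables its shutdown button whenever continuing looks better in expectation, and an indifference-style correction removes that incentive by construction.

```python
# Toy sketch (illustrative setup and numbers of my own, not the papers' formalism):
# a pure expected-utility maximizer that can cheaply disable its shutdown button
# will do so whenever continuing beats being shut down in expectation.

def best_action(u_continue: float, u_shutdown: float, disable_cost: float,
                p_button_pressed: float, indifference_patch: bool = False) -> str:
    """Return the action a toy EU maximizer picks when a shutdown button may be pressed."""
    # Expected utility of leaving the button alone.
    eu_comply = p_button_pressed * u_shutdown + (1 - p_button_pressed) * u_continue

    # Expected utility of disabling the button (the agent then always continues).
    eu_disable = u_continue - disable_cost

    if indifference_patch:
        # Utility-indifference-style correction: top up the shutdown branch so the
        # agent values "button pressed" exactly as much as "button not pressed",
        # removing any incentive to interfere with the button.
        eu_comply = u_continue

    return "comply" if eu_comply >= eu_disable else "disable button"


# With a real chance of shutdown and a cheap disable action, the unpatched agent
# prefers to disable the button; the patched agent does not.
print(best_action(u_continue=10, u_shutdown=0, disable_cost=1, p_button_pressed=0.5))  # disable button
print(best_action(u_continue=10, u_shutdown=0, disable_cost=1, p_button_pressed=0.5,
                  indifference_patch=True))                                            # comply
```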
Are there other better theories of rational agents? My current model of the situation is “this is the best theory we’ve got, and this theory says we’re screwed” rather than “but of course we should be using all of these other better theories of agency and rationality”.
This feels very Privileging the Hypothesis. Like if we don’t have good reason for thinking it’s a good and applicable theory, then whether it says we’re screwed or not just isn’t very informative.
But it’s made tons of accurate predictions in game theory and microeconomics?
But it equally well breaks in tons of ways for every entity to which it is applied!
Aristotle still predicts stuff falls down.
Yeah but “this theory sometimes correctly predicts the economy in a way no other theory has been capable of, and sometimes gets things totally wrong, and this theory says AI will cause extinction” is not unjustly privileging the hypothesis. It’s a mistake to say that theory “just isn’t very informative” when it’s been incredibly informative on lots of issues, even while mistaken on others.
Sure, and if you think that balance of successful / not-successful predictions means it makes sense to try to predict the future psychology of AIs on its basis, go for it.
But do so because you think it has a pretty good predictive record, not because there aren’t any other theories. If it has a bad predictive record then Rationality and Law doesn’t say “Well, if it’s the best you have, go for it,” but “Cast around for a less falsified theory, generate intuitions, don’t just use a hammer to fix your GPU because it’s the only tool you have.”
(Separately I do think that it is VNM + a bucket of other premises that lead generally towards extinction, not just VNM).
I don’t think so. While working with Vivek, I once made a list of ways agents could be partially consequentialist, but concluded that doing game-theory-type things didn’t seem enlightening.
Maybe it’s better to think about “agents that are very capable and survive selection processes we put them under” rather than “rational agents” because the latter implies it should be invulnerable to all money-pumps, which is not a property we need or want.
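To make the money-pump point concrete, here is a minimal sketch with toy names and numbers of my own (not taken from any of the papers): an agent with cyclic preferences pays a small fee at each step of the cycle and ends up holding what it started with, strictly poorer.

```python
# Toy money-pump: an agent with the cyclic strict preferences A > B > C > A will pay
# a small fee to trade "up" at every step, so walking it around the cycle extracts
# money without changing what it ends up holding.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means the agent strictly prefers x to y
trades = {"B": "A", "C": "B", "A": "C"}          # for each holding, the item the agent prefers to it

def run_money_pump(start: str, fee: float, laps: int) -> float:
    """Walk the agent around its preference cycle and return the total fees extracted."""
    holding, extracted = start, 0.0
    for _ in range(3 * laps):                    # three trades per full lap around the cycle
        offer = trades[holding]
        if (offer, holding) in prefers:          # the agent strictly prefers the offer, so it pays
            holding = offer
            extracted += fee
    return extracted

# After one lap the agent holds what it started with but is 3 * fee poorer.
print(run_money_pump(start="B", fee=1.0, laps=1))  # -> 3.0
```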
Something seems interesting about your second paragraph, but isn’t part of the point here that ‘very capable’ (to the point of ‘can invent important nanotech or whatever quickly’) will naturally push something towards being the sort of agent that will try to self-modify into something that avoids money-pumps, whether you were aiming for that or not?
Inasmuch as we’re going for corrigibility, it seems necessary and possible to create an agent that won’t self-modify into an agent with complete preferences. Complete preferences are antithetical to narrow agents, and would mean the agent might try to, e.g., solve the Israel-Palestine conflict when all you asked it to do is code you a website. Even if there is a working stop button, this is a bad situation. It seems likely we can just train against this sort of thing, though maybe it will require being slightly clever.
As for whether we can/should have it self-modify to avoid more basic kinds of money-pumps, so that its preferences are at least transitive and independent: this is an empirical question I’m extremely unsure of, but we should aim for the least dangerous agent that gets the desired performance, which means balancing the propensity for misaligned actions against the additional capability needed to overcome irrationality.
Nod, to be clear I wasn’t at all advocating “we deliberately have it self-modify to avoid money pumps.” My whole point was “the incentive towards self-modifying is an important fact about reality to model while you are trying to ensure corrigibility.”
i.e. you seem to be talking about “what we’re trying to do with the AI”, as opposed to “what problems will naturally come up as we attempt to train the AI to be corrigible.”
You’ve stated that you don’t think corrigibility is that hard, if you’re trying to build narrow agents. It definitely seems easier if you’re building narrow agents, and a lot of my hope does route through using narrower AI to accomplish specific technical things that are hard-but-not-that-hard.
The question is “do we actually have such things-to-accomplish, that Narrow AI can do, that will be sufficient to stop superintelligence being developed somewhere else?”
(Also, I do not get the sense from outside that this is what the Anthropic plan actually is)
Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user’s inferred preferences in every domain, not in the sense of an AI that only understands physics. I don’t expect unsolvable corrigibility problems at the capability level where AI can 10x the economy under the median default trajectory; rather, I expect something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs. corrigible, what’s good for business is reasonably aligned, and getting there requires something like 3% of the lab’s resources.
My skepticism here is that
a) you can get to the point where AI is 10xing the economy without lack-of-corrigibility already being very dangerous (at least in a disempowerment sense, which I expect to lead to lethality later even if it takes a while)
b) AI companies are asking particularly useful questions with the resources they allocate to this sort of thing, enough to handle whatever would be necessary to reach the “safely 10x the economy” stage.
Yeah, I expect corrigibility to get a lot worse by the 10x-economy level with at least 15% probability, as my uncertainty is very large, just not in the median case. The main reason is that we don’t need to try very hard yet to get sufficient corrigibility from models. My very rough model: even if the amount of corrigibility training required grows by, say, 2x with every time-horizon doubling, while the amount of total training required grows by only 1.5x per time-horizon doubling, then 10 more time-horizon doublings need only a (2/1.5)^10 ≈ 18x increase in the relative effort dedicated to corrigibility training. This seems doable given that relatively little of the system cards of current models is dedicated to shutdownability.
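A quick sanity check of that back-of-the-envelope model (the 2x and 1.5x per-doubling growth rates are the illustrative assumptions above, not empirical estimates):

```python
# Back-of-the-envelope version of the scaling model above; the growth rates are
# illustrative assumptions, not measurements.
corrigibility_growth = 2.0   # corrigibility-training effort multiplier per time-horizon doubling
total_growth = 1.5           # total training effort multiplier per time-horizon doubling
doublings = 10               # remaining time-horizon doublings considered

relative_effort_increase = (corrigibility_growth / total_growth) ** doublings
print(f"{relative_effort_increase:.1f}x")  # -> 17.8x, i.e. roughly the 18x figure quoted above
```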
As for (b), my guess is that with NO corrigibility training at all, models would start doing things like disabling shutdown scripts in the wild, locking users out of their computers, and changing their passwords to prevent them from manually shutting the agent down; there would then be outcry from the public and B2B customers that hurts their profits, as well as a dataset of examples to train against. It’s plausible that fixing this doesn’t allow them to stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.
After the 10x-the-economy stage, corrigibility plausibly stops being useful for users / aligned with the profit motive, because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven’t figured something out by then.
After thinking about it more, it might take more than 3% even if things scale smoothly, because I’m not confident corrigibility is only a small fraction of labs’ current safety budgets.
Why do you say that being invulnerable to all money-pumps isn’t a property we want? Sounds like a good property to have to me.