As Jan Kulveit also commented, the MIRI corrigibility paper uses a very specific set of assumptions about rational/intelligent agents, including VNM rationality with specific kinds of utility functions. I think these assumptions are too strong, and there doesn't seem to be a better theory supporting them.
Are there other better theories of rational agents? My current model of the situation is “this is the best theory we’ve got, and this theory says we’re screwed” rather than “but of course we should be using all of these other better theories of agency and rationality”.
I don't think so. While working with Vivek I once made a list of ways agents could be partially consequentialist, but concluded that doing game-theory-type things didn't seem enlightening.
Maybe it's better to think about "agents that are very capable and survive the selection processes we put them under" rather than "rational agents", because the latter implies the agent should be invulnerable to all money-pumps, which is not a property we need or want.
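To make the money-pump point concrete, here is a minimal sketch of the dynamic being referred to: an agent with cyclic (intransitive) preferences that will pay a small fee for each "upgrade" can be traded in a circle and bled indefinitely. The preference cycle, the fee, and the loop count are purely illustrative assumptions, not anything from this thread.

```python
# Toy money-pump: cyclic strict preferences A > B > C > A, where the agent
# pays a small fee whenever it trades up to an item it strictly prefers.
FEE = 1
prefers_over = {"A": "B", "B": "C", "C": "A"}  # key is strictly preferred to value

def accepts_trade(current, offered):
    # Agent accepts iff the offered item is strictly preferred to what it holds.
    return prefers_over.get(offered) == current

holding, money = "A", 100
for _ in range(6):
    # The trader always offers the item the agent prefers to its current holding.
    offer = next(x for x, worse in prefers_over.items() if worse == holding)
    if accepts_trade(holding, offer):
        holding, money = offer, money - FEE

print(holding, money)  # "A" 94 -- back where it started, 6 units poorer
```

The claim above is that we don't need agents to be immune to every such pump, only capable enough to survive the selection processes we actually put them under.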
Something seems interesting about your second paragraph, but isn't part of the point here that 'very capable' (to the point of 'can invent important nanotech or whatever quickly') will naturally push something towards being the sort of agent that will try to self-modify into something that avoids money-pumps, whether you were aiming for that or not?
Inasmuch as we're going for corrigibility, it seems necessary and possible to create an agent that won't self-modify into an agent with complete preferences. Complete preferences are antithetical to narrow agents, and would mean the agent might try to, e.g., solve the Israel-Palestine conflict when all you asked it to do is code you a website. Even if there is a working stop button, this is a bad situation. It seems likely we can just train against this sort of thing, though maybe it will require being slightly clever.
As for whether we can/should have it self-modify to avoid more basic kinds of money-pumps, so that its preferences are at least transitive and independent, this is an empirical question I'm extremely unsure of; but we should aim for the least dangerous agent that gets the desired performance, which means balancing the propensity for misaligned actions against the additional capability needed to overcome irrationality.
Nod, to be clear I wasn’t at all advocating “we deliberately have it self-modify to avoid money pumps.” My whole point was “the incentive towards self-modifying is an important fact about reality to model while you are trying to ensure corrigibility.”
i.e. you seem to be talking about “what we’re trying to do with the AI”, as opposed to “what problems will naturally come up as we attempt to train the AI to be corrigible.”
You've stated that you don't think corrigibility is that hard, if you're trying to build narrow agents. It definitely seems easier if you're building narrow agents, and a lot of my hope does route through using narrower AI to accomplish specific technical things that are hard-but-not-that-hard.
The question is “do we actually have such things-to-accomplish, that Narrow AI can do, that will be sufficient to stop superintelligence being developed somewhere else?”
(Also, I do not get the sense from outside that this is what the Anthropic plan actually is)
Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user's inferred preferences in every domain, not in the sense of AI that only understands physics. Under the median default trajectory, I don't expect unsolvable corrigibility problems at the capability level where AI can 10x the economy; rather, I expect something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs. corrigible, what's good for business is reasonably well aligned, and getting there requires something like 3% of the lab's resources.
My skepticism here is that:
a) you can get to the point where AI is 10xing the economy without lack of corrigibility already being very dangerous (at least in the disempowerment sense, which I expect to lead to lethality later even if it takes a while), and
b) AI companies are asking particularly useful questions with the resources they allocate to this sort of thing, enough to handle whatever would be necessary to reach the "safely 10x the economy" stage.
Yeah, I expect corrigibility to get a lot worse by the 10x-economy level with at least 15% probability, since my uncertainty is very large; it's just not the median case. The main reason is that we don't yet need to try very hard to get sufficient corrigibility from models. My very rough model: even if the amount of corrigibility training required grows by, say, 2x with every time-horizon doubling, while the amount of total training required grows by only 1.5x per doubling, then 10 more time-horizon doublings imply only a (2/1.5)^10 ≈ 18x increase in the relative effort dedicated to corrigibility training. This seems doable given how little of the system cards of current models is dedicated to shutdownability.
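For concreteness, here is the back-of-envelope arithmetic from the previous paragraph; the 2x and 1.5x growth rates are the rough assumptions stated above, not measured values.

```python
# Relative effort needed for corrigibility training after N time-horizon doublings,
# under the assumed growth rates from the comment above.
corrigibility_growth = 2.0   # assumed: corrigibility training cost grows 2x per time-horizon doubling
total_growth = 1.5           # assumed: total training cost grows 1.5x per time-horizon doubling
doublings = 10

relative_increase = (corrigibility_growth / total_growth) ** doublings
print(f"{relative_increase:.1f}x")  # ~17.8x, i.e. the ~18x figure quoted above
```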
As for (b), my guess is that with NO corrigibility training at all, models would start doing things like disabling shutdown scripts in the wild, or locking users out of their computers and changing their passwords to prevent them from manually shutting the agent down, and there would be an outcry from the public and B2B customers that hurts companies' profits, as well as a dataset of examples to train against. It's plausible that fixing this doesn't allow them to stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.
After the 10x-the-economy stage, corrigibility plausibly stops being useful for users / aligned with the profit motive, because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven't figured something out by then.
After thinking about it more, it might take more than 3% even if things scale smoothly, because I'm not confident corrigibility is only a small fraction of labs' current safety budgets.
Why do you say it isn’t a property we want? Sounds like a good property to have to me.