Thomas Kwa comments on Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)

Thomas Kwa 2 Oct 2025 0:11 UTC
2 points
0
Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user’s inferred preferences in every domain, not in the sense of AI that only understands physics. In the median default trajectory, I don’t expect unsolvable corrigibility problems at the capability level where AI can 10x the economy. I expect something more like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs. corrigible, what’s good for business is reasonably aligned with corrigibility, and getting there requires something like 3% of the lab’s resources.
- Raemon 2 Oct 2025 20:41 UTC
  2 points
  0
  My skepticism here is that
  a) you can get to the point where AI is 10xing the economy without lack of corrigibility already being very dangerous (at least in a disempowerment sense, which I expect to lead to lethality later, even if it takes a while)
  b) AI companies are asking particularly useful questions with the resources they allocate to this sort of thing, enough to handle whatever would be necessary to reach the “safely 10x the economy” stage.
  - Thomas Kwa 2 Oct 2025 21:16 UTC
    4 points
    0
    Yeah, I expect corrigibility to get a lot worse by the 10x economy level with at least 15% probability, as my uncertainty is very large, just not in the median case. The main reason is that we don’t need to try very hard yet to get sufficient corrigibility from models. My very rough model is that even if the amount of corrigibility training required grows, say, 2x per time horizon doubling, while the total amount of training required grows 1.5x per time horizon doubling, we will get 10 more time horizon doublings with only a (2/1.5)^10 ≈ 18x increase in relative effort dedicated to corrigibility training. This seems doable given that relatively little of the system cards of current models is dedicated to shutdownability.
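A minimal sketch of the arithmetic in the rough model above, assuming the illustrative 2x and 1.5x per-doubling growth rates and the 10 doublings given in the comment. The function name and parameters are hypothetical, chosen only for this example.

```python
def relative_effort_multiplier(
    corrigibility_growth: float = 2.0,  # assumed growth in corrigibility training per doubling
    total_growth: float = 1.5,          # assumed growth in total training per doubling
    doublings: int = 10,                # number of further time horizon doublings
) -> float:
    """Factor by which corrigibility training grows relative to total training."""
    return (corrigibility_growth / total_growth) ** doublings


multiplier = relative_effort_multiplier()
print(f"{multiplier:.1f}x")  # about 17.8x, which rounds to the 18x quoted above
```

The result is sensitive to the ratio of the two growth rates, so the 18x figure is only as good as those two guesses.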
    As for (b), my guess is that with NO corrigibility training at all, models would start doing things in the wild like disabling shutdown scripts, locking users out of their computers, and changing their passwords to prevent them from manually shutting the agent down. There would be an outcry from the public and B2B customers that hurts the companies’ profits, as well as a dataset of examples to train against. It’s plausible that fixing this doesn’t allow them to stop disempowerment. Maybe a crux is whether naively training for corrigibility is more effective or less effective on more egregious incidents.
    After the 10x-the-economy stage, corrigibility plausibly stops being useful for users and aligned with the profit motive, because humans will largely stop understanding what models are doing, so I get a lot more worried if we haven’t figured something out by then.
- Thomas Kwa 2 Oct 2025 20:32 UTC
  2 points
  0
  After thinking about it more, it might take more than 3% even if things scale smoothly, because I’m not confident corrigibility is only a small fraction of labs’ current safety budgets.