Hi,
sorry for commenting without having read most of your post. I just started reading it and thought “isn’t this exactly what the corrigibility agenda is/was about?”, and your “relation to other agendas” section doesn’t mention corrigibility, so I thought I’d just ask whether you’re familiar with it and how your approach differs. (Though tbc, I could be totally misunderstanding, I didn’t read far.)
Tbc I think further work on corrigibility is very valuable, but if you haven’t looked into it much I’d suggest reading up on what other people have written about it so far. (I’m not sure whether there are very good explainers, and people sometimes get a wrong impression of what corrigibility is about. E.g. corrigibility has nothing to do with “corrigibly aligned” from the “Risks from Learned Optimization” paper, and the shutdown problem is often misunderstood too. I would recommend reading and trying to understand the stuff MIRI wrote about it. Parts of this conversation might also be helpful, though sorry, it’s not written in a nice format that explains everything clearly.)
“not building unbounded, non-interruptible optimizers and, instead, building some other, safer, kind of AI that can be demonstrated to deliver enough value to make up for giving up on the business-as-usual kind of AI along with the benefits it was expected to deliver (that ‘we’, though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk).”

This sounds to me like you’re imagining that nobody building more powerful AIs is an option once we’ve already gotten a lot of value from AI (where I don’t know what level of capability you imagine concretely)? If the world were that reasonable, we wouldn’t be rushing ahead with our abysmal understanding of AI in the first place, because the risks so obviously outweigh the benefits. Also, you don’t just need to convince the leading labs: progress will continue, soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.
The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
The (revealed) perception of risks and benefits depends on many things, including what kind of AI is available/widespread/adopted. Perhaps we can tweak those parameters. (Not claiming that it’s going to be easy.)
Something in this direction, yes.
There’s both “attempt to get coherent corrigibility” and “try to deploy corrigibility principles and keep the AI bounded enough to do a pivotal act”. I think the latter approach is the main one MIRI imagines, after having failed to find a simple coherent description/utility function for corrigibility. (Here it would e.g. be ideal if the AI only needed to reason very well in a narrow domain, without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually, the attempt to get coherent corrigibility was also aimed at a bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI could have a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (and decently large amounts are needed for doing a pivotal act).
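To make “formulate utility functions for corrigibility” concrete: a rough sketch of the kind of construction explored in MIRI’s “Corrigibility” paper (Soares et al. 2015) for the shutdown problem, with the notation mine and simplified, is

$$U(a_1, a_2) = \begin{cases} U_N(a_1, a_2) & \text{if the shutdown button is not pressed} \\ U_S(a_1, a_2) + \theta(a_1) & \text{if the shutdown button is pressed} \end{cases}$$

where $U_N$ is the “normal” utility function, $U_S$ rewards shutting down, and $\theta(a_1)$ is a correction term chosen so that the agent’s expected utility is the same whether or not the button ends up pressed, making it exactly indifferent to the press and removing the incentive to manipulate the button. The paper walks through why this “utility indifference” construction still breaks, e.g. the indifferent agent also has no incentive to keep the button functional or to build the shutdown behavior into successor agents, which is part of why no simple coherent utility function for corrigibility was found.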
And I’d count “training for corrigible behavior/thought patterns in the hopes that the underlying optimization isn’t powerful enough to break those patterns” into that bucket as well, though yeah, MIRI doesn’t talk that much about that.
About getting coherent corrigibility: my and Joar’s post on Updating Utility Functions makes some progress on a soft form of corrigibility.