The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that aren’t powerful enough for us to need them to be fully corrigible.
There are two approaches: “attempt to get coherent corrigibility” and “deploy corrigibility principles and keep the AI bounded enough to do a pivotal act”. I think the latter is the main one MIRI imagines after having failed to find a simple coherent description/utility function for corrigibility. (Here it would e.g. be ideal if the AI only needed to reason very well in a narrow domain, without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually, the attempt to get coherent corrigibility was also aimed at a bounded AI doing a pivotal act. But there, people were trying to formulate utility functions so that the AI has a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (and decently large amounts are needed for doing a pivotal act).
And I’d also count “training for corrigible behavior/thought patterns in the hope that the underlying optimization isn’t powerful enough to break those patterns” into that bucket, though MIRI doesn’t talk about that much.
On getting coherent corrigibility, my and Joar’s post on Updating Utility Functions makes some progress toward a soft form of corrigibility.