The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
This sounds to me like you’re imagining that nobody building more powerful AIs is an option once we’ve already gotten a lot of value from them (where I don’t really know what level of capability you imagine concretely)? If the world were that reasonable, we wouldn’t rush ahead with our abysmal understanding of AI anyway, because the risks would obviously outweigh the benefits. Also, you don’t just need to convince the leading labs: progress will continue, and soon enough many, many actors will be able to create unaligned powerful AI, and someone will.
The (revealed) perception of risks and benefits depends on many things, including what kind of AI is available/widespread/adopted. Perhaps we can tweak those parameters. (Not claiming that it’s going to be easy.)
I think the right framing of the bounded/corrigible-agent agenda is that it is aiming toward a pivotal act.
The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
There’s both “attempt to get coherent corrigibility” and “try to deploy corrigibility principles and keep the AI bounded enough to do a pivotal act”. I think the latter approach is the main one MIRI imagines, after having failed to find a simple coherent description/utility function for corrigibility. (Here it would e.g. be ideal if the AI only needed to reason very well in a narrow domain, without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)

EDIT: Actually, the attempt to get coherent corrigibility was also aimed at a bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI could have a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (and decently large amounts are needed for doing a pivotal act).
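For concreteness, here is a minimal sketch of the kind of shape that was attempted (roughly the utility-indifference style proposal discussed in MIRI’s corrigibility work; the notation and the exact form of the correction term are my simplification, not a quotation). The agent takes an action $a_1$, observes whether the shutdown button is pressed, then takes a second action $a_2$:

$$
U(a_1, o, a_2) =
\begin{cases}
U_N(a_1, o, a_2) & \text{if } o = \text{no press},\\[2pt]
U_S(a_1, o, a_2) + \theta(a_1) & \text{if } o = \text{press},
\end{cases}
$$

where $U_N$ is the “normal” utility function, $U_S$ rewards shutting down, and $\theta(a_1)$ is a compensation term chosen so that the agent’s expected utility given $a_1$ is the same whether or not the button is pressed, removing any incentive to cause or prevent the press. The recurring problem is that no choice of $\theta$ (or of the case split itself) found so far keeps behaving sensibly under large amounts of optimization power; as I recall, the indifferent agent, for instance, also has no particular incentive to keep the shutdown mechanism functional.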
And I’d also count “training for corrigible behavior/thought patterns in the hope that the underlying optimization isn’t powerful enough to break those patterns” into that bucket, though MIRI doesn’t talk about that one much.
The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
The (revealed) perception of risks and benefits depends on many things, including what kind of AI is available/widespread/adopted. Perhaps we can tweak those parameters. (Not claiming that it’s going to be easy.)
Something in this direction, yes.
There’s both “attempt to get coherent corrigibility” and “try to deploy corrigibility principles and keep the AI bounded enough to do a pivotal act”. I think the latter approach is the main one MIRI imagines, after having failed to find a simple coherent description/utility function for corrigibility. (Here it would e.g. be ideal if the AI only needed to reason very well in a narrow domain, without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually, the attempt to get coherent corrigibility was also aimed at a bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI could have a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (and decently large amounts are needed for doing a pivotal act).
And I’d also count “training for corrigible behavior/thought patterns in the hope that the underlying optimization isn’t powerful enough to break those patterns” into that bucket, though MIRI doesn’t talk about that one much.
About getting coherent corrigibility: my and Joar’s post on Updating Utility Functions makes some progress on a soft form of corrigibility.
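For readers who haven’t seen the post, one way to gesture at what a soft form of corrigibility can look like (an illustrative sketch in that spirit, not a restatement of the post’s exact construction): the agent acts on a mixture over candidate utility functions and keeps revising the mixture weights on designated corrective feedback,

$$
U_t = \sum_i w_{i,t}\, U_i, \qquad w_{i,t+1} \propto w_{i,t}\,\Pr(f_t \mid U_i \text{ is intended}),
$$

so that instead of merely tolerating correction, the agent expects its effective goals to change as feedback $f_t$ comes in. How much corrigibility this actually buys depends on the feedback model and on how much optimization pressure is applied before the weights converge.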