Some people believe that the best approach is to define “safe alignment towards any goal” as the first step, and then insert the “goal” once this is done. I tend to think that this doesn’t make sense, and we need to build the values up as we define alignment.
How would you categorize corrigibility in this framework? To me, finding ways to imbue powerful AI systems with principles of corrigibility looks like a promising “incremental” approach. The end goal of such an approach might be to use a powerful but corrigible system pivotally, but many of the properties seem valuable for safety and flourishing in other contexts.
Eliezer!corrigibility is very much not about making the corrigible system understand or maximize human values, beyond “not being disassembled for our atoms prematurely”. But it still feels more like a positive, incremental approach to building safe systems.
Having done a lot of work on corrigibility, I believe that it can’t be implemented in a value-agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm and less widely shared.