I wasn’t making a definitive statement on what I think people mean when they say “corrigibility”, to be clear. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the “faithfulness” component — i. e., the AI would have to interpret its values/tasks/orders the way they were intended by the order-giver, instead of some other way. Which, in turn, likely requires somehow making it locate humans in its world-model (though likely implemented as “locate the model of [whoever is giving me the order]” in the AI’s utility function, not necessarily referring to [humans] specifically).
And building off that definition, if “value learning” is supposed to mean something different, then I’d define it as pointing at human values not through humans, but directly. I. e., making the AI value the same things that humans value not because it knows that it’s what humans value, but just because.
Again, I don’t necessarily think that it’s what most people mean by these terms most times — I would natively view both approaches to this as something like “value learning” as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I’d carve it under these two constraints.
I wasn’t making a definitive statement on what I think people mean when they say “corrigibility”, to be clear. The point I was making is that any implementation of corrigibility that I think is worth trying for necessarily has the “faithfulness” component — i. e., the AI would have to interpret its values/tasks/orders the way they were intended by the order-giver, instead of some other way. Which, in turn, likely requires somehow making it locate humans in its world-model (though likely implemented as “locate the model of [whoever is giving me the order]” in the AI’s utility function, not necessarily referring to [humans] specifically).
And building off that definition, if “value learning” is supposed to mean something different, then I’d define it as pointing at human values not through humans, but directly. I. e., making the AI value the same things that humans value not because it knows that it’s what humans value, but just because.
Again, I don’t necessarily think that it’s what most people mean by these terms most times — I would natively view both approaches to this as something like “value learning” as well. But this discussion started from John (1) differentiating between them, and (2) viewing both approaches as viable. This is just how I’d carve it under these two constraints.