This is a nice idea. I think it’d need alterations before it became a useful tool (if I’m understanding clearly, and not missing applications of the unaltered version), but it has potential.
[[Note: I haven’t looked in any detail at tailcalled’s comments/post, since I wanted to give my initial impressions first; apologies for any redundancy]]
Thoughts:
There’s an Anna Karenina issue: All happiness-inducing AI policies are alike; each unhappiness-inducing policy induces unhappiness in its own way.

In some real-world situation, perhaps there are 10^30 good AI policies and 10^50 bad ones. A corrigibility measure that can be near-maximized by allowing us complete control over which of the bad policies we get (and no option to get a good policy) isn’t great. Intuitively, getting to choose among any of the good policies is much more corrigible than getting to choose among any of the bad ones—but as things stand (IIUC), CorrigibilityPM says that the former is much less corrigible than the latter.

I think there needs to be some changes-we-actually-care-about weighting. Probably that looks like bringing in human utility and/or a prior over which changes the human might want.
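To make the counting failure concrete, here's a toy sketch (entirely my construction - the policy sets, numbers, and `utility` function are stand-ins, not anything from the post): a measure that just counts accessible policies ranks the all-bad set above the all-good one, while a utility-weighted variant gets the ordering right.

```python
# Toy illustration: a corrigibility measure that only counts how many
# policies the human can reach, vs. one weighted by how much we'd value
# reaching each policy.

def count_measure(accessible_policies):
    """Corrigibility as the raw number of policies the human can reach."""
    return len(accessible_policies)

def weighted_measure(accessible_policies, utility):
    """Corrigibility as total utility available across reachable policies."""
    return sum(utility(p) for p in accessible_policies)

# Stand-in policy sets: many distinguishable bad policies, fewer good ones.
good = [("good", i) for i in range(1000)]    # think 10^30 in the real case
bad = [("bad", i) for i in range(100000)]    # think 10^50

utility = lambda p: 1.0 if p[0] == "good" else 0.0

# Counting prefers full control over the bad set: 100000 > 1000.
print(count_measure(bad), count_measure(good))
# Weighting reverses it: 0.0 < 1000.0.
print(weighted_measure(bad, utility), weighted_measure(good, utility))
```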
I think we need something more continuous than a set of “cognitively accessible” policies. This should allow dealing with manipulation that shifts human policies to be more/less accessible.

Perhaps it makes sense to put this all in terms of the expected utility cost of [finding and switching to a policy] - i.e. accessibility = 1/(expected cost).

So e.g. a policy that takes much longer to find is less accessible, as is one which entails the human shooting themselves in the foot (perhaps it’s cleanest to think of the former as first picking a [search-for-a-better-policy] policy, so that there’s no need to separate out the search process).
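A minimal sketch of what I mean, with made-up names and costs (`accessibility`, `expected_search_cost`, `expected_switch_cost` are my inventions for illustration):

```python
# Continuous accessibility: the reciprocal of the expected utility cost of
# finding a policy and switching to it, rather than a binary in/out of a set.

def accessibility(expected_search_cost, expected_switch_cost, eps=1e-9):
    """accessibility = 1/(expected cost); eps guards the zero-cost case."""
    total_cost = expected_search_cost + expected_switch_cost
    return 1.0 / (total_cost + eps)

# A policy that takes much longer to find (high search cost), or that harms
# the human on the way (high switch cost), scores as less accessible - and
# manipulation that raises these costs shows up as a smooth drop rather than
# a discrete removal from the accessible set.
print(accessibility(1.0, 0.5))      # easy to reach: ~0.667
print(accessibility(100.0, 50.0))   # hard to reach: ~0.0067
```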
I suppose you preferred not to involve expected utility much (??), but I think in not doing so you end up implicitly assuming indifference on many questions we strongly care about (or rather, you end up with a measure that we’d only find useful if we were indifferent on such questions).
Oh and of course your non-obstruction does much better at capturing what we care about. It’s not yet clear to me whether some adapted version of CorrigibilityPM gets at something independently useful. Maybe.
[I realize that you’re aiming to get at something different here—but so far I’m not clear on a context where I’d be interested in CorrigibilityPM as more than a curiosity]