Thanks for writing this; clarifying assumptions seems very helpful for reducing miscommunication about CIRL (in)corrigibility.
Non-exhaustive list of things I agree with:
Which assumptions you make has a big impact, and making unrealistic ones leads to misleading results.
Relaxing the assumptions of the original OSG the way you do moves it much closer to being realistic (to me, it looks like the version in the OSG paper assumes away most of what makes the shutdown problem difficult).
We want something between “just maximize utility” and “check in before every minor action”, and currently don’t have a good formulation of that.
Things I’d add/change (not sure if all of these are actual disagreements):
I would focus more on the assumption that there is no model misspecification, as opposed to the rationality assumption. Assuming that H is rational and that R knows this is one especially nice way to have no model misspecification. But I think things are still mostly fine if H is irrational, as long as R knows exactly in which way H is irrational (i.e. our model p(human actions | human values) is correct, and our prior over human values has sufficient support). What causes issues in the fully updated deference limit is that this model and/or the value prior is misspecified. (I’ve put a small numerical sketch of this point at the end of this comment.)
I’m not convinced by the fix for irrationality you propose (ensuring that R always asks about atomic actions). Even setting aside scalability (human feedback is expensive), I don’t think it’s safe to assume humans are rational when they only evaluate small, simple actions. What would these small actions that are safe to evaluate look like? As a silly example, asking humans to evaluate bits sent over an internet connection doesn’t seem safe, in a regime where the AI is more competent than humans at lots of things and so understands the consequences of those bit strings being sent better than humans do.
While I agree that there’s something like a corrigibility spectrum, I don’t think (slight modifications of) CIRL belong to this “class of near-corrigible behaviors”. (Not actually sure whether you think CIRL is “near-corrigible” in some important sense or not.)
The way I think about this: what CIRL gives you is a single level of corrigibility, or “object-level corrigibility”. I.e. if you compare to just optimizing a fixed value function, a CIRL agent lets users correct its beliefs about what its objective should be. What we’d ideally want is something more like “corrigibility at arbitrary meta-levels”. I’m a bit skeptical that “fixing some problems with CIRL” is a good way to think about that problem, it feels more like we just don’t have the right tools to formulate what we want mathematically. Given that Bayesianism itself might be the problem, (Bayesian) value uncertainty might in fact be a counterproductive move in the long term. Hard to say right now IMO, but I wouldn’t just want to assume CIRL as a starting point for figuring out corrigibility.
I’m not super optimistic about the implicit plan to (1) figure out a formalism that gives us corrigibility and then (2) build an AI that uses that formalism at runtime, because (1) seems hard and (2) seems really hard using deep learning (as Rohin has said). What Paul describes here currently sounds to me like a more promising way to think about corrigibility. That said, trying more things seems good, and I would still be extremely excited about a good formulation of “corrigibility at arbitrary meta-levels” (even though I think the difficulty of step (2) would likely prevent us from applying that formulation directly to building AI systems).
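To make the misspecification point above concrete, here’s a minimal numerical sketch (a toy example of my own, not from the post or the CIRL/OSG papers; all parameter names are illustrative). H is “irrational” in a specific way: a fixed bias pushes it toward one action regardless of its values. If R’s likelihood model includes that bias, i.e. R knows exactly how H is irrational, the posterior recovers the true value parameter; if R assumes a clean Boltzmann-rational H, the posterior lands, just as confidently, on the wrong value.

```python
import numpy as np

rng = np.random.default_rng(0)

thetas = np.array([-1.0, 1.0])                 # candidate human values (uniform prior)
theta_true, beta, bias_true = 1.0, 1.0, -3.0   # bias < 0: H leans toward 'left' regardless of theta

def p_right(theta, bias):
    # P(H chooses 'right'), where 'right' has utility theta and 'left' has -theta,
    # under a Boltzmann policy with rationality beta plus a fixed choice bias.
    return 1.0 / (1.0 + np.exp(-(2.0 * beta * theta + bias)))

# Simulate 200 choices from H's actual (biased, hence "irrational") policy.
choices = rng.random(200) < p_right(theta_true, bias_true)

def posterior(choices, assumed_bias):
    # R's posterior over theta under whatever human model R assumes.
    log_post = np.array([
        np.sum(np.where(choices,
                        np.log(p_right(th, assumed_bias)),
                        np.log(1.0 - p_right(th, assumed_bias))))
        for th in thetas
    ])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

print("P(theta | data), R models H's bias correctly:     ", posterior(choices, bias_true))
print("P(theta | data), R assumes an unbiased rational H:", posterior(choices, 0.0))
# The first posterior concentrates on theta = +1 (H's actual value); the second
# concentrates, just as confidently, on theta = -1.
```

The second case is exactly the fully-updated-deference worry: under a misspecified model R becomes certain of the wrong values, and at that point has no reason left to defer.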
I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of different ways to approach it. One is to build increasingly accurate human models, either through cognitive science research or by trying to learn them. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or by doing something other than Bayesianism that doesn’t rely on a likelihood model. I’m more sympathetic to the latter approach, mostly because reducing human model misspecification to zero seems categorically impossible (unless we can fully simulate human minds, which has other problems).
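To illustrate the “uncertainty over possible human models” option, here is a toy sketch of my own (the parameter names and the grid of candidate models are purely illustrative, not a worked-out proposal): rather than committing to one likelihood model, R keeps a joint posterior over the value parameter and a human-model parameter, and marginalizes the latter out.

```python
import numpy as np

rng = np.random.default_rng(0)

thetas = np.array([-1.0, 1.0])                 # candidate human values
biases = np.array([-3.0, 0.0, 3.0])            # candidate human models (choice biases)
theta_true, beta, bias_true = 1.0, 1.0, -3.0   # H's actual value and actual bias

def p_right(theta, bias):
    # P(H chooses 'right') under a Boltzmann-style policy with a fixed choice bias
    return 1.0 / (1.0 + np.exp(-(2.0 * beta * theta + bias)))

choices = rng.random(200) < p_right(theta_true, bias_true)  # simulated H data

# Joint log-posterior over (theta, bias) with a uniform prior on both.
log_post = np.array([[np.sum(np.where(choices,
                                      np.log(p_right(th, b)),
                                      np.log(1.0 - p_right(th, b))))
                      for b in biases]
                     for th in thetas])
post = np.exp(log_post - log_post.max())
post /= post.sum()

print("P(theta | data), marginalized over human models:", post.sum(axis=1))
# This recovers theta = +1 because one of the candidate human models is roughly
# correct; if none of them were, we would be back to misspecification, which is
# why this mitigates the problem rather than solving it.
```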
I also share your concern about the human-evaluating-atomic-actions failure mode. Another challenge with this line of research is that it implicitly assumes a particular scale, when in reality that scale is just one point in a hierarchy. For example, the CIRL paper treats “make paperclips” as an atomic action. But we could easily increase the scale (“construct and operate a paperclip factory”) or decrease it (“bend this piece of wire” or even “send a bit of information to this robot arm”). “Make paperclips” was probably chosen because it’s the most natural level of abstraction for a human, but how do we figure that out in general? I think this is an unsolved challenge for reward learning (including this post).
My claim wasn’t that CIRL itself belongs to a “near-corrigible” class, but rather that some of the non-corrigible behaviors described in the post do. (For example, R no-op’ing until it gets more information rather than immediately shutting off when told to.) This isn’t sufficient to claim that optimal R behavior in CIRL games is always or even often of this type, just that it possibly is, and therefore I think it’s worth figuring out whether this is a coherent behavior class or not. Do you disagree with that?
Thanks for clarifying, that makes sense.