I agree that human model misspecification is a severe problem, for CIRL as well as for other reward modeling approaches. There are a couple of ways to approach it. One is to do cognitive science research to build increasingly accurate human models, or to try to learn them directly. The other is to build reward modeling systems that are robust to human model misspecification, possibly by maintaining uncertainty over possible human models, or by using something other than Bayesianism that doesn’t rely on a likelihood model at all. I’m more sympathetic to the latter approach, mostly because reducing human model misspecification to zero seems categorically impossible (unless we can fully simulate human minds, which has other problems).
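As a rough illustration of the “maintain uncertainty over human models” option, here is a minimal sketch (my own, not from the post or the CIRL paper) that mixes a small family of Boltzmann-rationality human models instead of committing to one; all names and numbers in it are made up.

```python
# Sketch: Bayesian reward inference that keeps uncertainty over several
# candidate human models, so no single misspecified likelihood dominates.
import numpy as np

# Hypothetical reward hypotheses: each assigns a value to 3 possible actions.
REWARD_HYPOTHESES = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
]

def boltzmann_human(beta):
    """One candidate human model: picks action a with P(a) proportional to exp(beta * r[a])."""
    def likelihood(action, reward):
        logits = beta * reward
        probs = np.exp(logits - logits.max())
        return probs[action] / probs.sum()
    return likelihood

# A small family of human models (different rationality levels).
HUMAN_MODELS = [boltzmann_human(b) for b in (5.0, 1.0, 0.2)]

def robust_posterior(observed_action):
    """P(reward | observed action), marginalizing over which human model is right."""
    reward_prior = np.ones(len(REWARD_HYPOTHESES)) / len(REWARD_HYPOTHESES)
    model_prior = np.ones(len(HUMAN_MODELS)) / len(HUMAN_MODELS)

    # joint[m, r] is proportional to P(model m) * P(reward r) * P(action | r, model m)
    joint = np.array([
        [model_prior[m] * reward_prior[r] *
         HUMAN_MODELS[m](observed_action, REWARD_HYPOTHESES[r])
         for r in range(len(REWARD_HYPOTHESES))]
        for m in range(len(HUMAN_MODELS))
    ])
    joint /= joint.sum()
    return joint.sum(axis=0)  # marginalize out the human model

print(robust_posterior(observed_action=1))
```

This only pushes the problem up a level (the family itself can be misspecified), but it at least avoids betting everything on one likelihood model.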
I also share your concern about the human-evaluating-atomic-actions failure mode. Another challenge with this line of research is that it implicitly assumes a particular scale, when in reality that scale is just one point on a hierarchy. For example, the CIRL paper treats “make paperclips” as an atomic action. But we could easily increase the scale (“construct and operate a paperclip factory”) or decrease it (“bend this piece of wire,” or even “send a bit of information to this robot arm”). “Make paperclips” was probably chosen because it’s the most natural level of abstraction for a human, but how do we figure that out in general? I think this is an unsolved challenge for reward learning (including this post).
My claim wasn’t that CIRL itself belongs to a “near-corrigible” class, but rather that some of the non-corrigible behaviors described in the post do. (For example, R no-op’ing until it gets more information rather than immediately shutting off when told to.) That isn’t sufficient to claim that optimal R behavior in CIRL games always, or even often, falls into this class, just that it possibly does, and so I think it’s worth figuring out whether this is a coherent behavior class or not. Do you disagree with that?
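To make the no-op example concrete, here is a toy expected-value calculation (my own framing, with made-up numbers) showing why, under R’s posterior, no-op’ing and asking first can look better to R than immediate shutdown:

```python
# Toy comparison: comply with the shutdown command now, vs. no-op and ask first.
# All quantities below are hypothetical and purely illustrative.
p_human_wants_shutdown = 0.7   # R's credence that the command reflects the human's reward
v_shutdown = 0.0               # shutting down is treated as a neutral baseline
v_continue_if_wanted = 5.0     # value R adds if continuing is what the human actually wants
query_cost = -0.1              # small delay from no-op'ing and asking a clarifying question

# Option 1: comply immediately.
ev_shutdown_now = v_shutdown

# Option 2: no-op, ask, then shut down or continue, whichever the answer supports.
ev_noop_and_ask = (query_cost
                   + p_human_wants_shutdown * v_shutdown
                   + (1 - p_human_wants_shutdown) * v_continue_if_wanted)

print(ev_shutdown_now, ev_noop_and_ask)  # 0.0 vs 1.4: the no-op looks optimal to R
```

The no-op wins purely because of the value-of-information term, which is the behavior I had in mind when asking whether this class is coherent.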
Thanks for clarifying, that makes sense.