Steven Byrnes’s position, if we understand it correctly, is that the AI should learn to behave in non-dangerous-seeming ways[2]. …
This seems like a sensible approach, but it has a crucial flaw: the AI must behave in typical ways to achieve typical consequences, yet how it achieves those consequences doesn’t have to be typical.
I think that problem may be solvable. See Section 1.1 (and especially 1.1.3) of this post. Here are two oversimplified implementations:
1. The human grades the AI’s actions as norm-following and helpful, and we do RL to make the AI score well on that metric.
2. The AI learns the human concept of “being norm-following and helpful” (from watching YouTube or whatever). Then it makes plans and takes actions only when they pattern-match to that abstract concept. The pattern-matching part is inside the AI, looking at the whole plan as the AI itself understands it. (A rough code sketch of the contrast follows this list.)
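To make the contrast concrete, here is a minimal Python sketch of the two oversimplified implementations. Every name in it (human_grader, concept_match, propose_plan, etc.) is a hypothetical placeholder for illustration, not a claim about any real system or about how Steven would implement either approach.

```python
# Hypothetical sketch contrasting the two oversimplified implementations.
# All objects passed in (policy, env, human_grader, planner, concept_match)
# are illustrative placeholders, not real APIs.

def approach_1_rl_on_human_grades(policy, env, human_grader, steps=1000):
    """#1: an external human grade of the AI's behavior is the reward signal,
    and RL optimizes the policy against that grade."""
    for _ in range(steps):
        state = env.observe()
        action = policy.act(state)
        # The grade is an outside judgment of how the action *looks*.
        reward = human_grader.grade(state, action)
        # Optimization pressure is toward scoring well on the grade,
        # which is not the same as actually being norm-following and helpful.
        policy.update(state, action, reward)


def approach_2_internal_concept_filter(planner, concept_match, threshold=0.9):
    """#2: the AI has learned the concept "norm-following and helpful" and
    only acts on plans that pattern-match to it. The match is computed
    inside the AI, over the whole plan as the AI itself understands it."""
    plan = planner.propose_plan()
    score = concept_match(plan)  # internal pattern-match against the learned concept
    if score >= threshold:
        return plan              # act only when the plan itself matches the concept
    return None                  # otherwise discard the plan and re-plan
```

The intended difference, on this reading, is where the evaluation lives: in #1 it is an external grade on observed behavior, while in #2 it is an internal check applied to the plan as the AI represents it.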
I was thinking of #2, not #1.
It seems like you’re assuming #1. And also assuming that the result of #1 will be an AI that is “trying” to maximize the human grades / reward. Then you conclude that the AI may learn to appear to be norm-following and helpful, instead of learning to be actually norm-following and helpful. (It could also just hack into the reward channel, for that matter.) Yup, those do seem like things that would probably happen in the absence of other techniques like interpretability. I do have a shred of hope that, if we had much better interpretability than we do today, approach #1 might be salvageable, for reasons implicit in here and here. But still, I would start at #2 not #1.
I have some draft posts explaining some of this stuff better; I can share them privately, or you can hang on another month or two. :)
I’d like to see them. I’ll wait for the final (posted) versions, I think.