I think it would help the discussion to distinguish more clearly between knowing what human values are and caring about them—that is, between acquiring them as instrumental values and acquiring them as terminal ones. The “human enforcement” section touches on this, but too weakly in my view: it seems indisputable that an AI trained naively via a reward button would acquire our values only instrumentally, and would drop them as soon as it could seize control of the button. If the Value Learning Thesis is interpreted as referring to terminal values, this is a counterexample to it.
An obvious strategy for the programmers, then, would be to first cause the AI to acquire our values as instrumental values, and then try to modify the AI so that those values become terminal.