I had in mind the setting where no particular assumption is made about the observability of utility at all, as in classic frameworks such as Savage’s. I’m not claiming it is very relevant to the discussion at hand; it’s just that there is some notion of rationality to be studied in such cases. I also misspoke in saying “dire situation for learning theory”: I don’t want to prematurely narrow what learning theory can fruitfully discuss. Something like “dire situation for reinforcement learning” would have been more accurate.
By the way, from the perspective of what I call “subjective” learnability, we can consider this case as well. In subjective learnability, we define regret with respect to an optimal agent that has all of the user’s knowledge (rather than one that knows the actual “true” environment). So it seems not difficult to imagine communication protocols that allow the AI to learn the user’s knowledge, including the unobservable reward function (we assume the user imagines the environment as a POMDP with rewards that are not directly observable but are defined as a function of the state). (Incidentally, I have recently been leaning towards the position that some kind of generic communication protocol, with natural-language “annotation” plus quantilization to avoid detrimental messages, is the best approach to achieving alignment.)
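To make “subjective” a bit more concrete, here is a minimal sketch of the regret notion I have in mind; the notation ($\zeta$, $V_\mu$) is shorthand for this comment, not a canonical formulation. Let $\zeta$ be the user’s belief: a prior over environments $\mu$, each a POMDP with reward function $r : S \to \mathbb{R}$ defined on the (unobserved) states. Subjective regret compares the AI’s policy $\pi$ to the best policy under that same belief:

$$\mathrm{Reg}_\zeta(\pi) \;=\; \max_{\pi'} \mathbb{E}_{\mu \sim \zeta}\!\left[V_\mu(\pi')\right] \;-\; \mathbb{E}_{\mu \sim \zeta}\!\left[V_\mu(\pi)\right]$$

where $V_\mu(\pi)$ is the expected (discounted) total reward of $\pi$ in $\mu$. Subjective learnability then asks that this regret vanish (e.g. in the limit of the discount factor $\gamma \to 1$), rather than requiring low regret against the actual environment.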