The basic idea that the utility function should be learned is right, though perhaps one can still build a wrapper around it at some efficiency cost, rather than integrating the two modules.
However, I think the post spirals into unfounded optimism about the consequences of this observation. When the utility function “breaks”, that probably looks like collapse to preferring something we didn’t intend, perhaps because the planner has pushed the utility function out of distribution as you described. At least under the wrapper design, it should never look like incoherent action: Bayes-optimal decisions with respect to ANY utility function are coherent, so the agent would presumably continue to function, just in pursuit of some bizarre goal. Perhaps some utility functions are sufficiently discontinuous that the agent really does start flipping out as you suggest, but this need not be the case. As an existence proof, a broken utility function that collapses to depending directly on the sensor inputs, instead of on their correlates in the world, is perfectly tractable, and there is a consistent, stable agent which optimizes for it, at least in theory: AIXI with an appropriate reward function.
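To make the coherence point concrete, here is a minimal sketch of the wrapper picture in my own notation (the symbols below are my assumptions, not anything from the post): the planner holds a predictive distribution $P(s \mid a)$ over outcomes and queries a learned utility model $\hat{U}$, choosing

$$a^{*} \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}_{s \sim P(\cdot \mid a)}\big[\hat{U}(s)\big].$$

Nothing in this rule requires $\hat{U}$ to be the utility function we intended: for any fixed $\hat{U}$, the induced preferences over actions are complete and transitive, so behavior stays coherent even when $\hat{U}$ has collapsed to, say, a direct function of the sensor readings. AIXI is roughly this template, with the Solomonoff mixture playing the role of $P$ and cumulative reward playing the role of $\hat{U}$.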