Yeah, I agree that the mistake model implied by your proposal isn’t correct, and as a result you would not infer the true utility function. Of course, you might still infer one that is sufficiently close that we get a great future.
Tbc, I do think there are lots of other useful ways of thinking about the problem that aren't captured by the "mistake model" framing. I use the "mistake model" framing because it often offers a different perspective on a proposal, and helps pinpoint what your alignment proposal is relying on.
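(To make the "mistake model" framing concrete, here's a toy sketch, with made-up candidate utility functions and numbers, of how an assumed mistake model enters reward inference. It assumes a Boltzmann-rational mistake model on the inference side and an epsilon-greedy human on the data side; none of this is from your proposal, it's just for illustration.)

```python
import numpy as np

# Toy illustration (all names and numbers made up): three options, and a
# few candidate utility functions the inference procedure considers.
candidate_utils = {
    "u_true":  np.array([1.0, 0.5, 0.0]),   # the human's actual utilities
    "u_close": np.array([1.0, 0.4, 0.1]),   # slightly different
    "u_wrong": np.array([0.0, 0.5, 1.0]),   # reversed preferences
}

def boltzmann_probs(utils, beta):
    """Assumed mistake model: pick option i with probability
    proportional to exp(beta * utils[i])."""
    logits = beta * utils - (beta * utils).max()  # subtract max for stability
    p = np.exp(logits)
    return p / p.sum()

def epsilon_greedy_probs(utils, epsilon=0.4):
    """The human's *actual* mistake model: pick the best option most of
    the time, otherwise pick uniformly at random."""
    p = np.full(len(utils), epsilon / len(utils))
    p[np.argmax(utils)] += 1.0 - epsilon
    return p

rng = np.random.default_rng(0)
observed = rng.choice(3, size=500, p=epsilon_greedy_probs(candidate_utils["u_true"]))

# Bayesian inference over candidate utilities under the (misspecified)
# Boltzmann mistake model, with a uniform prior over candidates.
assumed_beta = 5.0
log_lik = np.array([
    np.log(boltzmann_probs(u, assumed_beta))[observed].sum()
    for u in candidate_utils.values()
])
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

for name, prob in zip(candidate_utils, posterior):
    print(f"{name}: {prob:.3f}")
# Typically the posterior here lands on u_close rather than u_true: a wrong
# mistake model gives a wrong, but possibly nearby, utility function.
```

The point is just that the inference is only as good as the assumed mistake model: misspecify it and the posterior can end up near, but not on, the true utilities.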
Of course, this is all assuming that a true utility function exists, but I think we can replace "true utility function" with "the utility function that encodes the optimal actions to take for the best possible universe" and everything still goes through. And not hitting this target just means that we don't do the perfectly optimal thing; it's entirely possible that what we end up doing is only very slightly suboptimal.