You’ve assumed away the major difficulty, that of knowing what the AI’s utility function is in the first place! If you can simply inspect the utility function like this, there’s no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don’t want.
Knowing what U is, and figuring out if U will result in outcomes that you like, are completely different things! We have little grasp of the space of possible outcomes; we don’t even know what we want, and we can’t imagine some of the things that we don’t want.
Yes, we do need to have some idea of what U is—or at least something (a simple AI subroutine applying the filter, an AI designing its next self-improvement) has to have some idea. But it doesn’t need to understand U beyond what is needed to apply F. And since F is considerably simpler than what U is likely to be...
It seems plausible that F could be implemented by a simple subroutine even across self-improvement.
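To make the "simple subroutine" idea concrete, here is a minimal Python sketch (all names hypothetical, not from the original discussion, and assuming outcomes can be represented as plain values): the routine applying F treats U as an opaque black box, which is the sense in which F can stay simple even if U is arbitrarily complex.

```python
# Minimal sketch: a filter F wrapped around an opaque utility function U.
# The wrapping subroutine never inspects how U computes its values; it
# only needs F's simple admissibility check. All names are illustrative.

from typing import Callable

Outcome = str  # stand-in type; a real agent would use a richer world-model

def make_filtered_utility(
    U: Callable[[Outcome], float],
    F: Callable[[Outcome], bool],
    penalty: float = float("-inf"),
) -> Callable[[Outcome], float]:
    """Return a utility function that defers to U only on outcomes F admits.

    F answers "is this outcome admissible?" without understanding U,
    so the filter stays much simpler than the utility function itself.
    """
    def filtered_U(outcome: Outcome) -> float:
        return U(outcome) if F(outcome) else penalty
    return filtered_U

if __name__ == "__main__":
    # Hypothetical stand-ins: a complicated, opaque U and a far simpler F.
    U = lambda outcome: hash(outcome) % 100           # opaque "complex" U
    F = lambda outcome: "catastrophe" not in outcome  # simple admissibility test

    U_safe = make_filtered_utility(U, F)
    print(U_safe("build a factory"))        # admitted: returns U's value
    print(U_safe("trigger a catastrophe"))  # filtered out: returns -inf
```

Because `filtered_U` closes over U without ever reading its internals, the same wrapper could in principle be reapplied to whatever utility function a successor agent adopts, which is what the claim about surviving self-improvement amounts to here.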
Good comment.