This is all very well, provided you know which part of the AI’s code contains the utility function and are certain it’s not going to be modified. But it seems to me that if you were able to calculate the utility of world-outcomes modularly, then you wouldn’t need an AI in the first place; you would instead build an Oracle, give it your possible actions as input, and select the action with the greatest utility. Consequently, if you have an AI, it is because your utility calculation is not a separable piece of code, but some sort of global function of a huge number of inputs and internal calculations. How can you apply a filter to that?
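To make the Oracle scenario concrete, here is a minimal sketch of what "give it your possible actions as input, and select the action with the greatest utility" would look like if utility really were a separable, callable module. The names `select_best_action`, `utility_of`, and the toy utility table are illustrative assumptions, not anything specified in the thread.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def select_best_action(
    actions: Iterable[Action],
    utility_of: Callable[[Action], float],
) -> Action:
    """Return the action whose predicted outcome has the greatest utility.

    Assumes utility_of is a self-contained function we can call directly,
    which is exactly the modularity the comment doubts a real AI would have.
    """
    return max(actions, key=utility_of)


# Hypothetical toy utilities, for illustration only.
toy_utilities = {"wait": 0.0, "build": 2.5, "explore": 1.0}
best = select_best_action(toy_utilities, lambda action: toy_utilities[action])
```

The whole procedure reduces to one `max` call over the candidate actions, which is the point of the objection: if the utility calculation could be pulled out this cleanly, little else would be left for an AI to do.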
You’ve assumed away the major difficulty, that of knowing what the AI’s utility function is in the first place! If you can simply inspect the utility function like this, there’s no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don’t want.
If you know the utility function, you have no need to filter it. If you don’t know it, you can’t filter it.
But it seems to me that if you were able to calculate the utility of
world-outcomes modularly, then you wouldn’t need an AI in the first
place; you would instead build an Oracle, give it your possible actions
as input, and select the action with the greatest utility.
That sounds as though it is just an intelligent machine which has been crippled by being forced to act through a human body.
You’ve assumed away the major difficulty, that of knowing what the AI’s utility function is in the first place! If you can simply inspect the utility function like this, there’s no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don’t want.
Knowing what U is, and figuring out if U will result in outcomes that you like, are completely different things! We have little grasp of the space of possible outcomes; we don’t even know what we want, and we can’t imagine some of the things that we don’t want.
Yes, we do need to have some idea of what U is—or at least something (a simple AI subroutine applying the filter, an AI designing its next self-improvement) has to have some idea. But it doesn’t need to understand U beyond what is needed to apply F. And since F is considerably simpler than what U is likely to be...
It seems plausible that F could be implemented by a simple subroutine even across self-improvement.
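As a rough illustration of the claim that the filter needs to understand U only as far as applying F requires, the sketch below treats U as an opaque callable and wraps it. The thread does not say here what F actually does, so the particular filter shown (replacing the utility of flagged outcomes with a constant) is a hypothetical stand-in, and `apply_filter`, `is_flagged`, and `opaque_u` are assumed names.

```python
from typing import Callable

Outcome = str
Utility = Callable[[Outcome], float]


def apply_filter(
    u: Utility,
    is_flagged: Callable[[Outcome], bool],
    constant: float = 0.0,
) -> Utility:
    """Return a filtered utility F(U) that only ever calls u, never inspects it.

    Hypothetical filter: flagged outcomes get a fixed utility, while every
    other outcome keeps whatever value the original u assigns.
    """
    def filtered(outcome: Outcome) -> float:
        if is_flagged(outcome):
            return constant
        return u(outcome)

    return filtered


# Stand-in for an arbitrarily complicated U; the wrapper never looks inside it.
def opaque_u(outcome: Outcome) -> float:
    return len(outcome) * 1.5


filtered_u = apply_filter(opaque_u, lambda outcome: outcome == "shutdown")
```

The wrapper stays the same size however complicated `opaque_u` becomes. It does still assume that U is available as something that can be called, which is the assumption the earlier comments challenge.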
That sounds as though it is just an intelligent machine which has been crippled by being forced to act through a human body.
You suggest that would be better—but how?
Good comment.