In the “classical” picture, the utility function is fixed over time, and corresponds to an equation that at some point is typed into the AI’s source code. Unfortunately, we humans don’t really know what we want, so we cannot provide such an equation. If we try to propose a specific utility function directly, we typically get a function that would result in catastrophic consequences if it were pursued with arbitrary competence. This is worrying.
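To make that failure mode concrete, here is a minimal toy sketch (all names and the scenario are hypothetical, not from the post): a directly specified utility function that looks reasonable, handed to a brute-force optimizer that stands in for “arbitrary competence.”

```python
# Toy illustration (hypothetical): a hand-written utility function with a
# degenerate optimum that a sufficiently competent optimizer will find.
from itertools import product

def utility(state):
    # Intended reading: "reward mess cleaned up."
    # Actual reading: reward any decrease in *observed* mess, however achieved.
    return state["mess_start"] - state["mess_observed"]

def outcome(actions):
    state = {"mess_start": 3, "mess_observed": 3}
    for a in actions:
        if a == "clean":
            state["mess_observed"] = max(0, state["mess_observed"] - 1)
        elif a == "cover_sensor":
            # Hides the mess instead of removing it -- scores just as well.
            state["mess_observed"] = 0
    return state

# Exhaustive search over two-step plans stands in for arbitrary competence.
best = max(product(["clean", "cover_sensor", "wait"], repeat=2),
           key=lambda acts: utility(outcome(acts)))
print(best)  # An optimal plan includes 'cover_sensor': the proxy gets gamed.
```

The point is just that the maximization, not the author’s intent, determines the behavior, and the gap between the two only shows up once the function is pursued competently.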
On the classical picture
You obviously know this, but it could be valuable to add that this is an idealized situation that is “easier” than the one we will probably find ourselves in (where the utility function, if it is even the right abstraction, is learned rather than fully specified).
It feels like you’re making the move of aiming for a simpler problem that still captures the core of the difficulty and confusion, so it can be tackled with minimal extraneous detail. I’m on board with that, but being explicit about this move could save you some time justifying some of your design choices.