Which is to say, it’s possible for a scheme like this to not even include good results in its hypothesis space.
You can work around this by making your “state space” consist of descriptions of sequences of states, and defining preferences between these sequences.
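To make the suggestion concrete, here is a minimal sketch, assuming a toy setting that is not from the post: the state labels, the `sequence_utility` function, and the "reward switching" rule are all hypothetical. It only illustrates what it means to define preferences over sequences rather than over single states.

```python
from itertools import product

# Hypothetical toy states; nothing here comes from the original post.
STATES = ("pasta", "curry")

def sequences(length):
    """Enumerate every sequence of states of the given length."""
    return list(product(STATES, repeat=length))

def sequence_utility(seq):
    """Illustrative assumption: the agent values switches between consecutive states."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def prefers(seq_a, seq_b):
    """True if seq_a is strictly preferred to seq_b under the toy utility."""
    return sequence_utility(seq_a) > sequence_utility(seq_b)
```

For example, `prefers(("pasta", "curry", "pasta"), ("pasta", "pasta", "pasta"))` holds here, a preference that no utility over single states could express. Note that the number of sequences is `len(STATES) ** length`, so the new "state space" grows exponentially with the horizon.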
Sure. But, as with DanielV’s point about eliminating circularity by just asking for the complete preference ordering, we are limited by what humans can think about.
Humans have to think in terms of high-level descriptions of approximately constant size, no matter the spatiotemporal scale. We literally cannot elicit preferences over universe-histories, much as we’d like to.
What we can do, maybe, is elicit some opinions on these “high-level descriptions of approximately constant size,” at many different spatiotemporal scales: ranging from general opinions on how the universe should go to what could improve the decor of your room today. Stitching these together into a utility function over universe histories is pretty tricky, but I think there might be some illuminating simplifications we could think about.
I think I agree strongly with the spirit of that, but would frame it differently. I’d say that this approach could fail to consider extremely important dimensions over which the agent has value. Indeed, all examples in the post very likely do so. Alice probably also cares about other things, such as safety, cost, fuel efficiency, etc. So it could be that the ideal car would have a sportiness of 50 and a weight of 6 (or whatever), but also that the vast majority of such cars would be disliked, because the ideal car has that as well as some very specific scores for various other dimensions.
So I think the real problem is that this approach could very easily fail to specify the states sufficiently, so that, even if the good result is included in its hypothesis space, achieving “that good result” is still quite a wide target, and many ways of hitting it would be bad.
This seems very related to the idea that value is fragile.
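A rough sketch of the "wide target" worry, under loudly made-up assumptions: the ideal values, the tolerance, the 0 to 100 range for the unspecified dimensions, and the rule that Alice likes a car only if every dimension is near her ideal are all invented for illustration. The elicited state fixes sportiness and weight only, and everything else varies freely.

```python
import random

# Hypothetical "true" ideal for Alice; only sportiness=50 and weight=6 echo the
# example above, the other values are invented for illustration.
IDEAL = {"sportiness": 50, "weight": 6, "safety": 90, "cost": 20, "fuel_efficiency": 70}
TOLERANCE = 10  # assumed: Alice likes a car only if every dimension is this close

def alice_likes(car):
    """Assumed preference rule: every dimension must be near its ideal value."""
    return all(abs(car[k] - IDEAL[k]) <= TOLERANCE for k in IDEAL)

def sample_car_matching_elicited_state(rng):
    """The elicited state pins down two dimensions; the rest are free to vary."""
    car = {"sportiness": 50, "weight": 6}
    for k in ("safety", "cost", "fuel_efficiency"):
        car[k] = rng.uniform(0, 100)  # assumed range for unspecified dimensions
    return car

def fraction_liked(n=10_000, seed=0):
    """Fraction of cars in the 'ideal' elicited state that Alice actually likes."""
    rng = random.Random(seed)
    liked = sum(alice_likes(sample_car_matching_elicited_state(rng)) for _ in range(n))
    return liked / n
```

With these numbers each unspecified dimension lands within tolerance about 20% of the time, so only around 0.2³ ≈ 0.8% of cars that hit the elicited target would actually be liked.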
As for the specific issue you (Charlie Steiner, not FactorialCode) raise, I think there are multiple possible workarounds. One is what FactorialCode suggests. Another would be to add another dimension to capture “similarity to what has come before and will come after”, or “variety compared to what the agent is used to”, or “amount of change within this state”, or something like that.
Note that, as stated above, any dimensions that aren’t specified within the state space are free to vary, with this still counting as the “same state”. So if the agent wants “variety” for its own sake, and is fine with this being variety along dimensions it doesn’t otherwise care about, that could be fairly easy to accommodate. It would get trickier if the agent wants variety along dimensions it does otherwise care about, but it still seems some workarounds could work.
But this is of course just patching one specific issue, and I still think the fact that we could easily neglect important dimensions matters a lot. That said, I think this concern applies to any effort towards alignment or capturing preferences.