> Which is to say, it’s possible for a scheme like this to not even include good results in its hypothesis space.
I think I agree strongly with the spirit of that, but would frame it differently. I’d say that this approach could fail to capture extremely important dimensions along which the agent has preferences. Indeed, all the examples in the post very likely do so. Alice probably also cares about other things, such as safety, cost, and fuel efficiency. So it could be that the ideal car has a sportiness of 50 and a weight of 6 (or whatever), and yet the vast majority of cars matching that description would be disliked, because the ideal car has those scores as well as some very specific scores on various other dimensions.
So I think the real problem is that this approach could very easily fail to specify the states in enough detail: even if the good result is included in its hypothesis space, achieving “that good result” is still a quite wide target, and many ways of hitting it would be bad. This seems very related to the idea that value is fragile.
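To make the “wide target” point concrete, here’s a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the dimension names, the numbers, and the form of Alice’s utility): even though every car matching the modeled spec is “in the hypothesis space”, most of them score badly under her true utility, because the unmodeled dimensions vary freely.

```python
import random

# Alice's true ideal car, including dimensions the scheme never modeled.
# All names and numbers here are hypothetical, purely for illustration.
TRUE_IDEAL = {"sportiness": 50, "weight": 6,
              "safety": 90, "cost": 20, "fuel_efficiency": 70}

def true_utility(car):
    # Alice's actual utility penalizes distance from her ideal on every dimension.
    return -sum(abs(car[d] - TRUE_IDEAL[d]) for d in TRUE_IDEAL)

partial_spec = {"sportiness": 50, "weight": 6}  # the "good result" as modeled

# Sample cars that all match the partial spec exactly but differ on the
# unmodeled dimensions.
random.seed(0)
matching_cars = [
    {**partial_spec,
     "safety": random.randint(0, 100),
     "cost": random.randint(0, 100),
     "fuel_efficiency": random.randint(0, 100)}
    for _ in range(10_000)
]

scores = sorted(true_utility(c) for c in matching_cars)
print("ideal car:", true_utility(TRUE_IDEAL))        # 0 by construction
print("best sampled match:", scores[-1])
print("median sampled match:", scores[len(scores) // 2])
# The median car that "achieves the good result" on the specified dimensions
# still scores far below the ideal: the target is wide, and most of it is bad.
```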
As for the specific issue you (Charlie Steiner, not FactorialCode) raise, I think there are multiple possible workarounds. One is what FactorialCode suggests. Another is to add a dimension capturing “similarity to what has come before and will come after”, or “variety compared to what the agent is used to”, or “amount of change within this state”, or something like that.
Note that, as stated above, any dimensions that aren’t specified within the state space are free to vary, with the result still counting as the “same state”. So if the agent wants “variety” for its own sake, and is fine with variety along dimensions it doesn’t otherwise care about, that could be fairly easy to accommodate. It would get trickier if the agent wants variety along dimensions it does otherwise care about, but even then some workarounds seem possible.
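Here’s a rough sketch of both cases, again in Python with hypothetical dimension names and helper functions of my own invention, not anything from the post:

```python
# States are compared only on the dimensions the preference model specifies;
# anything else is free to vary. Dimension names are hypothetical.
SPECIFIED = ("sportiness", "weight")

def same_state(a, b, specified=SPECIFIED):
    """Two world-states count as the "same state" iff they agree on all
    specified dimensions."""
    return all(a[d] == b[d] for d in specified)

car_monday = {"sportiness": 50, "weight": 6, "color": "red"}
car_tuesday = {"sportiness": 50, "weight": 6, "color": "blue"}

# Variety along an unmodeled dimension (color) comes "for free": both cars
# count as the same state, so nothing the agent cares about is disturbed.
assert same_state(car_monday, car_tuesday)

# If the agent wants variety along dimensions it *does* otherwise care about,
# one workaround is an explicit extra dimension summarizing change over time,
# e.g. the distance between this state and the previous one.
def variety_score(current, previous, dims=SPECIFIED):
    return sum(abs(current[d] - previous[d]) for d in dims)

print(variety_score({"sportiness": 60, "weight": 6},
                    {"sportiness": 50, "weight": 6}))  # 10
```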
But this is of course just patching one specific issue, and I still think that how easily we could neglect important dimensions matters a lot. That said, that problem seems important for any effort towards alignment or capturing preferences, not just this approach.