My first half-baked thoughts about what sort of abstraction we might use instead of utility functions:
Maybe instead of thinking about preferences as rankings over worlds, we think of preferences as something like gradients. Given the situation an agent finds itself in, there are some directions in state space that it prefers to move in and some that it disprefers. And as the agent moves through the world and its situation changes, its preference gradients might change too.
This allows for cycles, where from a, the agent prefers b, and from b, the agent prefers c, and from c, the agent prefers a.
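As a toy sketch (the representation and names here are my illustration, not anything from the original), a "preference gradient" can be modeled as a map from the current state to the successor state the agent prefers to move toward. Nothing in this representation forces global consistency, so the a → b → c → a cycle is directly expressible:

```python
# Toy model: a "preference gradient" maps the agent's current state to the
# successor state it currently prefers to move toward. Unlike a utility
# function, nothing forces these local preferences to be globally
# consistent, so cycles are representable.

preferred_move = {"a": "b", "b": "c", "c": "a"}  # from a prefer b, from b prefer c, ...

def follow_gradient(state, steps):
    """Trace where the agent drifts if it always moves along its gradient."""
    trajectory = [state]
    for _ in range(steps):
        state = preferred_move[state]
        trajectory.append(state)
    return trajectory

print(follow_gradient("a", 6))  # cycles: ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```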
It also means that preferences are inherently contextual. It doesn’t make sense to ask what an agent wants in the abstract, only what it wants given some situated context. This might be a feature, not a bug, in that it resolves some puzzles about values.
This implies a sort of non-transitivity of preferences. If you can predict that you'll want something in the future, that doesn't necessarily imply that you want it now.
A problem with this is that it's also too expressive. For any policy π, you can encode that policy as such a gradient: if π takes action a in state s, say that your gradient points toward a (or toward the state s' that results from taking action a), and away from every other action / state.
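The construction in the paragraph above can be made concrete (a minimal sketch; the policy and state names are made up for illustration). Any policy induces a gradient that points at the policy's chosen action and away from everything else, and following that gradient reproduces the policy exactly:

```python
# Any policy can be encoded as a preference gradient: in each state, the
# gradient "points toward" the action the policy takes (+1) and away from
# the rest (-1). This is why the representation is too expressive: it
# places no constraint at all on behavior.

def gradient_from_policy(policy, actions):
    """Return a gradient: state -> {action: +1 if chosen by policy, else -1}."""
    return {s: {a: (1 if a == chosen else -1) for a in actions}
            for s, chosen in policy.items()}

pi = {"s0": "left", "s1": "right"}          # an arbitrary policy
grad = gradient_from_policy(pi, ["left", "right"])

# A gradient-following agent recovers the original policy exactly:
recovered = {s: max(g, key=g.get) for s, g in grad.items()}
assert recovered == pi
```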
I happen to agree with this generalization, provided we also respect the constraint “if you can predict that you’ll want something in the future, then you want it now”. (There might also be other coherence constraints I would want to impose! But this is a central one.)
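One way to read this constraint (my gloss, not a formalization taken from the original): the agent's current valuation of an outcome should equal its expectation, under its own beliefs, of its future valuation — a martingale-style condition. A tiny numeric sketch, with made-up probabilities:

```python
# Sketch of the constraint "if you can predict that you'll want something in
# the future, then you want it now": current value = expected future value
# under the agent's own beliefs. (This martingale-style reading is one
# possible formalization, not the author's; numbers are illustrative.)

# The agent believes that with prob. 0.7 its future self will value the
# outcome at 10, and with prob. 0.3 at 2.
beliefs = [(0.7, 10.0), (0.3, 2.0)]

current_value = sum(p * v for p, v in beliefs)

# Valuing the outcome at, say, 0 now, while predicting you'll value it
# highly, is exactly the kind of violation the constraint rules out.
assert abs(current_value - 7.6) < 1e-9
```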
On the one hand, an agent that violates this will usually prefer to self-modify to remove the violation. It might not entirely stop its preferences from changing, but it would certainly want to change the method of change, at least. This is very much like a philosopher who doesn't trust their own deliberation process: they might not want to entirely stop thinking (some ways of changing your mind are good), but they would want to modify their reasoning somehow.
(Furthermore, an agent who sees this kind of thing coming, but does not yet inhabit either conflicting camp, would probably want to self-modify in some way to avoid the conflict.)
On the other hand, suppose an agent passes through this kind of belief change without having an opportunity to self-modify. The agent will think its past self was wrong to want to resist the change, and it will want to avoid that type of mistake in the future. If we assume that learning tends to make modifications which would have 'helped' the agent's past self, then such an agent will learn to predict value changes and learn to agree with those predictions.
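The learning dynamic described above can be sketched as a toy update rule (my illustration; the numbers and update rule are made up): deliberation keeps pushing the agent's value toward some endpoint, and a learner that retroactively "helps" its past self moves the current value toward the predicted future value until the two agree:

```python
# Toy sketch of "learn to predict value changes and learn to agree with
# them": deliberation drifts the agent's value toward an endorsed endpoint;
# the learning update moves the *current* value toward the *predicted
# future* value. (Update rule and constants are illustrative only.)

def deliberate(value):
    """One step of value change: drift halfway toward an endpoint of 1.0."""
    return value + 0.5 * (1.0 - value)

current = 0.0
for step in range(20):
    predicted_future = deliberate(current)          # predict the next self
    current += 0.5 * (predicted_future - current)   # ...and update to agree

# After enough steps, current values approximately agree with predicted
# future values: the agent no longer resists its own value changes.
assert abs(deliberate(current) - current) < 1e-2
```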
This gives us something similar to logical induction.
You mentioned in the article that you intuitively want some kind of "dominance" argument which Dutch books / money pumps don't give you. I would propose logical-induction-style dominance. What you have is essentially the guarantee that someone with cognitive powers comparable to yours can't come in and do a better job of satisfying your (future) values.
Why do we want that guarantee?
The usefulness of the current action to future preferences is what’s important for learning, since future preferences are the ones which get to decide how to modify things. So this is a notion of “doing the best we can” with respect to learning: we couldn’t benefit from the advice of someone with similar cognitive strength to us.
Relatedly, this is important for tiling agents: if (it looks to you like) a different configuration of a similar amount of processing power would do a better job, then you’d prefer to self-modify to that configuration.
This implies a sort of non-transitivity of preferences. If you can predict that you'll want something in the future, that doesn't necessarily imply that you want it now.
Relaxing independence rather than transitivity is the most explored angle of attack IIRC.