A friend (correctly) recommended this post to me as useful context, and I'm documenting my thoughts here for easy reference. What follows is not, strictly speaking, an objection to the headline claim of the post; it's a claim that coherence will tend to emerge in practice.
That the agent knows in advance what trades they will be offered.
This assumption doesn’t hold in real life. It’s a bit like saying “If I know what moves my opponent will make, I can always beat them at chess.” Well, yes. But in practice you don’t, and real-world agents can’t rely on that kind of perfect knowledge. Directionally, though, agents become less exploitable and more efficient as their preferences grow more explicit and coherent, and in actual practice, training a neural net to solve problems without getting stuck also trains it to have more explicit and coherent preferences.
(If the agent doesn’t know in advance what trades they will be offered or is incapable of backward induction, then their pursuit of a dominated strategy need not indicate any defect in their preferences. Their pursuit of a dominated strategy can instead be blamed on their lack of knowledge and/or reasoning ability.)
I blame it on both. The lack of knowledge in question is the fact that agents in practice aren’t omni-prescient. The lack of reasoning ability in question is a refusal to assign an explicit preference ordering to outcomes.
If you don’t know the whole decision tree in advance, then “if I previously turned down some option X, I will not choose any option that I strictly disprefer to X” will probably be violated at some point even without adversarial exploitation: e.g. you reject X1 and X2 early on, and later the only options on offer are X1- and X2- (strictly worse versions of X1 and X2), so whatever you pick breaks the rule.
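Here’s a minimal sketch of that failure mode in Python. Everything in it — the option names, the (incomplete) strict-preference relation, and the three-node sequence of choices — is made up for illustration; it just shows the rule running out of legal moves once the unforeseen node arrives.

```python
# A toy run of the scenario above. The options, the (incomplete) strict-preference
# relation, and the order of choice nodes are all hypothetical.

# The only strict preferences the agent has: X1 beats X1-, X2 beats X2-.
# Everything else is incomparable, which is why turning down X1 and X2
# earlier was a perfectly legal move under the rule.
STRICTLY_BETTER = {
    "X1-": {"X1"},
    "X2-": {"X2"},
}

def rule_allows(option, previously_rejected):
    """The quoted policy: never pick an option strictly dispreferred
    to something you already turned down."""
    return not (STRICTLY_BETTER.get(option, set()) & previously_rejected)

rejected = set()
rejected.add("X1")  # node 1: took some incomparable option instead of X1
rejected.add("X2")  # node 2: likewise turned down X2

# Node 3, which the agent did not foresee: only the minus versions remain.
remaining = ["X1-", "X2-"]
print([opt for opt in remaining if rule_allows(opt, rejected)])
# -> []   every available option breaks the rule, with no adversary in sight
```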
Even if I grant the entire rest of the post, it still seems highly probable that sufficiently smart AIs grown using modern methods end up with coherent preference orderings in most of the ways that matter.
Somewhat relatedly, “If I previously turned down some option X, I will not choose any option that I strictly disprefer to X” does feel to me like a grafted-on hack of a policy that breaks down in some adversarial edge case.
Maybe it’s airtight, I’m not sure. But if it is, that just feels like coherence with extra steps? Like, sure, you can pursue a strategy of incoherence that requires you to know the entire universe of possible trades you will be offered and then backchain inductively to make sure you are never, ever exploitable.
Or you could make your preferences explicit and be consistent in the first place. In a sense, I think that’s the simple, elegant thing that the weird hack approximates.
If you have coherent preferences, you get the hack for free. I think an agent with coherent preferences performs at least as well with the same assumptions (prescience, backchaining) on the same decision tree, and performs better if you relax one or more of those assumptions.
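As a concrete illustration of the “for free” part, here’s a toy money-pump comparison (again entirely hypothetical: the options, utilities, fee, and offer sequence are all made up). The coherent maximizer refuses to be traded around a losing cycle using nothing but its utility function and the offer currently in front of it — no memory of past rejections, no lookahead over future offers — whereas an agent with cyclic preferences gets walked around the loop, paying a fee at every step.

```python
# A toy money pump under hypothetical preferences. UTILITY, CYCLIC_PREFS, FEE,
# and the offer sequence are all made up for illustration.

UTILITY = {"A": 3.0, "B": 2.0, "C": 1.0}             # coherent agent: a total order
CYCLIC_PREFS = {("A", "B"), ("B", "C"), ("C", "A")}  # incoherent agent: A>B, B>C, C>A
FEE = 0.01                                           # cost charged per accepted trade

def coherent_accepts(current, offered):
    # Trade only if the offer is genuinely better after paying the fee.
    return UTILITY[offered] - FEE > UTILITY[current]

def cyclic_accepts(current, offered):
    # Trade whenever the offer is "preferred", ignoring the accumulating fees.
    return (offered, current) in CYCLIC_PREFS

def run_pump(accepts, start="C", offers=("B", "A", "C", "B", "A", "C")):
    holding, fees_paid = start, 0.0
    for offer in offers:
        if accepts(holding, offer):
            holding, fees_paid = offer, fees_paid + FEE
    return holding, round(fees_paid, 2)

print(run_pump(coherent_accepts))  # ('A', 0.02): climbs to its top option, then stops trading
print(run_pump(cyclic_accepts))    # ('C', 0.06): ends where it started, having paid fees every lap
```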
In practice, it pays to be the sort of entity that attempts to have consistent preferences about things whenever that’s decision-relevant and computationally tractable.