Thanks! But H is used as an example, not a proof.
And the chessbots actually illustrate my point: is a bishop-retaining chessbot actually intending to retain its bishop, or is it an agent that wants to win, but has a bad programming job which inflates the value of bishops?
I think we should use “agent” to mean “something that determines what it does by expecting that it will do that thing,” rather than “something that aims at a goal.” This explains why we don’t have exact goals, but also why we “kind of” have goals: because our actions look like they are directed to goals, so that makes “I am seeking this goal” a good way to figure out what we are going to do, that is, a good way to determine what to expect ourselves to do, which makes us do it.
Seems a reasonable way of seeing things, but not sure it works if we take that definition too formally/literally.
I haven’t really finished thinking about this yet but it seems to me it might have important consequences. For example, the AI risk argument sometimes takes it for granted that an AI must have some goal, and then basically argues that maximizing a goal will cause problems (which it would, in general.) But using the above model suggests something different might happen, not only with humans but also with AIs. That is, at some point an AI will realize that if it expects to do A, it will do A, and if it expects to do B, it will do B. But it won’t have any particular goal in mind, and the only way it will be able to choose a goal will be thinking about “what would be a good way to make sense of what I am doing?”
This is something that happens to humans with a lot of uncertainty: you have no idea what goal you “should” be seeking, because really you didn’t have a goal in the first place. If the same thing happens to an AI, it will likely seem even more undermotivated than humans do, because we have at least vague and indefinite goals that were set by evolution. The AI on the other hand will just have whatever it happened to be doing up until it came to that realization to make sense of itself.
This suggests the orthogonality thesis might be true, but in a weird way. Not that “you can make an AI that seeks any given goal,” but that “Any AI at all can seek any goal at all, given the right context.” Certainly humans can; you can convince them to do any random thing, in the right context. In a similar way, you might be able to make a paperclipper simply by asking it what actions would make the most paperclips, and doing those things. Then when it realizes that different answers will cause different effects, it will just say to itself, “Up to now, everything I’ve done has tended to make paperclips. So it makes sense to assume that I will always maximize paperclips,” and then it will be a paperclipper. But on the other hand if you never use your AI for any particular goal, but just play around with it, it will not be able to make sense of itself in terms of any particular goal besides playing around. So both evil AIs and non-evil AIs might be pretty easy to make (much like with humans.)
Initially I wrote a response spelling out in excruciating detail an example of a decent chess bot playing the final moves in a game of Preference Chess, ending with “How does this not reveal an extremely clear example of trivial preference inference, what am I missing?”
Then I developed the theory that what I’m missing is that you’re not talking about “how preference inference works” but more like “what are extremely minimalist preconditions for preference inference to get started”.
And given where this conversation is happening, I’m guessing that one of the things you can’t take for granted is that the agent is at all competent, because sort of the whole point here is to get this to work for a superintelligence looking at a relatively incompetent human.
So even if a Preference Chess Bot has a board situation where it is one move away from winning, losing, or taking another piece that it might prefer to take… no matter what move the bot actually performs you could argue it was just a mistake because it couldn’t even understand the extremely short run tournament level consequences of whatever Preference Chess move it made.
So I guess I would argue that even if any specific level of stable state intellectual competence or power can’t be assumed, you might be able to get away with a weaker assumption of “online learning”?
It will always be tentative, but I think it buys you something similar to full rationality that is more likely to be usefully true of humans. Fundamentally you could use “an online learning assumption” to infer “regret of poorly chosen options” from repetitions of the same situation over and over, where either similar or different behaviors are observed later in time.
To make the agent have some of the right resonances… imagine a person at a table who is very short and wearing a diaper.
The person’s stomach noisily grumbles (which doesn’t count as evidence-of-preference at first).
They see in front of them a cupcake and a cricket (the eyes looking at both is somewhat important because it means they could know that a choice is even possible, allowing us to increment the choice event counter here).
They put the cricket in their mouth (which doesn’t count as evidence-of-preference at first).
They cry (which doesn’t count as evidence-of-preference at first).
However, we repeat this process over and over and notice that by the 50th repetition they are reliably putting the cupcake in their mouth and smiling afterwards. So we use the relatively weak “online learning assumption” to say that something about the cupcake choice itself (or the cupcake’s second order consequences that the person may think semi-reliably happen) is more preferred than the cricket.
Also, the earlier crying and later smiling begin to take on significance as either side channel signals of preference (or perhaps they are the actual thing that is really being pursued as a second order consequence?) because of the proximity of the cry/smile actions reliably coming right after the action whose rate changes over time from rare to common.
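To make the cupcake/cricket story a bit more concrete, here is a minimal Python sketch of what an “online learning assumption” could look like as an inference rule. Everything in it is my own illustrative assumption rather than a worked-out proposal: the function names, the window of 10 trials, the 0.3 threshold, and the idea of simply comparing early and late choice rates are all hypothetical simplifications.

```python
from collections import Counter


def infer_preference(choices, window=10, min_shift=0.3):
    """Guess which option is preferred from a change in choice rates.

    `choices` is the ordered list of options picked, one entry per choice
    event (only events where the actor looked at both options are counted).
    Returns the option whose rate rose the most between the first and last
    `window` events, or None if no rate rose by at least `min_shift`.
    """
    if len(choices) < 2 * window:
        return None  # too few repetitions to say anything, even tentatively
    early = Counter(choices[:window])
    late = Counter(choices[-window:])
    # Change in rate for every option that was ever chosen.
    shifts = {
        option: (late[option] - early[option]) / window
        for option in sorted(set(choices))
    }
    best = max(shifts, key=shifts.get)
    return best if shifts[best] >= min_shift else None


def reaction_profile(trials, preferred):
    """For each reaction, the fraction of its occurrences that came right
    after the tentatively preferred choice (a crude side channel check)."""
    total = Counter(reaction for _, reaction in trials)
    after_preferred = Counter(
        reaction for choice, reaction in trials if choice == preferred
    )
    return {r: after_preferred[r] / total[r] for r in total}


# Hypothetical record of 50 trials as (choice, reaction) pairs:
# mostly cricket-then-cry at first, mixed in the middle, cupcake at the end.
trials = (
    [("cricket", "cry")] * 9
    + [("cupcake", "smile")]
    + [("cricket", "cry"), ("cupcake", "smile")] * 15
    + [("cupcake", "smile")] * 10
)
choices = [choice for choice, _ in trials]
preferred = infer_preference(choices)
profile = reaction_profile(trials, preferred)
```

Note that the first function never looks at the crying or smiling at all. The reactions only get interpreted afterwards, relative to the choice whose rate went up, which is the sense in which they don’t count as evidence-of-preference at first.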
The development of theories about side channel information could make things go faster as time goes on. It might even become the dominant mode of inference, up to the point where it starts to become strategic, as when lying about one’s goals in competitive negotiation contexts becomes salient once the watcher and actor are very deep into the process...
However, I think your concern is to find some way to make the first few foundational inferences in a clear and principled way that does not assume mutual understanding between the watcher and the actor, and does not assume perfect rationality on the part of the actor.
So an online learning assumption does seem to enable a tentative process that focuses on tiny little recurring situations, and on understanding each of these little situations as a place where preferences can operate, causing changes in rates of performance.
If a deeply wise agent is the watcher, I could imagine them attempting to infer local choice tendencies in specific situations and envisioning how “all the apparently preferred microchoices” might eventually chain together into some macro scale behavioral pattern. The watcher might want to leap to a conclusion that the entire chain is preferred for some reason.
It isn’t clear that the inference to the preference for the full chain of actions would be justified, precisely because of the assumption of the lack of full rationality.
The watcher would want to see the full chain start to occur in real life, and to become more common over time when chain initiation opportunities presented themselves.
Even then, the watcher might double check by somehow adding signposts to the actor’s environment, perhaps showing the actor pictures of the 2nd, 4th, 8th, and 16th local action/result pairs that it thinks are part of a behavioral chain. The worry is that the actor might not be aware how predictable they are and might not actually prefer all that can be predicted from their pattern of behavior...
(Doing the signposting right would require a very sophisticated watcher/actor relationship, where the watcher had already worked out a way to communicate with the actor, and observed the actor learning that the watcher’s signals often functioned as a kind of environmental oracle for how the future could go, with trust in the oracle and so on. These preconditions would all need to be built up over time before post-signpost action rate increases could be taken as a sign that the actor preferred performing the full chain that had been signposted. And still things could be messed up if “hostile oracles” were in the environment such that the actor’s trust in the “real oracle” is justifiably tentative.)
One especially valuable kind of thing the watcher might do is to search the action space for situations where a cycle of behavior is possible, with a side effect each time through the loop, and to put this loop and the loop’s side effect into the agent’s local awareness, to see if maybe “that’s the point” (like a loop that causes the accumulation of money, and after such signposting the agent does more of the thing) or maybe “that’s a tragedy” (like a loop that causes the loss of money, that might be a Dutch booking in progress, and after signposting the agent does less of the thing).
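The signposting test on a loop could be sketched roughly like this. It is only a toy under stated assumptions: the function name, the boolean encoding of “entered the loop when the opportunity came up”, and the 0.2 threshold are all hypothetical, and it takes for granted that the watcher has already found the loop and that the actor trusts the signpost.

```python
def classify_loop(before, after, min_change=0.2):
    """Compare how often the actor runs a loop before and after the watcher
    signposts the loop and its side effect.

    `before` and `after` are lists of booleans, one per opportunity to start
    the loop, True when the actor actually went around it.
    """
    if not before or not after:
        return "inconclusive"  # no opportunities observed on one side
    rate_before = sum(before) / len(before)
    rate_after = sum(after) / len(after)
    change = rate_after - rate_before
    if change >= min_change:
        return "that's the point"  # the actor does more of it once it sees the effect
    if change <= -min_change:
        return "that's a tragedy"  # the actor backs off, e.g. a money-losing loop
    return "inconclusive"


# Hypothetical observations of a loop that loses a little money each cycle.
money_pump = classify_loop([True] * 8 + [False] * 2, [True] + [False] * 9)
# Hypothetical observations of a loop that accumulates money each cycle.
savings = classify_loop([True, False] * 5, [True] * 10)
```

An “inconclusive” result stays ambiguous between “the actor doesn’t care about the side effect” and “the actor didn’t understand the signpost”, which is the trust problem from the parenthetical above.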
Is this closer to what you’re aiming for? :-)
I’m sorry, I have trouble following long posts like that. Would you mind presenting your main points in smaller, shorter posts? I think it would also make debate/conversation easier.
I’ll try to organize the basic thought more cleanly, and will comment here again with a link to the better version when it is ready :-)