Perhaps I’m missing something, but it seems like “agent H” has nothing to do with an actual human, and that the algorithm and environment as given support even less analogy to a human than a thermostat.
Thus, proofs about such a system are of almost no relevance to moral philosophy or agent alignment research?
Thermostats connected to heating and/or cooling systems are my go-to first example for asking people where they intuitively start to perceive agency or goal-seeking behavior. I like using thermostats as the starting point because:
Their operation has clear connections to negative feedback loops and thus obvious “goals” because they try to lower the temperature when it is too hot and try to raise the temperature when it is too cold.
They have internally represented goals, because their set points can be changed by exogenous-to-the-model factors, which changes their behavior in response to otherwise identical circumstances. Put two of them near each other with non-overlapping target ranges and they automatically fight, without any need for complex philosophy.
They have a natural measure of “optimization strength” in the form of the wattage of their heating and cooling systems, which can be adequate or inadequate relative to changes in the ambient temperature.
They require a working measurement component that detects ambient temperature, giving a very limited analogy for “perception and world modeling”. If two thermostats are in a fight, a “weak but fast” thermostat can use a faster sampling rate to get a head start on a “slower but stronger” thermostat that put the temperature where it wanted and then rested for 20 minutes before measuring again. This would predictably give a cycle of temporary small victories for the fast one that turn into wrestling matches it always loses, over and over (a tiny simulation sketch of this follows below).
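Here is a minimal sketch of that fight in Python; the targets, correction strengths, sampling gaps, and room model are made-up illustrative numbers, not anything about real hardware.

```python
# Toy model of two thermostats "fighting" over one room, as described above.
# All numbers (targets, correction strengths, sampling gaps) are illustrative assumptions.

def nudge(temp, target, strength):
    """Move temp toward target by at most `strength` degrees (one burst of HVAC work)."""
    error = target - temp
    step = max(-strength, min(strength, error))
    return temp + step

room = 21.0
history = []

for minute in range(80):
    # "Weak but fast": measures every minute, but can only shift the room a little each time.
    room = nudge(room, target=18.0, strength=0.3)
    # "Slow but strong": naps 20 minutes between measurements, then jumps straight to its goal.
    if minute % 20 == 0:
        room = nudge(room, target=24.0, strength=10.0)
    history.append(round(room, 1))

print(history)
# Expected pattern: the fast thermostat grinds the temperature down toward 18 during each nap,
# then the strong one wakes up and yanks it back to 24 -- small victories, repeated losses.
```

The only design point that matters here is that “optimization strength” and “sampling rate” are separate knobs, which is exactly what creates the small-victory/big-loss cycle.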
I personally bite the bullet and grant that thermostats are (extremely minimal) agents with (extremely limited) internal experiences, but I find that most people I talk about this with do not feel comfortable admitting that these might be “any kind of agent”.
Yet the thermostat clearly has more going on than “agent H” in your setup.
A lot of people I talk with about this are more comfortable with a basic chess bot architecture than a thermostat, when talking about the mechanics of agency, because:
Chess bots consider more than simple binary actions.
Chess bots generate iterated tree-like models of the world and perform the action that seems likely to produce the most preferred expected long term consequence.
Chess bots prune possible futures such that they try not to do things that hostile players could exploit now or in the iterated future, demonstrating a limited but pragmatically meaningful theory of mind.
Personally, I’m pretty comfortable saying that chess bots are also agents, and they are simply a different kind of agent than a thermostat, and they aren’t even strictly “better” than thermostats because thermostats have a leg up on them in having a usefully modifiable internal representation of their goals, which most chess bots lack!
An interesting puzzle might be how to keep much of the machinery of chess, but vary the agents during the course of their training and development so that they have skillful behavioral dynamics while different chess bots’ skills are organized around different preference hierarchies: for example, most preferring to checkmate the opponent while still holding both bishops, lower down the hierarchy preferring to be checkmated while retaining both bishops, and even further down losing a bishop and also being checkmated. (A sketch of such a lexicographic evaluation appears below.)
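One way to cash that out, as a minimal sketch rather than a definitive implementation: score positions with a tuple that is compared lexicographically, so bishop retention dominates the ordinary win/draw/loss result, and feed that into a plain minimax search. The `Board` methods used here (`legal_moves`, `apply`, `is_terminal`, `bishops_kept_by`, `outcome_for`) are hypothetical placeholders for whatever chess library would actually be used, and the depth is arbitrary.

```python
# Sketch of a "preference chess" bot: keeping both bishops outranks the ordinary result.
# Board and its methods are hypothetical placeholders, not a real chess library API.

def evaluate(board, player):
    # Lexicographic score: (bishops still on the board for `player`, game result for `player`),
    # where result is +1 win, 0 draw/ongoing, -1 loss. Python compares tuples left to right,
    # so bishop retention dominates the win/draw/loss component.
    return (board.bishops_kept_by(player), board.outcome_for(player))

def minimax(board, player, depth, maximizing):
    if depth == 0 or board.is_terminal():
        return evaluate(board, player)
    scores = [minimax(board.apply(m), player, depth - 1, not maximizing)
              for m in board.legal_moves()]
    return max(scores) if maximizing else min(scores)

def best_move(board, player, depth=3):
    # Simplifying assumption: the opponent is treated as purely adversarial toward us,
    # even though in the tournament described below they actually optimize their own preferences.
    return max(board.legal_moves(),
               key=lambda m: minimax(board.apply(m), player, depth - 1, False))
```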
Imagine a tournament of 100 chess bots where the rules of chess are identical for everyone, but some of the players are in some sense “competing in different games” due to a higher level goal of beating the chess bots that have the same preferences as them. So there might be bishop keepers, bishop hunters, queen keepers, queen hunters, etc.
Part of the tournament rules is that it would not be public knowledge who is in which group (though the parameters of knowledge could be an experimental parameter).
And in a tournament like that I’m pretty sure that any extremely competitive bishop-keeping chess bot would find it very valuable to be able to guess, from observation of the opponent’s early moves, that in a specific game it might be playing a rook-hunting chess bot that would rather capture a rook and then be checkmated than officially “tie the game” without ever capturing a rook.
In a tournament like this, keeping your true preferences secret and inferring your opponent’s true preferences would both be somewhat useful.
Some overlap in the game should always exist (like preferring win > tie > lose, all else equal), so competition on that dimension would always be present.
Then if AgentAlice knows AgentBob’s true preferences she can probably see deeper into the game tree than she otherwise could, by safely pruning more lines of play out of the tree, and so have a better chance of winning. On the other hand, mutual revelation of preferences might allow gains from trade, so it isn’t instantly clear when to reveal preferences and when to keep them cryptic... (a toy sketch of inferring a hidden preference type from observed moves follows below).
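The inference half of that could start out as something as simple as a Bayesian update over a small set of candidate preference types; the types, prior, and per-type move likelihoods below are made-up illustrative numbers, not output of any real engine.

```python
# Toy Bayesian update over an opponent's hidden preference type.
# Types, prior, and per-type move likelihoods are made-up illustrative numbers.

def update_posterior(prior, likelihoods):
    """prior: {type: P(type)}; likelihoods: {type: P(observed move | type)}."""
    unnormalized = {t: prior[t] * likelihoods[t] for t in prior}
    total = sum(unnormalized.values())
    return {t: p / total for t, p in unnormalized.items()}

# Start agnostic between three hypothetical player types.
posterior = {"bishop_keeper": 1 / 3, "rook_hunter": 1 / 3, "plain_winner": 1 / 3}

# Observed: an early move that chases our rook at some cost in tempo.
# A rook hunter plays such moves often, a plain winner rarely, a bishop keeper occasionally.
move_likelihoods = {"bishop_keeper": 0.15, "rook_hunter": 0.60, "plain_winner": 0.05}

posterior = update_posterior(posterior, move_likelihoods)
print(posterior)  # probability mass shifts sharply toward "rook_hunter" after one telling move
```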
Also, probably chess is more complicated than is conceptually necessary. Qubic (basically tic tac toe on a 4x4x4 grid) probably has enough steps and content to allow room for variations in strategy (liking to have played in corners, or whatever) so that the “preference” aspects could hopefully dominate the effort put into it rather than demanding extensive and subtle knowledge of chess.
Since Qubic was solved at least as early as 1992, it should probably be easier to prove things about “Qubic with preferences” using the old proofs as a starting point. Also, it is probably a good idea to keep in mind which Qubic preferences are instrumentally entailed by the pursuit of basic winning, so that preferences inside and outside those bounds get different logical treatment :-)
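To make the “small but not trivial” point concrete, here is a standard enumeration (nothing specific to the preference idea) of the winning lines on the 4x4x4 Qubic board:

```python
# Enumerate all 4-in-a-row winning lines on a 4x4x4 Qubic board.
from itertools import product

SIZE = 4
cells = list(product(range(SIZE), repeat=3))
directions = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

lines = set()
for (x, y, z) in cells:
    for (dx, dy, dz) in directions:
        line = [(x + i * dx, y + i * dy, z + i * dz) for i in range(SIZE)]
        if all(0 <= coord < SIZE for cell in line for coord in cell):
            lines.add(frozenset(line))  # frozenset dedupes the same line seen from both ends

print(len(lines))  # 76 winning lines over 64 cells
```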
Thanks! But H is used as an example, not a proof.
And the chessbots actually illustrate my point—is a bishop-retaining chessbot actually intending to retain their bishop, or is it an agent that wants to win, but has a bad programming job which inflates the value of bishops?
I think we should use “agent” to mean “something that determines what it does by expecting that it will do that thing,” rather than “something that aims at a goal.” This explains why we don’t have exact goals, but also why we “kind of” have goals: because our actions look like they are directed to goals, so that makes “I am seeking this goal” a good way to figure out what we are going to do, that is, a good way to determine what to expect ourselves to do, which makes us do it.
Seems a reasonable way of seeing things, but not sure it works if we take that definition too formally/literally.
I haven’t really finished thinking about this yet but it seems to me it might have important consequences. For example, the AI risk argument sometimes takes it for granted that an AI must have some goal, and then basically argues that maximizing a goal will cause problems (which it would, in general.) But using the above model suggests something different might happen, not only with humans but also with AIs. That is, at some point an AI will realize that if it expects to do A, it will do A, and if it expects to do B, it will do B. But it won’t have any particular goal in mind, and the only way it will be able to choose a goal will be thinking about “what would be a good way to make sense of what I am doing?”
This is something that happens to humans with a lot of uncertainty: you have no idea what goal you “should” be seeking, because really you didn’t have a goal in the first place. If the same thing happens to an AI, it will likely seem even more undermotivated than humans do, because we have at least vague and indefinite goals that were set by evolution. The AI on the other hand will just have whatever it happened to be doing up until it came to that realization to make sense of itself.
This suggests the orthogonality thesis might be true, but in a weird way. Not that “you can make an AI that seeks any given goal,” but that “Any AI at all can seek any goal at all, given the right context.” Certainly humans can; you can convince them to do any random thing, in the right context. In a similar way, you might be able to make a paperclipper simply by asking it what actions would make the most paperclips, and doing those things. Then when it realizes that different answers will cause different effects, it will just say to itself, “Up to now, everything I’ve done has tended to make paperclips. So it makes sense to assume that I will always maximize paperclips,” and then it will be a paperclipper. But on the other hand if you never use your AI for any particular goal, but just play around with it, it will not be able to make sense of itself in terms of any particular goal besides playing around. So both evil AIs and non-evil AIs might be pretty easy to make (much like with humans.)
Initially I wrote a response spelling out in excruciating detail an example of a decent chess bot playing the final moves in a game of Preference Chess, ending with “How does this not reveal an extremely clear example of trivial preference inference, what am I missing?”
Then I developed the theory that what I’m missing is that you’re not talking about “how preference inference works” but more like “what are extremely minimalist preconditions for preference inference to get started”.
And given where this conversation is happening, I’m guessing that one of the things you can’t take for granted is that the agent is at all competent, because sort of the whole point here is to get this to work for a superintelligence looking at a relatively incompetent human.
So even if a Preference Chess Bot has a board situation where it is one move away from winning, losing, or taking another piece that it might prefer to take… no matter what move the bot actually performs, you could argue it was just a mistake, because it couldn’t even understand the extremely short-run, tournament-level consequences of whatever Preference Chess move it made.
So I guess I would argue that even if no specific, stable level of intellectual competence or power can be assumed, you might be able to get away with a weaker assumption of “online learning”?
It will always be tentative, but I think it buys you something similar to what full rationality buys you, while being more likely to be usefully true of humans. Fundamentally, you could use an “online learning assumption” to infer “regret of poorly chosen options” from repetitions of the same situation over and over, where either similar or different behaviors are observed later in time.
To make the agent have some of the right resonances… imagine a person at a table who is very short and wearing a diaper.
The person’s stomach noisily grumbles (which doesn’t count as evidence-of-preference at first).
They see in front of them a cupcake and a cricket (their looking at both is somewhat important, because it means they could know that a choice is even possible, allowing us to increment the choice-event counter here).
They put the cricket in their mouth (which doesn’t count as evidence-of-preference at first).
They cry (which doesn’t count as evidence-of-preference at first).
However, we repeat this process over and over and notice that by the 50th repetition they are reliably putting the cupcake in their mouth and smiling afterwards. So we use the relatively weak “online learning assumption” to say that something about the cupcake choice itself (or about the cupcake’s second-order consequences that the person may believe semi-reliably follow) is preferred over the cricket.
Also, the earlier crying and later smiling begin to take on significance, either as side-channel signals of preference (or perhaps as the actual thing that is really being pursued as a second-order consequence?), because the cry/smile actions reliably come right after the choice whose rate changes over time from rare to common.
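A minimal sketch of that inference step, with a made-up trial record and an arbitrary threshold: compare how often the cupcake is chosen early versus late in the run of repeated choice situations, and treat a large rise in its choice rate as tentative evidence of preference.

```python
# Toy "online learning assumption" inference: if an option's choice rate rises across
# repetitions of the same situation, tentatively infer it is preferred over the alternative.
# The simulated trial record and the threshold are illustrative assumptions.

def choice_rate(trials, option):
    return sum(1 for t in trials if t == option) / len(trials)

def infer_preference(trials, a="cupcake", b="cricket", min_shift=0.3):
    half = len(trials) // 2
    early, late = trials[:half], trials[half:]
    shift = choice_rate(late, a) - choice_rate(early, a)
    if shift >= min_shift:
        return f"tentatively: {a} preferred over {b} (choice rate rose by {shift:.0%})"
    if -shift >= min_shift:
        return f"tentatively: {b} preferred over {a} (choice rate rose by {abs(shift):.0%})"
    return "no clear rate change; withhold judgment"

# 50 repetitions: mostly cricket early on, almost always cupcake by the end.
trials = ["cricket"] * 20 + ["cupcake"] * 5 + ["cupcake"] * 23 + ["cricket"] * 2
print(infer_preference(trials))
```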
The development of theories about side-channel information could make things go faster as time goes on. It might even become the dominant mode of inference, up to the point where it starts to become strategic, as when lying about one’s goals in competitive negotiation contexts becomes salient once the watcher and actor are very deep into the process...
However, I think your concern is to find some way to make the first few foundational inferences in a clear and principled way that does not assume mutual understanding between the watcher and the actor, and does not assume perfect rationality on the part of the actor.
So an online learning assumption does seem to enable a tentative process that focuses on tiny recurring situations, and on understanding each of these little situations as a place where preferences can operate, causing changes in rates of performance.
If a deeply wise agent is the watcher, I could imagine them attempting to infer local choice tendencies in specific situations and envisioning how “all the apparently preferred microchoices” might eventually chain together into some macro scale behavioral pattern. The watcher might want to leap to a conclusion that the entire chain is preferred for some reason.
It isn’t clear that the inference to the preference for the full chain of actions would be justified, precisely because of the assumption of the lack of full rationality.
The watcher would want to see the full chain start to occur in real life, and to become more common over time when chain initiation opportunities presented themselves.
Even then, the watcher might double-check by somehow adding signposts to the actor’s environment, perhaps showing the actor pictures of the 2nd, 4th, 8th, and 16th local action/result pairs that it thinks are part of a behavioral chain. The worry is that the actor might not be aware how predictable they are, and might not actually prefer all that can be predicted from their pattern of behavior...
(Doing the signposting right would require a very sophisticated watcher/actor relationship, where the watcher had already worked out a way to communicate with the actor, and observed the actor learning that the watcher’s signals often functioned as a kind of environmental oracle for how the future could go, with trust in the oracle and so on. These preconditions would all need to be built up over time before post-signpost action rate increases could be taken as a sign that the actor preferred performing the full chain that had been signposted. And still things could be messed up if “hostile oracles” were in the environment such that the actor’s trust in the “real oracle” is justifiably tentative.)
One especially valuable kind of thing the watcher might do is to search the action space for situations where a cycle of behavior is possible, with a side effect each time through the loop, and to put this loop and the loop’s side effect into the agent’s local awareness, to see if maybe “that’s the point” (like a loop that causes the accumulation of money, where after such signposting the agent does more of the thing) or maybe “that’s a tragedy” (like a loop that causes the loss of money, which might be a Dutch booking in progress, where after signposting the agent does less of the thing).
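As a minimal sketch of that loop-hunting step (the action-log format, the money side effect, and the numbers are illustrative assumptions): scan an observed action history for a repeating cycle and report whether each pass through it gains or loses the resource.

```python
# Toy detector for the "loop with a side effect" idea above: find the shortest repeating
# cycle of actions at the end of an observed history and sum its per-pass money side effect.
# The log format and numbers are illustrative assumptions.

def find_cycle(actions, max_len=5):
    """Return the shortest action cycle that repeats back-to-back at the end of the log."""
    for length in range(1, max_len + 1):
        if len(actions) >= 2 * length and actions[-length:] == actions[-2 * length:-length]:
            return actions[-length:]
    return None

# Each entry: (action name, change in the actor's money caused by that action).
log = [("browse", 0), ("bet", -5), ("collect", +3), ("bet", -5), ("collect", +3)]

cycle = find_cycle([action for action, _ in log])
if cycle:
    per_pass = sum(delta for _, delta in log[-len(cycle):])
    verdict = "maybe that's the point" if per_pass > 0 else "maybe that's a tragedy (Dutch book?)"
    print(f"cycle {cycle} changes money by {per_pass} per pass -- {verdict}")
```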
Is this closer to what you’re aiming for? :-)
I’m sorry, I have trouble following long posts like that. Would you mind presenting your main points in smaller, shorter posts? I think it would also make debate/conversation easier.
I’ll try to organize the basic thought more cleanly, and will comment here again with a link to the better version when it is ready :-)