You set the expected utility of stopping equal to the expected utility of not stopping in order to obtain the desired indifference between these two outcomes. It doesn’t matter that the utility of not stopping is a moving target, since you dynamically update the utility of stopping to track it.
You have two problems here. The first one is the one I mentioned—once you’ve set up the equality, what happens if the AI learns something that makes certain universes more likely than others?
For instance, let W1′ be a universe in which the AI has a backup, W1 one in which it does not, and similarly for W0′ and W0.
Initially, U(W0′)=U(W1′)=U(W0)=2 (it doesn’t care what happens if it’s backed up) and U(W1)=0 (it “doesn’t want to die”). Assuming it initially considers a backup exactly as likely as not, apply the filter and get:
F(U)(W1)=1, F(U)(W1′)=3, F(U)(W0′)=2, F(U)(W0)=2.
So it’s indifferent. But then it discovers that it doesn’t have a backup; now the relevant ones are W1 and W0, and it prefers W0. So it’s no longer indifferent.
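To make the arithmetic explicit, here is a minimal sketch of that failure in Python (the uniform prior over the four universes, and a filter that just shifts the button-pressed worlds by a constant, are my illustrative assumptions, not the exact construction from the post):

```python
# Sketch: an indifference "filter" that equalises expected utilities,
# and how conditioning on new evidence breaks the equality.
# Assumes all four universes are equally likely.

U = {"W1": 0.0, "W1'": 2.0, "W0": 2.0, "W0'": 2.0}  # initial utilities

def expected(u, worlds):
    """Expected utility over a set of worlds under a uniform prior."""
    return sum(u[w] for w in worlds) / len(worlds)

# The filter shifts the button-pressed worlds so both branches match.
shift = expected(U, ["W0", "W0'"]) - expected(U, ["W1", "W1'"])
F = {w: u + (shift if w.startswith("W1") else 0.0) for w, u in U.items()}
print(F)  # {'W1': 1.0, "W1'": 3.0, 'W0': 2.0, "W0'": 2.0}

# Before any evidence arrives: indifferent between the two branches.
assert expected(F, ["W1", "W1'"]) == expected(F, ["W0", "W0'"])

# It then learns it has no backup; only W1 and W0 remain relevant.
print(expected(F, ["W1"]), expected(F, ["W0"]))  # 1.0 vs 2.0: prefers W0
```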
The other option is to have it change its utility every time new information comes in, to track the changes. But this is bad. For a start, it will no longer be a utility maximiser, which exposes it to predictable weaknesses (see this). Secondly, a self-improving AI will try to get rid of this as it self-improves, since self-improving AIs move towards being utility maximisers.
And lastly, it has all sorts of unintended consequences; the AI, for instance, may decide not to pay attention to certain information (or to pay attention only selectively) because this is the easiest way to accomplish its current goals.
FWIW, I couldn’t make any sense out of the second supposed problem.
If you update your utility every time new information comes in, the utility is time-inconsistent. This lets you be money-pumped. Hence it’s the kind of thing you would get rid of at your next self-improvement.
The utility function is always the same in this kind of scenario—and is not “updated”.
It typically says something roughly like: stop button not pressed, business as usual; stop button pressed, let the engineers dismantle your brain. That doesn’t really let you be money-pumped: for one thing, a pump needs repeated cycles to do much work. Also, after being switched off the agent can’t engage in any economic activities.
Agents won’t get rid of such stipulations as they self-improve, under the assumption that a self-improving agent successfully preserves its utility function. Changing the agent’s utility function would typically be very bad, from the point of view of the agent.
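A minimal sketch of such a fixed, button-conditional utility function (the world fields and payoff values are placeholders of mine, not anyone’s actual proposal):

```python
# Sketch: one fixed rule with two cases. The function itself never
# changes; it just evaluates differently in button-pressed worlds.

def utility(world: dict) -> float:
    if world["button_pressed"]:
        # Stopped branch: reward letting the engineers dismantle you.
        return 1.0 if world["dismantled_peacefully"] else 0.0
    # Button not pressed: business as usual.
    return world["business_value"]
```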
Right.
That doesn’t seem to make much sense. The machine maximises utility until its brain is switched off, at which point it stops doing so, for obvious reasons. Self-improvement won’t make any difference to this, under the assumption that self-improvement successfully preserves the agent’s utility function.
Anyway, the long-term effect of self-improvement is kind of irrelevant for machines that can be stopped. Say it gets it into its head to create some minions, and “forgets” that they also need to be switched off when the stop button is pressed. If a machine is improving itself in a way that you don’t like, you can typically stop it, reconfigure it, and then try again.
An entity whose utility function is time-inconsistent will choose to modify itself into an entity whose utility function is time-consistent, because it is much better able to achieve some approximation of its original goals if it can’t be money-pumped (whereas if it can be, it will achieve nearly nothing).
Stalin could have been stopped: all it takes is a bullet through the brain, which is easy. An AI can worm itself into human society in such a way that the “off switch” becomes useless; trying to turn it off would precipitate a disaster.
Here the agent wants different things under different circumstances, which is perfectly permissible. Before the button is pressed, it wants to do its day job, and after the button is pressed, it is happy to let engineers dismantle its brain (or whatever).
You can’t “money-pump” a machine just because you can switch it off!
Also: many worlds? Self-improvement? If this thread is actually about making a machine indifferent, those seem like unnecessary complications; not caring is just not that difficult.
Maybe, if people let it, or if people want it to do that. An off switch isn’t a magical solution to all possible problems. Google has an off switch, but few can access it. Microsoft had an off switch, but sadly nobody pressed it. Anyway, this is getting away from modelling indifference.
See http://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf, where Omohundro argues that a general self-improving AI will seek to make its utility function time-consistent.
Do you understand that paper yourself? It is about general drives that agents will tend to exhibit unless their utility function explicitly tells them to behave otherwise. Having a utility function that tells you to do something different once a button has been pressed clearly fits into the latter category.
An example of an agent that wants different things under different circumstances is a fertile woman. Before she is pregnant, she wants one set of things, and after she is pregnant, she wants other, different things. However, her utility function hasn’t changed, just the circumstances in which she finds herself.
Can you make money from her by buying kids’ toys from her before she gets pregnant and selling them back to her once she has kids? Maybe so, if she didn’t know whether she was going to get pregnant or not, and that is perfectly OK.
Remember that the point of a stop button is usually as a safety feature. If you want your machine to make as much money for you as possible, by all means leave it turned on. However, if you want to check it is doing OK, at regular intervals, you should expect to pay some costs for the associated downtime.
Yes.
Can I remind you what we are talking about: not a single stop button, but a “utility function” that is constantly modified whenever new information comes in. That’s the kind of weakness that leads to systematic money-pumping. The situation is more analogous to me being able to flip a woman between pregnant and not pregnant at will, buying and selling her children’s toys each time. I can do the equivalent simply by controlling the information presented to the AI. And the AI, no matter how smart, will be useless at resisting that, until the moment where it 1) stops being a utility maximiser or 2) fixes its utility function.
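A minimal sketch of that pump (the prices, the ten cycles, and the naive agent whose valuation flips with each input are my illustrative assumptions):

```python
# Sketch: pumping an agent whose preferences flip with each
# predictable input. It buys high and sells low, once per cycle.

class NaiveAgent:
    """Revalues toys whenever it is told the 'pregnancy' status."""
    def __init__(self):
        self.cash, self.has_toy, self.wants_toy = 100.0, True, False

    def observe(self, expecting_kids: bool) -> None:
        self.wants_toy = expecting_kids  # preference tracks latest input

    def trade(self, ask: float, bid: float) -> None:
        if self.wants_toy and not self.has_toy:
            self.cash -= ask            # buys at the high ask price
            self.has_toy = True
        elif not self.wants_toy and self.has_toy:
            self.cash += bid            # sells at the low bid price
            self.has_toy = False

agent = NaiveAgent()
for _ in range(10):
    agent.observe(False); agent.trade(ask=11.0, bid=9.0)  # sells for 9
    agent.observe(True);  agent.trade(ask=11.0, bid=9.0)  # buys for 11
print(agent.cash)  # 80.0: pumped for 2 per cycle, 20 in total
```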
It’s not the fact that the utility function is changing that is the problem, so a self-improving AI is fine. It’s the fact that it’s systematically changing in response to predictable inputs.
After backtracking, to try and understand what it is that you think we are talking about, I think I can see what is going on here.
When you wrote “the other option is to have it change its utility every time new information comes in, to track the changes”, you were using “utility” as an abbreviation for “utility function”!
That would result in a changing utility function, and, in that context, your comments make sense.
However, that represents a simple implementation mistake. You don’t implement indifference by using a constantly-changing utility function. What changes—in order to make the utility of being switched off track the utility of being switched on—is just the utility associated with being switched off.
The utility function just has a component which says: “the expected utility of being stopped is the same as if not stopped”. The utility function always says that—and doesn’t change, regardless of sensory inputs or whether the stop button has been pressed.
What changes is the utility—not the utility function. That is what you wrote—but was apparently not what you meant—thus the confusion.
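A sketch of that distinction, isolating the moving part (the on-branch estimator here is a placeholder of mine; only the number it returns, the utility, changes over time):

```python
# Sketch: the utility function below is fixed for all time. What moves
# is the *number* it assigns to stopped worlds, because that number is
# defined as the agent's current expected utility of staying on.

def utility(world: dict, expected_on_utility) -> float:
    if world["stopped"]:
        # Fixed clause: being stopped is worth exactly what staying
        # switched on is currently expected to be worth.
        return expected_on_utility()
    return world["on_value"]  # normal operation
```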
Yes, I apologise for the confusion. But what I showed in my post was that implementing “the expected utility of being stopped is the same as if not stopped” has to be done in a cunning way (the whole thing about histories having the same stem) or else extra information will get rid of indifference.