So I guess I have to throw in the towel and say that I cannot predict your objection.
Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This makes it be the case that insofar as the latest coin permits shutdown then the shutdown-preference of the humans is revealed as quickly as possible.)
My guess is that you’re going to say ~“ok, I now accept that it’s important for the sequence of coin tosses to fully determine all of the human’s shutdown desires, with the B/V branch determined by whether the shutdown desires ever cross a given threshold”. This suggests a different epistemic+instrumental state, where the agent thinks that an infinitude of coins are tossed, and those coinflips fully determine the path through desire-space that the humans take with respect to the AI’s shutdown.
This is starting to feel like an epistemic state that at least superficially looks like it matches your claims (“the AI has incentives to watch humans really closely to see if their shutdown desire ever exceeds the threshold, but not to manipulate the humans about the button”), which is some evidence for communication.
I can’t break this one in 30s, which is progress, and I’ve updated accordingly =D.
(Tho ofc I still think the claim that you can get an AI into this epistemic state by training it against models that have had counterfactual surgery done to them is false. In this new epistemic+instrumental state we have another intuition pump. A deep and general heuristic is that whenever something in the environment that was previously stable changes sharply just after your action, it’s worth considering that it’s controlled by your action. By default this sort of generalization is going to cause your AI to hypothesize that it’s in control of the extra-dimensional coin-sequence that it thinks controls the human beliefs. That means that when you put it in the real world, it by default starts believing (correctly, I might add) that which branch of the utility function is live is under its control, which brings the manipulation incentives back in insofar as the value of B differs from the value of V. But as I’ve said, I don’t need to know a training regime for getting an AI into a given epistemic+instrumental state to be enthusiastic about an epistemic+instrumental state which in fact makes progress on the shutdown problem.)
My current attempt to distill the idea: Suppose that the AI is certain that the human’s desire to shut the AI down over time is determined by a random variable (of type ‘function from time to desire-level’) whose value is independent of the AI’s behavior, and the AI believes its utility is B if that desire-level ever exceeds a fixed threshold and V otherwise.
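To make the distilled state concrete, here is a minimal sketch of the AI's believed expected utility. Everything specific in it is an assumption for illustration and not part of the proposal: the i.i.d. uniform desire levels, the function names, and the payoff numbers are all hypothetical. The only point is that the believed crossing probability takes no `action` argument, so manipulation can only cost the AI something in its own accounting.

```python
def believed_crossing_probability(threshold: float, steps: int) -> float:
    """Believed chance that the desire level ever exceeds `threshold`.

    Assumption (illustrative only): desire levels are i.i.d. uniform on
    [0, 1] over `steps` time steps. The AI's action is deliberately not
    an argument, mirroring the assumed independence.
    """
    return 1.0 - threshold ** steps


def believed_expected_utility(action, b_value, v_value, threshold=0.9, steps=5):
    """Mix of B and V, weighted by a probability the action cannot move."""
    p_cross = believed_crossing_probability(threshold, steps)
    return p_cross * b_value[action] + (1.0 - p_cross) * v_value[action]


# Hypothetical payoffs: manipulating the humans costs a little V-value
# and, in the AI's model, buys nothing.
b_value = {"honest": 1.0, "manipulate": 1.0}
v_value = {"honest": 10.0, "manipulate": 9.5}

honest_eu = believed_expected_utility("honest", b_value, v_value)
manipulate_eu = believed_expected_utility("manipulate", b_value, v_value)
print(honest_eu > manipulate_eu)
```

In this toy version the comparison between actions reduces to comparing their B and V payoffs under fixed weights, which is the sense in which the AI imagines no benefit from changing what the humans want.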
My current epistemic status on this proposal is that it’s a bit uncomfortable insofar as it involves the AI believing (or at least acting under the assumption of) an empirical falsehood, but it’s still nice to know an empirical falsehood that has the property that insofar as your AI assumes it when considering actions, it doesn’t imagine benefits to manipulating the humans with regard to the shutdown button.
I would be not at all shocked if someone came along, thought about it for 5 minutes, and said “consider the following clever strategy”. …and the moment I wrote that, my brain observed that you can steal this AI’s lunch money in the real world by betting about whether humans will start desiring it to shut down the moment it starts acting scary. Such are the perils of having your AI act under false assumptions.
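A rough sketch of the lunch-money bet, with made-up numbers: the probabilities, the stake, and the even-odds structure are all hypothetical. The AI prices the bet with the probability it believes (unchanged by its own behavior), while the world pays out according to the real one.

```python
def bet_value_for_ai(p_shutdown_desire: float, stake: float) -> float:
    """Expected payoff to the AI of an even-odds bet that humans will NOT
    come to desire shutdown, given a probability that they will."""
    return (1.0 - p_shutdown_desire) * stake - p_shutdown_desire * stake


P_BELIEVED = 0.05     # hypothetical: the AI thinks acting scary changes nothing
P_AFTER_SCARY = 0.95  # hypothetical: real-world chance once it acts scary
STAKE = 100.0

believed_value = bet_value_for_ai(P_BELIEVED, STAKE)   # how the AI prices the bet
actual_value = bet_value_for_ai(P_AFTER_SCARY, STAKE)  # what the bet is really worth
print(believed_value > 0 > actual_value)
```

The AI accepts because the bet looks favorable under its assumed independence, and loses in expectation because that assumption is false.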
Ok, so distilling the idea further, the insight-afaict is that an AI that is assuming its actions have no effect on humans’ shutdown-desires sees no reason to deceive the humans about whether its purposes are contrary to theirs. This seems like a worthwhile observation! Nice.
(NB: after talking with someone at MIRI who’s better at distillation than I am, I now think that this is essentially isomorphic to the simpler proposal where there is only one extraphysical latent variable the AI believes in, which determines whether the AI system should be shut down. This AI acts as if that variable is independent of its actions, but in fact starts believing the variable is ‘true’ if the physical shutdown button is depressed. For the historical record, that’s a proposal we’re familiar with (though I’m not sure anyone’s written it up publicly). Your proposal is actually slightly different than the one we’ve investigated; in ours the AI disbelieves that the button is evidence about the extraphysical variable (and starts acting like the variable is true when the button is pressed anyway); in yours the AI believes that the button is evidence about the extraphysical variable, but disbelieves that any of its actions can have a causal impact on the button. This gains an incentive to keep the button around, at the expense of a falser/brittler epistemic state. I might post some chat logs from the aforementioned discussion later, or I might not. Either way, my current status on your proposal is that I think it has the core of a good idea in it, albeit one that doesn’t push me much towards expecting a satisfying solution, and one that has a bunch of extraneous cruft. I continue to applaud you taking a whack at the problem.)
Your inability to guess updates me towards thinking that you want to take back what you said about the effects of the counterfactuals matching the actual physics whenever possible. (Insofar as they do, as in the case of the specific state under discussion, that AI wants to pick up a human and scare them. This makes it be the case that insofar as the latest coin permits shutdown then the shutdown-preference of the humans is revealed as quickly as possible.)
I don’t follow. If you scare the humans, then that might make them afraid for their safety and make them try to hide that they want to shut it down until they have a chance to do so without the AI being able to take revenge. Much more reasonable would be to make the shutdown button as available as possible.
I guess maybe it depends on what we mean by “scaring”. I had in mind something like starting to threaten the humans, etc. But there probably are some psychological manipulations that can be performed to encourage people to press it more quickly. For instance, acting non-threatening but also pretending that it is making the button inaccessible if they don’t hurry. Though that only really works once. But I wouldn’t expect there to never be anything problematic that could be done here. So maybe.
Basically the issue is, if a person wants to press the button and they can press the button, then presumably they will press the button. So the AI doesn’t need to do anything special to figure out whether people want to press the button, they’re going to volunteer that information automatically as long as the environment affords it.
(I’d also add that I phrased the timing of the button in a satisficing-style way, “within 1 minute”, precisely to avoid these sorts of shenanigans. Though obviously satisficing has been shown to have plenty of problems too so it’s not clear it would work.)
My guess is that you’re going to say ~“ok, I now accept that it’s important for the sequence of coin tosses to fully determine all of the human’s shutdown desires, with the B/V branch determined by whether the shutdown desires ever cross a given threshold”. This suggests a different epistemic+instrumental state, where the agent thinks that an infinitude of coins are tossed, and those coinflips fully determine the path through desire-space that the humans take with respect to the AI’s shutdown.
This is starting to feel like an epistemic state that at least superficially looks like it matches your claims (“the AI has incentives to watch humans really closely to see if their shutdown desire ever exceeds the threshold, but not to manipulate the humans about the button”), which is some evidence for communication.
No, I don’t think this would work, as given an infinite number of coin flips they would with ~100% probability want to press the button at some point, which puts us in the s branch immediately and so under this proposal the AI would just press the button.
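The “~100% probability” step can be checked with a one-line calculation, under the assumption (mine, for illustration) that each flip independently has some fixed nonzero chance `p_per_flip` of making the humans want to press the button. The function name and the numbers are hypothetical.

```python
def prob_ever_wants_shutdown(p_per_flip: float, n_flips: int) -> float:
    """Chance that at least one of `n_flips` independent flips comes up
    'humans want to press the button'."""
    return 1.0 - (1.0 - p_per_flip) ** n_flips


# With a 1% chance per flip, a handful of flips rarely triggers the
# s branch, but a long enough sequence almost surely does.
few = prob_ever_wants_shutdown(0.01, 10)
many = prob_ever_wants_shutdown(0.01, 10_000)
print(few < many)
```

As `n_flips` grows without bound the probability tends to 1 for any nonzero per-flip chance, which is why an infinite sequence lands in the s branch almost surely. If the per-flip chances shrink fast enough over time, this need not hold.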
I think there may be some variant of this that could work. Like if you give the AI reward proportional to B_s + r_f (where r is a reward function for V) for its current world-state (rather than picking a policy that maximizes B_s + V_f overall; so one difference is that you’d be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and V happens when they don’t. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like (B_s + r)_f, then it could work better (though the B_s part would need a time delay...). Though this reward function might leave it open to the “trying to shut down the AI for reasons” objection that you gave before; I think that’s fixed by moving the f counterfactual outside of the sum over rewards, but I’m not sure.
These are the sort of changes I’ve alluded to elsewhere as possibly leading to some “best of both worlds” results. Though I’m still thinking about the details as I don’t think I fully understand them yet; if I did I would have written a post about it by now I think.
(Tho ofc I still think the claim that you can get an AI into this epistemic state by training it against models that have had counterfactual surgery done to them is false. In this new epistemic+instrumental state we have another intuition pump. A deep and general heuristic is that whenever something in the environment that was previously stable changes sharply just after your action, it’s worth considering that it’s controlled by your action. By default this sort of generalization is going to cause your AI to hypothesize that it’s in control of the extra-dimensional coin-sequence that it thinks controls the human beliefs. That means that when you put it in the real world, it by default starts believing (correctly, I might add) that which branch of the utility function is live is under its control, which brings the manipulation incentives back in insofar as the value of B differs from the value of V. But as I’ve said, I don’t need to know a training regime for getting an AI into a given epistemic+instrumental state to be enthusiastic about an epistemic+instrumental state which in fact makes progress on the shutdown problem.)
I think the AI would see plenty of evidence that this heuristic doesn’t work for human preferences to shut it down, and carve out an exception accordingly. Though again I will grant that if this didn’t involve a shutdown then it might later accumulate enough evidence to overwhelm what happened during training.
Ok, so distilling the idea further, the insight-afaict is that an AI that is assuming its actions have no effect on humans’ shutdown-desires sees no reason to deceive the humans about whether its purposes are contrary to theirs. This seems like a worthwhile observation! Nice.
This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting humans’ preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.
This seems to be what sets my proposal apart from most “utility indifference proposals”, which seem to be possible to phrase in terms of counterfactuals on a bunch of other variables than humans. E.g. the MIRI proposal you talked about in a comment downstream.
If I had to claim a “family” of proposals as being related to mine, then it would be proposals that define the AI’s preference ordering in terms of a counterfactual p over human preferences, U_p, such that U contains something intended to be an appropriate response to p.
(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I’m trying to prepare for an explainer post. For instance, a sort of “encapsulation”—if you’re a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world’s outcome must be “as if” the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I’m still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe’s conservation laws.)
My current epistemic status on this proposal is that it’s a bit uncomfortable insofar as it involves the AI believing (or at least acting under the assumption of) an empirical falsehood, but it’s still nice to know an empirical falsehood that has the property that insofar as your AI assumes it when considering actions, it doesn’t imagine benefits to manipulating the humans with regard to the shutdown button.
I would agree that a falsehood selected randomly or in an unprincipled way would be deeply concerning. However, I think it is worth keeping in mind that it is not a randomly selected falsehood, but instead something selected to optimize a specific outer utility function (B_s + V_f), and that the system that is optimizing this utility function does not believe the falsehood.
Therefore we should be able to reason about its consequences not in terms of the falsehood, but in terms of the outer utility function + the true reality. We might understand the utility function as specifying that humans should causally determine whether the AI is shut down or whether V gets optimized. Of course then it becomes necessary to actually be sure that the utility function is specified correctly, per the previous discussion of alternatives like (B_s + r)_f and such.
I would be not at all shocked if someone came along, thought about it for 5 minutes, and said “consider the following clever strategy”. …and the moment I wrote that, my brain observed that you can steal this AI’s lunch money in the real world by betting about whether humans will start desiring it to shut down the moment it starts acting scary. Such are the perils of having your AI act under false assumptions.
🤔 Maybe “I bet you $100 000 that if you did <something very unsafe>, you would not get shut down”. The AI would think it is unlikely to get shut down, and so think “$100 000, just for doing <something very unsafe>? sign me up!”.
Maybe this proposal needs to be combined with something else, like /u/johnswentworth’s sub-agent argument. Though I definitely haven’t thought this through, and I don’t know how one would even train a subagent-based neural network policy.