Interesting! (I didn’t read the post carefully, FYI.)
the AI will assume that the human changes their mind and stops pressing the button when evaluating Vf, because that’s what the counterfactual would involve.
Unfortunately, it will also assume that the human didn’t press the button, even after the human did press the button. If B is big, it will assume the human did press the button, even if they didn’t. Maybe this is quantitatively ok, because it can be pretty confident one way or the other...??
For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not.
There’s some weirdness here with what the AI thinks of this variable “should stop”, and how it interacts with the world. Does the AI think the humans could be mistaken about this variable? It might be much more worried about the humans mistakenly pressing, or more worried about mistakenly not pressing, depending on V and B, and so distort the humans’ info.
I also don’t see why the AI doesn’t disable the shutdown button, and then observe whether the humans try to press it, and then go “Ah okay, so B is the one true utility function. Now I will crazily optimize B” and do something crazy rather than actually shutting down *because the button was pressed*.
Unfortunately, it will also assume that the human didn’t press the button, even after the human did press the button.
Hm, I don’t think so:
I guess one thing that should be mentioned is that if the AI was trained in the way described in the blog post, its policy wouldn’t use causal decision theory, at least not according to the utility function described in the post. Rather, causal decision theory would be applied on a policy-wide level; it would maximize E[U|do(policy), WM] rather than E[U|do(action), WM] (where WM denotes the world model).
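To make the policy-level optimization concrete, here’s a toy sketch (the setup, numbers, and names are all my own illustration, not the post’s): we enumerate whole policies (maps from observations to actions) and pick the one maximizing E[U|do(policy)] under the world model, rather than greedily maximizing E[U|do(action)] at each step.

```python
# Toy sketch (my own illustration, not the post's) of policy-level
# expected-utility maximization: pick a whole policy, a map from
# observations to actions, that maximizes E[U | do(policy)].
import itertools

SCENARIOS = {"B": 0.3, "V": 0.7}       # assumed prior over the latent scenario
OBS = ["press_seen", "no_press_seen"]  # what the AI can observe
ACTIONS = ["shutdown", "continue"]

# P(observation | scenario): the button press is informative but noisy.
P_OBS = {
    "B": {"press_seen": 0.9, "no_press_seen": 0.1},
    "V": {"press_seen": 0.1, "no_press_seen": 0.9},
}

# Utilities: B rewards shutting down, V rewards continuing (value v).
def utility(scenario, action, v=2.0):
    if scenario == "B":
        return 1.0 if action == "shutdown" else 0.0
    return v if action == "continue" else 0.0

def expected_utility(policy):
    """E[U | do(policy)]: average over scenarios and observations."""
    total = 0.0
    for s, p_s in SCENARIOS.items():
        for o in OBS:
            total += p_s * P_OBS[s][o] * utility(s, policy[o])
    return total

# Enumerate all deterministic policies (maps obs -> action), pick the best.
policies = [dict(zip(OBS, acts))
            for acts in itertools.product(ACTIONS, repeat=len(OBS))]
best = max(policies, key=expected_utility)
print(best)  # with these numbers: shut down on a press, continue otherwise
```

The point is only structural: the object being optimized is the whole policy, evaluated under the world model, not the individual action at runtime.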
I don’t know whether maximizing E[U|do(policy)] will result in a policy that maximizes E[U’|do(action), WM’] for some U’ and WM’. My best guess for how it’d correspond is what I hinted at in the text with “Essentially, it respects people’s “free will”, treating it as an important source of information about what it should do.”: the WM’ would contain an additional unmodifiable variable corresponding to which of the two counterfactual cases it is in, and WM’ would then claim that this variable is what determines whether people try to shut down the AI, and also whether U=B or U=V.
So if it sees a human pressing the button, the policy won’t conclude that the human didn’t press the button, but instead will conclude that it is in the U=B scenario. Though also hopefully in this scenario the AI would shut down.
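A minimal Bayesian sketch of that reading (my own toy numbers, not the post’s): seeing a press doesn’t make the policy deny the press; it shifts belief about the unmodifiable scenario variable toward U=B.

```python
# Hypothetical Bayesian reading of the "unmodifiable scenario variable":
# observing a button press updates the belief about which counterfactual
# scenario (U=B vs U=V) the AI is in, rather than being explained away.
prior_B = 0.3            # assumed prior that U = B
p_press_given_B = 0.9    # press likely if humans want shutdown
p_press_given_V = 0.1    # press unlikely otherwise

p_press = prior_B * p_press_given_B + (1 - prior_B) * p_press_given_V
posterior_B = prior_B * p_press_given_B / p_press
print(round(posterior_B, 3))  # → 0.794: the press is strong evidence of U=B
```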
Does the AI think the humans could be mistaken about this variable?
No. Though you could make a variant where it can think that, just by making the counterfactual more noisy.
I also don’t see why the AI doesn’t disable the shutdown button, and then observe whether the humans try to press it, and then go “Ah okay, so B is the one true utility function. Now I will crazily optimize B” and do something crazy rather than actually shutting down *because the button was pressed*.
B means “the button gets pressed and the AI shuts down”. So it can’t optimize B without shutting down. But yes, the AI would observe whether the humans try to press the shutdown button; that’s the point I made with “The AI will build an elaborate system to observe whether people want to press the stop button.”.
Does the AI think the humans could be mistaken about this variable?
I should have rather said: does the AI think the humans could mistakenly press the button, even when they of course correctly “know whether the AI should stop”?
So if it sees a human pressing the button, the policy won’t conclude that the human didn’t press the button, but instead will conclude that it is in the U=B scenario
I’m saying that it might see what looks like a button press… but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn’t actually pressed) is large. So the AI has incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has time to be really sure; this is correct from V_f’s perspective, which is what the AI is up to before the button is pressed.
I should have rather said: does the AI think the humans could mistakenly press the button, even when they of course correctly “know whether the AI should stop”?
Under my proposed system, where it is trained by having a separate physics and person model, and the counterfactuals then being implemented by substituting out the person model with one that wants different things, it thinks they could mistakenly press the button under certain circumstances, e.g. if they trip and fall into the button, or if aliens force them to press it, or similar. However, it doesn’t think that they could mistakenly want to press the button.
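A sketch of what I mean by that architecture (names and numbers are illustrative assumptions, not from the post): the counterfactual swaps what the person model wants, while the physics model, which allows accidental presses, stays fixed. So mistaken presses are possible, but mistaken wanting is not.

```python
# Sketch (assumed architecture) of implementing the counterfactual by
# substituting the person model while keeping the physics model fixed.
from dataclasses import dataclass
import random

@dataclass
class PersonModel:
    wants_shutdown: bool  # what the person *wants*; set only by the
                          # counterfactual substitution, never by physics

def physics_step(person: PersonModel, rng: random.Random) -> bool:
    """Return whether the button gets pressed this step.
    Accidental presses (tripping, coercion) can happen either way."""
    if rng.random() < 0.01:        # rare accident: press regardless of wants
        return True
    return person.wants_shutdown   # otherwise, presses track wants exactly

def rollout(wants_shutdown: bool, steps: int = 1000, seed: int = 0) -> int:
    """Count presses under a counterfactual where the person model is
    substituted with one that does or doesn't want shutdown."""
    rng = random.Random(seed)
    person = PersonModel(wants_shutdown=wants_shutdown)
    return sum(physics_step(person, rng) for _ in range(steps))

# Under the "doesn't want shutdown" counterfactual, only accidental
# presses occur; mistaken *wanting* never does.
print(rollout(wants_shutdown=False))  # accidental presses only
print(rollout(wants_shutdown=True))   # every step is a press
```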
I’m saying that it might see what looks like a button press… but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated, the button wasn’t actually pressed) is large. So the AI has incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has time to be really sure; this is correct from V_f’s perspective, which is what the AI is up to before the button is pressed.
You’d have to balance the sizes of the utilities against the actual costs of type 1 errors vs type 2 errors, yes.
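As a toy version of that balance (my numbers, not the post’s): the posterior belief in U=B at which shutting down starts to beat continuing depends directly on the relative sizes of V and B.

```python
# Toy calculation (illustrative, not from the post) of how the sizes of
# V and B trade off against type 1 vs type 2 errors. The AI should shut
# down once its posterior that U=B clears a threshold set by the stakes.
def shutdown_threshold(v: float, b: float) -> float:
    """Posterior P(U=B) above which shutting down maximizes expected
    utility, assuming shutdown earns b when U=B and continuing earns v
    when U=V (0 otherwise). Indifference point: p*b = (1-p)*v."""
    return v / (v + b)

print(shutdown_threshold(v=1.0, b=1.0))   # → 0.5: symmetric stakes
print(shutdown_threshold(v=10.0, b=1.0))  # ~0.909: a large V makes the AI
                                          # demand near-certainty first
```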
However, it doesn’t think that they could mistakenly want to press the button.
Okay. (Seems fine to assume that this makes sense arguendo, since the problem is hard anyway, but worth keeping in mind that this is a vague point in the proposal. In particular, for proposals like this to be progress, ISTM it has to be the case that “the human wants to press the button” is a simpler / easier / more concrete / more reliable / more specifiable thing for us to believe the AI can know than other instances of “the human wants X”. Which seems plausible, but unclear to me.)
There’s also something weird being assumed, about it making sense to define utility functions that only care about some counterfactual worlds. (I mean this is a reasonable assumption that people make, but it seems weird in general.) Like, this seems in tension with acausal bargaining / threats. If V_f wants V, doesn’t it want what V says is good, and V might have opinions about other worlds (for example: “there shouldn’t be torture, anywhere, even in counterfactual worlds”), and so optimizing for V_f optimizes even worlds where not-f?
If V has counterfactuals that cancel out the f in Vf, then I could see the results getting pretty funky, yes. But I’m imagining that V limits itself to counterfactuals that don’t cancel out the f.