-- The AI isn’t trying to deceive you, but is trying to produce plans that would, if executed, have consequences X, and X is not something you want.
-- The AI is trying to produce plans that would, if executed, have consequences you want.
The first case is hopeless, and the third case is about an already aligned AI. The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you’re trying to cause not X, and generally because an AI that smart probably has inner optimizers that don’t care about this “make a plan, don’t execute plans” thing you thought you’d set up. But if, arguendo, we have a superintelligently optimized plan which doesn’t already contain, in its current description as a plan, a mindhack (e.g. by some surprising way of domaining an AI to care about producing plans but not about making anything happen), then there’s a question whether it could help to have humans think about the consequences of the plan. I thought Eliezer was answering that question “No, even in this hypothetical, pivotal acts are too complicated and can’t be understood fully in detail by humans, so you’d still have to trust the AI, so the AI has to have understood and applied a whole lot about your values in order to have any shot that the plan doesn’t have huge unpleasantly surprising consequences”, and I was questioning that.
Not a response to your actual point, but I think that hypothetical example probably doesn’t make sense (as in, making the AI not “care” doesn’t prevent it from including mindhacks in its plan).
If you have a plan that is “superintelligently optimized” for some misaligned goal, then that plan will have to take into account the effect of outputting the plan itself, and will by default contain deception or mindhacks even if the AI doesn’t in some sense “care” about executing plans.
(Or, if you set up some complicated scheme with counterfactuals so that the model ignores the effects of the plans on humans, that will make your plans less useful or inscrutable.)
The plan that produces the most paperclips is going to be one that deceives or mindhacks humans instead of one that humans wouldn’t accept in the first place.
Maybe it’s possible to use some kind of scheme that keeps the model from taking the consequences of outputting the plan itself into account. But the model more or less has to model humans reading its plan in order to write a realistic plan that humans will understand, accept, and be able to put into practice, and the plan might only work in the fake counterfactual universe, with no plan, that it was written for.
So I doubt it’s actually feasible to have any such scheme that avoids mindhacks and still produces useful plans.
I think I agree, but also, people say things like “the AI should if possible be prevented from modeling humans”, which, if possible, would imply that the hypothetical example makes more sense.
The second case might not really make sense, because deception is a convergent instrumental goal especially if the AI is trying to cause X and you’re trying to cause not X, and generally because an AI that smart probably has inner optimizers that don’t care about this “make a plan, don’t execute plans” thing you thought you’d set up.
I believe the second case is a subcase of the ELK problem. Maybe the AI isn’t trying to deceive you, and is actually doing what you asked it to do (e.g., I want to see “the diamond” on the main detector), yet the plans it produces have a consequence X that you don’t want (in the ELK example, the diamond is stolen but you see something that looks like the diamond on the main detector). The problem is: how can you tell whether the proposed plans have consequence X? Especially if you don’t even know X is a possible consequence of the plans?