that involves modeling the overseer and the other sub-AIs (each of which is also modeling the overseer and other sub-AIs) and coming up with arguments that can convince the modeled overseer/AIs, etc
Re paragraph 1:
I don’t think this is necessary. Let’s go back to the alien analogy. Let’s assume that you have an action that effectively implements U = “acquire resources; don’t be evil; leave the user in control.” I claimed that pursuing U is not much harder than acquiring resources on its own; for the moment can we grant that, since it seems orthogonal to the issues in this paragraph?
So you have this good action, and you are deciding what to do. I claim that you should just do it, with no extra computation. Then, if it gets evaluated, you will have to think about rhetorical strategies to use when convincing the overseer it’s a good idea. Also, a copy of you will have to think about how to convince the overseer that it’s a bad idea. And the overseer will need to actually think about the issue. And so on.
There are some issues that you’ll need to think about before you commit to the action. For example, if there is a knockdown argument that this action is a bad idea, then you shouldn’t do it. But that thinking seems to be a necessary part of maximizing U, not an extra step (unless the arguing system is broken).
You might be able to do even better by thinking through all of the arguments in advance, and taking an action which is not merely good but which will fare especially well during an evaluation. But if the extra computational costs of doing so are significant, this has to be weighed against the benefits of actually finding a better plan. It seems like this basically comes down to the question “If you take the very most efficient plan, and we perform some elaborate procedure for arguing, can you actually ‘convince’ the overseer that it’s a good plan?”
If the answer is “no” then you are screwed anyway, before we even start thinking about computational overhead: you simply can’t take the best plan, because you wouldn’t be able to receive a good rating if you did. And if the answer is “yes” then the problem seems to go away. The justifiability of good plans does seem to be a key question about the feasibility of this scheme, which I am happy to discuss if it’s the underlying objection you have in mind.
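To make the overhead comparison concrete, here is a toy sketch in Python. It is only an illustration under assumptions of my own, not a model of any real system: `p_eval` (the chance that a given action gets evaluated), `argue_cost` (the computation needed to work through the arguments), and `plan_gain` (how much better the plan gets if you think the arguments through in advance) are all hypothetical quantities. The only things taken from the discussion are that arguing happens "if it gets evaluated" and that extra up-front computation has to be weighed against the benefit of a better plan.

```python
def act_first_cost(p_eval: float, argue_cost: float) -> float:
    """Expected arguing cost if you just act and argue only when evaluated.

    p_eval and argue_cost are hypothetical illustration parameters.
    """
    return p_eval * argue_cost


def argue_first_cost(argue_cost: float) -> float:
    """Arguing cost if you think through all the arguments before every action."""
    return argue_cost


def worth_pre_arguing(plan_gain: float, p_eval: float, argue_cost: float) -> bool:
    """True if the (assumed) gain from a better plan exceeds the extra overhead."""
    extra_overhead = argue_first_cost(argue_cost) - act_first_cost(p_eval, argue_cost)
    return plan_gain > extra_overhead


if __name__ == "__main__":
    # Rare evaluation, expensive arguing, small gain: just act.
    print(worth_pre_arguing(plan_gain=0.5, p_eval=0.1, argue_cost=10.0))
    # Same costs, but a large gain from planning around the evaluation.
    print(worth_pre_arguing(plan_gain=20.0, p_eval=0.1, argue_cost=10.0))
```

The sketch deliberately leaves out the "no" case: if the best plan can't be justified to the overseer at all, no setting of these numbers helps, which is why justifiability looks like the real question rather than overhead.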