I just got struck by an idea that seems too obvious, too naive, to possibly be true, and which horrified me, causing my brain to throw a huge batch of rationalizations at it to stop me from believing something so obviously low status. I’m currently very undecided, but since it seems like the kind of thing I can’t handle on my own, I’ll just leave a transcript of my uncensored internal monologue here:
What volition do I want to extrapolate?
MY own, tautologically.
But Eliezer, the great leader who is way, way, way smarter than you, said you shouldn’t, and that thinking that was evil!
He also said you shouldn’t just trust him like a great leader and reason from authority like that.
But you want to maximize the CEV of humanity!
In that case, it doesn’t matter which is used, because they are identical. And I might be wrong about them being identical, in which case I want my real preferences used; and while concluding that with THIS brain would not be safe, for a CEV superintelligence it would be.
Others will try to stop you!
UDT. Their volitions would be taken into account in proportion to their power to stop me, such that, by this same reasoning, they would be better off helping me.
If everyone does that, you’ll be worse off!
Again, UDT. Whatever CEV gets implemented in the end will take into account all arguments of this sort and modify itself into whatever I SHOULD have made in the first place.
You are an evil bad poopyhead!
If I am evil, I want to believe I am evil, and if I am nice, I want to believe I am nice. Or maybe I just want to believe I’m nice regardless but have the AI implement my evil preferences anyway.
Please note that I do not endorse my every thought, and I will probably regret posting this in the morning. As you can see, I’m too tired to even correct this obvious contradiction in my beliefs, and too tired to care. I know that I believe every statement is true, because I believe I believe a contradiction, and I believe contradictions imply that all statements are true. Or to spell properly.
Leaving all the in-group/out-group anxiety aside, and assuming I were actually in a position to choose whose volition to extrapolate, there are three options:
…humanity’s extrapolated volition (HEV) is inconsistent with mine (in which case I get less of what I want by using humanity’s judgement rather than my own),
…HEV is consistent with, but different from, mine (in which case I get everything I want either way), or
…HEV is identical to mine (in which case I get everything I want either way).
So HEV ≤ mine.
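The case analysis above amounts to a weak-dominance argument, which can be sketched as a toy check (the payoff numbers are hypothetical placeholders, not anything from the post; all that matters is their ordering in each case):

```python
# Toy dominance check for the three cases above.
# For each case, list (my utility if my own EV is implemented,
#                      my utility if humanity's EV is implemented).
# Payoffs are illustrative placeholders; only their ordering matters.

cases = {
    "HEV inconsistent with mine": (1.0, 0.5),   # HEV gives me less
    "HEV consistent but different": (1.0, 1.0),  # everything I want either way
    "HEV identical to mine": (1.0, 1.0),         # trivially the same
}

# My own EV weakly dominates HEV from my perspective:
# in no case does HEV do better for me.
assert all(mine >= hev for (mine, hev) in cases.values())
```

The conclusion "HEV ≤ mine" is just this weak dominance: in every case my own extrapolated volition does at least as well for me, and in one case strictly better.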
That said, others more reliably get more of what they want using HEV than using mine, which potentially makes it easier to obtain their cooperation if they think I’m going to use HEV. So I should convince them of that.
But they’d prefer just the CEV of the two of you to that of all humanity, and the same goes for each single human who’d raise that objection. The end result is the CEV of you plus everyone who could have stopped you. And this doesn’t need handling before you build it, either: I’m pretty sure it arises naturally from TDT if you implement your own and were only able to do so because you used this argument on a bunch of people.