So let’s consider this from a different angle. In Hanson’s Age of Em (which I recommend) he starts his Em Scenario by making a handful of assumptions about Ems. Assumptions like:
That we can’t really make meaningful changes to Ems beyond pharmacological tweaks, because the brain is inscrutable.
That Ems cannot be merged for the same reasons.
The purpose of these assumptions is to stop the hypothetical Em economy from immediately self-modifying into something else. He tries to figure out how many doublings the Em economy will undergo before it phase transitions into a different technological regime. Critics of the book usually ask why the Em economy wouldn’t just immediately invent AGI, and Hanson has some clever cope for this where he posits a (then plausible) nominal improvement rate for AI that implies AI won’t overtake Ems until five years into the Em economy or something like this. In reality, AI progress is on something like an exponential curve, and that old cope is completely unreasonable.
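To make the shape of that objection concrete, here is a minimal toy comparison, with every number invented purely for illustration: under a fixed “nominal” yearly gain, an AI that starts well below Em level takes many decades to catch up, while under exponential doubling the crossover happens within a few years (and that is before accounting for the Em economy itself speeding up research).

```python
# Toy comparison of a "nominal" (linear) AI improvement rate versus an
# exponential one. Every number here is hypothetical and chosen only to
# illustrate the argument, not to forecast anything.
import math

em_level = 1.0       # fixed Em / human-equivalent capability, arbitrary units
ai_start = 0.05      # hypothetical AI starting point: 5% of Em level
linear_gain = 0.01   # hypothetical "nominal" gain per year
doubling_time = 1.0  # hypothetical doubling time (years) for exponential progress

years_linear = (em_level - ai_start) / linear_gain
years_exponential = doubling_time * math.log2(em_level / ai_start)

print(f"nominal rate: {years_linear:.0f} years to reach Em level")       # 95 years
print(f"exponential:  {years_exponential:.1f} years to reach Em level")  # ~4.3 years
```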
So the first assumption of a “make uploads” plan is that you have a unipolar scenario where the uploads will only be working on alignment, or at least actively not working on AI capabilities. There is a further hidden assumption in that assumption which almost nobody thinks about, which is that there is such a thing as meaningful AI alignment progress separate from “AI capabilities” (I tend to think they have a relatively high overlap, perhaps 70%?). This is not in and of itself a dealbreaker, but it does mean you have a lot of politics to think about in terms of who the unipolar power is, who precisely is getting uploaded, and things of this nature.
But I think my fundamental objection to this kind of thing is more like my fundamental objection to something like OpenAI’s Superalignment (or to a lesser extent PauseAI), which is that this sort of plan doesn’t really generate any intermediate bits of solution to the alignment problem until you start the search process, at which point you plausibly have too few bits to even specify a target. If we were at a place where we mostly had consensus about what the shape of an alignment solution looks like and what constitutes progress, and we mostly agreed that this involved breaking our way past some brick wall like “solve the Collatz Conjecture”, I would agree that throwing a slightly superhuman AI at the figurative Collatz Conjecture is probably our best way of breaking through.
The difference between alignment and the Collatz Conjecture, however, is that as far as I know nobody can find any pattern to the number streams involved in the Collatz Conjecture, while alignment has enough regular structure that we can stumble into bits of solution without even intending to. There’s a strain of criticism of Yudkowsky that says “you said by the time an AI can talk it will kill us, and you’re clearly wrong about that”, to which Yudkowsky (begrudgingly, when he acknowledges this at all) replies “okay, but that’s mostly me overestimating the difficulty of language acquisition; these talking AIs are still very limited in what they can do compared to humans, and when we get AIs that aren’t they’ll kill us”. This is a fair reply as far as it goes, but it glosses over the fact that the first impossible problem, the one Bostrom’s 2014 Superintelligence brings up repeatedly to explain why alignment is hard, is that there is no way to specify a flexible representation of human values in the machine before the computer is already superintelligent and therefore presumed incorrigible. We now have a reasonable angle of attack on that problem. Whether you think that reasonable angle of attack implies 5% alignment progress or 50% (I’m inclined towards closer to 50%), the most important fact is that the problem budged at all. Problems that are actually impossible do not budge like that!
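For anyone who hasn’t played with it, the “number streams” here are just the trajectories of the Collatz map (halve if even, triple and add one if odd); a few lines of Python, as a generic illustrative sketch rather than anything from the discussion above, show how irregular they look:

```python
def collatz_trajectory(n):
    """Return the hailstone sequence from n: halve if even, 3n + 1 if odd."""
    seq = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        seq.append(n)
    return seq

# Neighboring starting points produce wildly different "number streams",
# which is exactly the lack of exploitable pattern referred to above.
for start in (26, 27):
    traj = collatz_trajectory(start)
    print(start, "length:", len(traj), "first terms:", traj[:12])
```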
The Collatz Conjecture is impossible(?) because no matter what analysis you throw at the number streams, you don’t find any patterns that would help you predict the result. That means you put in tons and tons of labor, and after decades of throwing total geniuses at it you perhaps have a measly bit or two of hypothesis space eliminated. If you think a problem is impossible and you accidentally stumble into 5% progress, you should update pretty hard towards “wait, this probably isn’t impossible; in fact this might not even be that hard once you view it from the right angle”. If you shout very loudly “we have made zero progress on alignment” when some scratches in the problem are observed, you are actively inhibiting the process that might eventually solve the problem. If the generator of this ruinous take also says things like “nobody besides MIRI has actually studied machine intelligence” in the middle of a general AI boom, then I feel comfortable saying it’s being driven by ego-inflected psychological goop or something, and that I have a moral imperative to shout “NO ACTUALLY THIS SEEMS SOLVABLE” back.
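The size of that update can be made concrete with a toy Bayesian calculation; the numbers below are invented purely for illustration, but the point is that accidental partial progress is far more likely in worlds where the problem is tractable, so even a confident “this is impossible” prior should move a lot.

```python
# Toy Bayesian update: how much should stumbling into accidental partial
# progress move a "this problem is impossible" prior? All numbers are
# hypothetical and only illustrate the direction and size of the update.

prior_tractable = 0.10           # start out 90% confident the problem is impossible
p_progress_if_tractable = 0.50   # accidental progress is unsurprising if tractable
p_progress_if_impossible = 0.02  # and very surprising if genuinely impossible

posterior_tractable = (prior_tractable * p_progress_if_tractable) / (
    prior_tractable * p_progress_if_tractable
    + (1 - prior_tractable) * p_progress_if_impossible
)
print(f"P(tractable | accidental progress) = {posterior_tractable:.2f}")  # ~0.74
```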
So any kind of “meta-plan”, regardless of its merits, is sort of an excuse to not explore the ground that has opened up and to ally with the “we have made zero alignment progress” egregore, which makes me intrinsically suspicious of such plans even when I think that on paper they would probably succeed. I get the impression that things like OpenAI’s Superalignment are advantageous because they let alignment continue to be a floating signifier, to avoid thinking about the fact that unless you can place your faith in a process like CEV, the entire premise of driving the future somewhere implies needing to have a target future in mind, which people will naturally disagree about. Which could naturally segue into another several paragraphs about how, when you have a thing people are naturally going to disagree about and you do your best to sweep that under the rug to make the political problem look like a scientific or philosophical problem, it’s natural to expect other people to intervene to stop you, since their reasonable expectation is that you’re doing this to make sure you win that fight. Because of course you are, duh. Which is fine when you’re doing a brain in a box in a basement, but as soon as you’re transitioning into government-backed bids for a unipolar advantage, the same strategy has major failure modes, like losing the political fight to an eternal regime of darkness, that sound very fanciful and abstract until they’re not.
You initially questioned whether uploads would be aligned, but now you seem to be raising several other points which do not engage with that topic or with any of my last comment. I do not think we can reach agreement if you switch topics like this—if you now agree that uploads would be aligned, please say so. That seems to be an important crux, so I am not sure why you want to move on from it to your other objections without acknowledgement.
I am not sure I was able to correctly parse this comment, but you seem to be making a few points.
In one place, you question whether the capabilities / alignment distinction exists—I do not really understand the relevance, since I nowhere suggested pure alignment work, only uploading / emulation etc. This also seems to be somewhat in tension with the rest of your comment, but perhaps it is only an aside and not load-bearing?
Your main point, as I understand it, is that alignment may actually be tractable to solve, and a focus on uploading is an excuse to delay alignment progress and then (as you seem to frame my suggestion) have an upload solve it all at once. And this does not allow incremental progress or partial solutions until uploading works.
...then you veer into speculation about the motives / psychology of MIRI and the superalignment team, which is interesting but doesn’t seem central or even closely connected to the discussion at hand.
So I will focus on the main point here. I have a lot of disagreements with it.
I think you may misunderstand my plan here—you seem to characterize the idea as making uploads, and then setting them loose to either self-modify etc. or mainly to work on technical alignment. Actually, I don’t view it this way at all. Creating the uploads (or emulations, if you can get a provably safe imitation learning scheme to work faster) is a weak technical solution to the alignment problem—now you have something aligned to (some) human(’s) values which you can run 10x faster, so it is in that sense not only an aligned AGI but modestly superintelligent. You can do a lot of things with that—first of all, it automatically hardens the world significantly: it lowers the opportunity cost of not building superintelligence because now we already have a bunch of functionally genius scientists, you can drastically improve cybersecurity, and perhaps the uploads make enough money to buy up a sufficient percentage of GPUs that whatever is left over is not enough to outcompete them even if someone creates an unaligned AGI. Another thing you can do is try to find a more scalable and general solution to the AI safety problem—including technical methods like agent foundations, interpretability, and control, as well as governance. But I don’t think of this as the mainline path to victory in the short term.
Perhaps you are worried that uploads will recklessly self-modify or race to build AGI. I don’t think this is inevitable or even the default. There is currently no trillion dollar race to build uploads! There may be only a small number of players, and they can take precautions and enforce regulations on what uploads are allowed to do (effectively, since uploads are not strong superintelligences), and technically it even seems hard for uploads to recursively self-improve by default (human brains are messy, and they don’t even need to be given read/write access). Even if some uploads escaped, to recursively self-improve safely they would need to solve their own alignment problem, and it is not in their interests to recklessly forge ahead, particularly if they can be punished with shutdown and are otherwise potentially immortal. I suspect that most uploads who try to foom will go insane, and it is not clear that the power balance favors any rogue uploads who fare better.
I also don’t agree that there is no incremental progress on the way to full uploads—I think you can build useful rationality enhancing artifacts well before that point—but that is maybe worth a post.
Finally, I do not agree with this characterization of trying to build uploads rather than just solving alignment. I have been thinking about and trying to solve alignment for years, I see serious flaws in every approach, and I have recently started to wonder if alignment is just uploading with more steps anyway. So, this is more like my most promising suggestion for alignment, rather than giving up on solving alignment.