like the fact that any control technique on AI would be illegal because it would be essentially equivalent to brainwashing, such that I consider AIs much more alignable than humans
A lot of (most?) humans end up nice without needing to be controlled / “aligned”, and I don’t particularly expect this to break if they grow up smarter. Trying to control / “align” them wouldn’t work anyway, which is also what I predict will happen with sufficiently smart AI.
I think this is my disagreement: I don’t think most humans are in fact nice/aligned to each other by default. The reason this doesn’t lead to catastrophe, broadly speaking, is a combination of two things: we can rely on institutions/mechanism design such that even if people are misaligned, you can still get people well off under certain assumptions (capitalism and the rule of law being one such example), and the inequalities aren’t so great that individual humans can found their own societies, except in special cases.
Even here, I’d argue that human autocracies are very often severely misaligned to their citizens’ values.
To be clear about what I’m not claiming: I’m not saying that alignment is worthless, or that alignment always or very often fails. My view is consistent with a world where >50-60% of alignment attempts are successful.
This means I’m generally much more scared of extreme-outlier smart humans, for example a +7-12 SD human (assuming no other crippling disabilities) who held power over a large group of citizens, unless they were very pro-social/aligned to their citizenry.
I’m not claiming that alignment will not work, or even that it will very often not work, but rather that the chance of failure is real and the stakes are quite high long-term.
(And that’s not even addressing how you could get super-smart people to work on the alignment problem).
This is a definition mostly for the sake of having one, but I think you could define a human as aligned if they could be given an ASI slave without being an S-risk. I really think that under this definition, the absolute upper bound on the share of “aligned” humans is 5%, and I think it’s probably a lot lower.
I’m more optimistic, in that the upper bound could be as high as 50-60%, but yeah, the people in power are unfortunately not in that set, and I’d only trust 25-30% of the population in practice if they had an ASI slave.
What would it mean for them to have an “ASI slave”? Like having an AI that implements their personal CEV?
Yeah, something like that: the ASI is an extension of their will.
So you think that, for >95% of currently living humans, the implementation of their CEV would constitute an S-risk in the sense of being worse than extinction in expectation? This is not at all obvious to me; in what way do you expect their CEVs to prefer net suffering?
I mean if we actually succeeded at making people who are +7 SD in a meaningful way, I’d expect that at least a good chunk of them would figure out for themselves that it makes sense to work on it.
That requires either massive personality changes to make them more persuadable, or massive willingness of people to put genetic changes in their germline, and I don’t expect either of these to happen before AI automates everything and either takes over (leaving us extinct) or humans/other AIs successfully control/align AIs.
(A key reason for this is that Genesmith admitted that the breakthroughs in germline engineering can’t transfer to the somatic side, which means we’d have to wait a minimum of 25-30 years for an enhanced generation to grow up, given that society won’t maximally favor the genetically lucky, and that’s way beyond most plausible AI timelines at this point.)
If they’re that smart, why will they need to be persuaded?
Because they might consider other problems more worth their time, since intelligence enhancement changes their values little.
And maybe they believe that AI alignment isn’t impactful for technical/epistemic reasons.
I’m confused/surprised I need to make this point: I don’t think they will automatically be persuaded that AI alignment is a big problem they need to work on, so some persuasion effort will likely still be required.
I mean, if they care about solving problems at all, and we are in fact correct about AGI ruin, then they should predictably come to view it as the most important problem and start working on it?
Are you imagining they’re super myopic or lazy and just want to think about math puzzles or something? If so, my reply is that even if some of them ended up like that, I’d be surprised if they all ended up like that, and if so that would be a failure of the enhancement. The aim isn’t to create people who we will then carefully persuade to work on the problem, the aim is for some of them to be smart + caring + wise enough to see the situation we’re in and decide for themselves to take it on.
More so that I’m imagining they might not even have heard of the argument. It’s worth noting that people like Terence Tao and Timothy Gowers are excellent in their chosen fields, yet most people who have a big impact on the world don’t go into AI alignment.
Remember, superintelligence is not omniscience.
So I don’t expect them to be self-motivated to work on this specific problem without at least a little persuasion.
I’d expect a few superintelligent adults to join alignment efforts, but nowhere near thousands or tens of thousands; I’d upper-bound it at 300-500 new researchers at most over 15-25 years.
Much less impactful than automating AI safety.
I don’t think this will work.
How much probability do you assign to automating AI safety not working in time? I believe that preparing to automate AI safety is probably the highest-value intervention in pure ability to reduce X-risk probability, assuming it does work, so I assign much higher EV to automating AI safety relative to other approaches.
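To make the EV framing concrete, here’s a minimal sketch of the comparison I have in mind. All the numbers (probability of working in time, x-risk reduction if it works) are purely hypothetical placeholders, not estimates anyone in this thread has actually given:

```python
# Toy EV comparison: EV = P(intervention works in time) * x-risk reduction if it works.
# Every number below is a hypothetical placeholder for illustration only.

def expected_risk_reduction(p_works_in_time: float, reduction_if_works: float) -> float:
    """Expected x-risk reduction from an intervention."""
    return p_works_in_time * reduction_if_works

interventions = {
    "automate AI safety":   expected_risk_reduction(0.50, 0.30),
    "germline enhancement": expected_risk_reduction(0.10, 0.40),
    "coordinated pause":    expected_risk_reduction(0.05, 0.50),
}

# Rank interventions by expected x-risk reduction.
for name, ev in sorted(interventions.items(), key=lambda kv: -kv[1]):
    print(f"{name}: expected x-risk reduction = {ev:.3f}")
```

The point of the sketch is just that an intervention with a modest payoff but a high chance of working in time can dominate one with a larger payoff that probably arrives too late.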
I think I’m at <10% that non-enhanced humans will be able to align ASI in time, and if I condition on them succeeding somehow I don’t think it’s because they got AIs to do it for them. Like maybe you can automate some lower level things that might be useful (e.g. specific interpretability experiments), but at the end of the day someone has to understand in detail how the outcome is being steered or they’re NGMI. Not sure exactly what you mean by “automating AI safety”, but I think stronger forms of the idea are incoherent (e.g. “we’ll just get AI X to figure it all out for us” has the problem of requiring X to be aligned in the first place).
As for how a plan to automate AI safety would work out in practice, assuming a relatively strong version of the concept, see the post below; another post by the same author, talking more about the big risks discussed in its comments, is forthcoming:
https://www.lesswrong.com/posts/TTFsKxQThrqgWeXYJ/how-might-we-safely-pass-the-buck-to-ai
In general, I think the crux is this: in most timelines (at a lower bound, 65-70%) where AGI is developed relatively soon (roughly 2030-2045) and the alignment problem isn’t solvable by default, or is at least non-trivially tricky to solve, conditioning on alignment success looks more like “we successfully figured out how to prepare for AI automation of everything, and we managed to use alignment and control techniques well enough that we can safely pass most of the effort to AI”, rather than other end states like “humans are deeply enhanced” or “lawmakers actually coordinated to pause AI, and are actually giving funding to alignment organizations such that we can make AI safe.”