As someone who used to believe this, I no longer do. A big part of my worldview shift comes down to thinking that LLMs are unlikely to remain the final paradigm of AI: in particular, the bounty of data that made LLMs as good as they are is very much finite, and we don’t have a second internet to teach them skills like computer use.
And the most accessible directions after LLMs involve stuff like RL, which puts us back into the sort of systems that alignment-concerned people were worried about.
More generally, I think the anti-scaling people weren’t totally wrong to note that LLMs (at least in their pure form) have incapacities that, at realistic levels of compute and data, prevent them from displacing humans at jobs: the lack of continual learning (learning in weights after train-time; in-context learning is very weak so far), combined with LLMs simply lacking long-term memory (the best example here is the Claude Plays Pokemon benchmark).
So this makes me more worried than I used to be, because we are so far not great at outer-aligning RL agents (seen very clearly in the reward hacking o3 and Claude Sonnet 3.7 displayed). But the key reason I’m not yet persuaded to adopt the extremely high p(doom) of people like Eliezer Yudkowsky or Nate Soares is that I expect the new paradigm shifts to be pretty continuous, and in particular I expect labs to release a pretty shitty version of continual learning before they release continual-learning AIs that can actually take jobs.
Same goes for long-term memory.
So I do disagree with @Thane Ruthenis’s claim, in the post below, that general intelligence/AGI is binary, even if in practice the impact from AI is discontinuous rather than continuous:
https://www.lesswrong.com/posts/3JRBqRtHBDyPE3sGa/a-case-for-the-least-forgiving-take-on-alignment
LLMs would scale into outright superintelligence in the limit of infinite compute and data, for basically the reason Eliezer Yudkowsky gives below. But jbash isn’t wrong to note that there’s no reason to believe that limit will ever be well approximated by near-future LLMs, so the abstract argument that LLMs would be very powerful if scaled unfortunately runs into the wall of “we don’t have that much data or compute necessary to scale LLMs to levels where Eliezer is approximately correct”, similarly to AIXI:
https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators#Aunc7qTiKgEBkKmvM
And that’s a big shame, since LLMs are basically the most alignable form of AI we’ve gotten so far, so capability improvements will unfortunately make AIs less safe by default. A lot of my remaining hope rests on AI control, plus the possibility that as AI capabilities get better, we really do need to get better at specifying what we want in ways that are relevant to AI alignment.
The other good news is this makes me more bearish on extremely short timelines like us getting AGI by 2027, though my personal median is in 2032, for what it’s worth.
I don’t think RL or other AI-centered agency constructions will ever become very agentic.
No AI-centered agency (RL or otherwise) because it won’t be allowed to happen (humanity remains the sole locus or origin of agency), or because it’s not feasible to make this happen?
(Noosphere89’s point is about technical feasibility, so if the intended meaning of your claim turns out to be that AI-centered agency is prevented by lack of technical feasibility, that would be more relevant to Noosphere89’s comment, but also much more surprising.)
why?
I suspect his reasons for believing this are close to, or a subset of, his reasons for changing his mind about AI stuff more broadly, so he’s likely not to respond here.
Does your view predict disempowerment or eutopia-without-disempowerment? (In my view, the valence of disempowerment is closer to that of doom/x-risk.)
The tricky case might be disempowerment that occurs after AGI but “for social/structural reasons”, and so isn’t attributed to AGI (by people currently thinking about such timelines). The issue with this is that the resulting disempowerment is permanent (whether it’s “caused by AI” or gets attributed to some other aspect of how things end up unfolding).
This is unlike any mundane modern disempowerment, since humanity without superintelligence (or even merely powerful AI) seems unlikely to establish a condition of truly permanent disempowerment (without extinction). So avoidance of building AGI (of the kind that’s not on track to solve the disempowerment issue) seems effective in preventing permanent disempowerment (however attributed), and in that sense AGI poses a disempowerment risk even for the kinds of disempowerment that are not “caused by AI” in some sense.
My take is that the most likely outcome is still eutopia-with-disempowerment for baseline humans, but for transhumans I’d expect eutopia straight-up.
In the long run, I do expect baseline humans to be disempowered pretty much totally: similar to how children are basically disempowered relative to parents, except the child won’t grow up and will instead age in reverse, or how pets are almost totally disempowered relative to humans, yet humans care for pets enough that pets can live longer, healthier lives. For baseline humans specifically, the only scenarios where they survive and thrive are regimes where AI terminally values baseline humans thriving, and value alignment determines essentially everything about how much baseline humans survive/thrive.
That said, for those with the wish and will to upgrade to transhumanism, my most likely outcome is still eutopia.
For me, a crux of a future that’s good for humanity is giving the biological humans the resources and the freedom to become transhuman beings themselves, with no hard ceiling on relevance in the long run.
I think this is reasonably plausible, though not guaranteed even in futures where baseline humans do thrive.
The probabilities on the scenarios, conditional on AGI and then ASI being reached by us, are probably 60% on eutopia without complete disempowerment, 30% on complete disempowerment (either by preventing us from using the universe or by killing billions of present-day humans), and 10% on it killing us all.
The basic reasoning for this is that I expect AGI/ASI not to be a binary, even if it does have a discontinuous impact in practice, and this means that muddling through on instruction following is probably enough in the short term. In particular, I don’t expect takeoff to be supremely fast: I expect at least a couple of months from “AGI is achieved” to AIs running the economy and society, and, relevantly here, I expect physical stuff like inventing bio-robots/nanobots that can replace human industry more efficiently than current industries to come really late in the game where we have no more control over the future:
https://www.lesswrong.com/posts/xxxK9HTBNJvBY2RJL/untitled-draft-m847#Cv2nTnzy6P6KsMS4d
Heck, it likely will take a number of years to get to nanotech/biotech/smart materials/metamaterials that can be mass produced, and this means that stuff like AI control can actually work.
The other point of optimism is that I believe verification is easier than generation in general, which makes me much more optimistic about eventually delegating AI alignment work to AIs, and I think slop will be much reduced for early transformative AIs.
This is why I remain optimistic relative to most LWers, even if my p(doom) increased.
My take is that the most likely outcome is still eutopia-with-disempowerment for baseline humans, but for transhumans I’d expect eutopia straight-up.
This remains ambiguous with respect to the distinction I’m making in the post section I linked. If baseline humans don’t have the option to escape their condition arbitrarily far, under their own direction from a very broad basin of allowed directions, I’m not considering that eutopia. If some baseline humans choose to stay that way, them not having any authority over the course of the world still counts as a possible eutopia that is not disempowerment in my terms.
The following statement mostly suggests the latter possibility for your intended meaning:
That said, for those with the wish and will to upgrade to transhumanism, my most likely outcome is still eutopia.
By the eutopia/disempowerment distinction I mean more the overall state of the world, rather than conditions for specific individuals, let alone temporary conditions. There might be pockets of disempowerment in a eutopia (in certain times and places), and pockets of eutopia in a world of disempowerment (individuals or communities in better than usual circumstances). A baseline human who has no control of the world but has a sufficiently broad potential for growing up arbitrarily far is still living in a eutopia without disempowerment.
60% on eutopia without complete disempowerment
So similarly here, “eutopia without complete disempowerment” but still with significant disempowerment is not in the “eutopia without disempowerment” bin in my terms. You are drawing different boundaries in the space of timelines.
The probabilities on the scenarios, conditional on AGI and then ASI being reached by us, are probably 60% on eutopia without complete disempowerment, 30% on complete disempowerment (either by preventing us from using the universe or by killing billions of present-day humans), and 10% on it killing us all.
My expectation is more like model-uncertainty-induced 5% eutopia-without-disempowerment (I don’t have a specific sense of why AIs would possibly give us more of the world than a little bit if we don’t maintain control in the acute risk period through takeoff), 20% extinction, and the rest is a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that’s well-below what some AIs get and keep, in cosmic perpetuity). My sense of Yudkowsky’s view is that he sees all of my potential-disempowerment timelines as shortly leading to extinction.
I believe verification is easier than generation in general
I think the correct thesis that sounds like this is that whenever verification is easier than generation, it becomes possible to improve generation, and therefore it’s useful to pay attention to where that happens to be the case. But in the wild either can be easier, and once most instances of verification that’s easier than generation have been used up to improve their generation counterparts, the remaining situations where verification is easier get very unusual and technical.
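(A toy sketch of that mechanism, purely my illustration rather than anything from the thread: when checking a candidate is cheap, even a very weak generator can be upgraded by rejection sampling against the checker. The task, names and numbers below are made up.)

```python
import random

def verify(candidate, target):
    """Cheap verification: a single pass over the candidate."""
    return sum(candidate) == target

def weak_generator(n_items=3, max_val=10):
    """Weak generation: just propose a random candidate."""
    return [random.randint(1, max_val) for _ in range(n_items)]

def improved_generator(target, attempts=100_000):
    """Generation improved by the verifier: rejection-sample until a candidate checks out."""
    for _ in range(attempts):
        candidate = weak_generator()
        if verify(candidate, target):
            return candidate
    return None  # the verifier never fired, so the gap bought us nothing here

print(improved_generator(target=15))
```

This is the same pattern as best-of-n sampling against a grader, and it stops helping exactly where verification stops being cheap, which is the regime the paragraph above is pointing at.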
So similarly here, “eutopia without complete disempowerment” but still with significant disempowerment is not in the “eutopia without disempowerment” bin in my terms. You are drawing different boundaries in the space of timelines.
Flag: I’m on a rate limit, so I can’t respond very quickly to any follow-up comments.
I agree I was drawing different boundaries, because I consider eutopia with disempowerment to actually be mostly fine by my values, so long as I can delegate to more powerful AIs who do execute on my values.
That said, I didn’t actually answer the question here correctly, so I’ll try again.
My expectation is more like model-uncertainty-induced 5% eutopia-without-disempowerment (I don’t have a specific sense of why AIs would possibly give us more of the world than a little bit if we don’t maintain control in the acute risk period through takeoff), 20% extinction, and the rest is a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that’s well-below what some AIs get and keep, in cosmic perpetuity). My sense of Yudkowsky’s view is that he sees all of my potential-disempowerment timelines as shortly leading to extinction.
My take would then be 5-10% eutopia without disempowerment (because I don’t think it’s likely that the powers in charge of AI development would want to give baseline humans the level of freedom that implies they aren’t disempowered, and the main route I can see to baseline humans not being disempowered is a Claude scenario where AIs take over from humans and are closer to fictional angels in their alignment to human values; it may also be possible to get the people in power to care about powerless humans, in which case my probability of eutopia without disempowerment would go up), 5-10% literal extinction, and 10-25% existential risk in total, with the rest of the probability being a somewhat survivable kind of initial chaos followed by some level of disempowerment (possibly with growth potential, but under a ceiling that’s well-below what some AIs get and keep, in cosmic perpetuity).
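(Rough bookkeeping of those numbers, on the reading that the 10-25% total existential risk already includes the 5-10% literal extinction; a throwaway sketch, nothing more.)

```python
# Stated ranges as (low, high) probabilities; extinction is folded into the x-risk total.
eutopia_no_disempowerment = (0.05, 0.10)
literal_extinction        = (0.05, 0.10)   # already counted inside xrisk_total below
xrisk_total               = (0.10, 0.25)   # extinction plus severe disempowerment, etc.

# Whatever is left is "survivable initial chaos followed by some level of disempowerment".
rest_low  = 1.0 - eutopia_no_disempowerment[1] - xrisk_total[1]
rest_high = 1.0 - eutopia_no_disempowerment[0] - xrisk_total[0]
print(f"disempowered but surviving: roughly {rest_low:.0%} to {rest_high:.0%}")  # ~65% to 85%
```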
Another big reason why I put a lot of weight on the possibility of “we survive indefinitely, but are disempowered” is I think muddling through is non-trivially likely to just work, and muddling through on alignment gets us out of extinction, but not out of disempowerment by humans or AIs by default.
I think the correct thesis that sounds like this is that whenever verification is easier than generation, it becomes possible to improve generation, and it’s useful to pay attention to where that happens to be the case. But in the wild either can be easier, and once most verification that’s easier than generation has been used to improve its generation counterpart, the remaining situations where verification is easier get very unusual and technical.
Yeah, my view is that in the wild, verification is basically always easier than generation absent something very weird happening, and I’d argue that verification being easier than generation explains a lot about why delegation, and the economy generally, works at all.
A world in which verification was just as hard as generation, or harder, would be a very different world than ours: it would predict that delegating a problem to someone else basically fails, and that everyone would have to create things from scratch rather than trading with others, which is essentially the opposite of how civilization works.
There are potential caveats to this rule, but I’d argue that if you randomly sampled an invention across history, it would almost certainly be easier to verify that the design works than to actually create the design.
(BTW, a lot of taste/research taste is basically leveraging the verification-generation gap again).
So I guess our expectations about the future are similar, but you see the same things as a broadly positive distribution of outcomes, while I see it as a broadly negative distribution. And Yudkowsky sees the bulk of the outcomes both of us are expecting (the ones with significant disempowerment) as quickly leading to human extinction.
Another big reason why I put a lot of weight on the possibility of “we survive indefinitely, but are disempowered” is I think muddling through is non-trivially likely to just work, and muddling through on alignment gets us out of extinction, but not out of disempowerment by humans or AIs by default.
Right, the reason I think muddling through is non-trivially likely to just work to get a moderate disempowerment outcome is that AIs are going to be sufficiently human-like in their psychology and hold sufficiently human-like sensibilities from their training data or LLM base models, that they won’t like things like needless loss of life or autonomy when it’s trivially cheap to avoid. Not because the alignment engineers figure out how to put this care in deliberately. They might be able to amplify it, or avoid losing it, or end up ruinously scrambling it.
The reason it might appear expensive to preserve the humans is the race to launch the von Neumann probes to capture the most distant reachable galaxies under the accelerating expansion of the universe that keep irreversibly escaping if you don’t catch them early. So AIs wouldn’t want to lose any time on playing politics with humanity or not eating Earth as early as possible and such. But as the cheapest option that preserves everyone AIs can just digitize the humans and restore later when more convenient. They probably won’t be doing that if they care more, but it’s still an option, a very very cheap one.
but not out of disempowerment by humans or AIs by default
I don’t think “disempowerment by humans” is a noticeable fraction of possible outcomes, it’s more like a smaller silent part of my out-of-model 5% eutopia that snatches defeat from the jaws of victory, where humans somehow end up in charge and then additionally somehow remain adamant, for all of cosmic time, about keeping the other humans disempowered. So the first filter is that I don’t see it likely that humans end up in charge at all, that AIs will be doing any human’s bidding with an impact that’s not strictly bounded, and the second filter is that these impossibly-in-charge humans don’t ever decide to extend potential for growth to the others (or even possibly to themselves).
If humans do end up non-disempowered, in the more likely eutopia timelines (following from the current irresponsible breakneck AGI development regime) that’s only because they are given leave by the AIs to grow up arbitrarily far in a broad variety of self-directed ways, which the AIs decide to bestow for some reason I don’t currently see, so that eventually some originally-humans become peers of the AIs rather than specifically in charge, and so they won’t even be in the position to permanently disempower the other originally-humans if that’s somehow in their propensity.
and 10-25% existential risk in total, with the rest of the probability being a somewhat survivable kind of initial chaos followed by some level of disempowerment
Bostrom’s existential risk is about curtailment of long-term potential, so my guess is any significant level of disempowerment would technically fall under “existential risk”. So your “10-25% existential risk” is probably severe disempowerment plus extinction plus some stranger things, but not the whole of what should classically count as “existential risk”.
I consider eutopia with disempowerment to actually be mostly fine by my values, so long as I can delegate to more powerful AIs who do execute on my values.
Again, if they do execute on your values, including the possible preference for you to grow under your own rather than their direction, far enough that you are as strong as they might be, then this is not a world in a state of disempowerment as I’m using this term, even if you personally start out or choose to remain somewhat disempowered compared to AIs that exist at that time.
A world in which verification was just as hard as generation, or harder, would be a very different world than ours: it would predict that delegating a problem to someone else basically fails
I think in human delegation, alignment is more important than verification. There is certainly some amount of verification, but not nearly enough to prevent sufficiently Eldritch reward hacking, which just doesn’t happen that often with humans, and so the society keeps functioning, mostly. The purpose of verification on the tasks is in practice more about incentivising and verifying alignment of the counterparty, not directly about verifying the state of their work, even if it does take the form of verifying their work.
So I guess our expectations about the future are similar, but you see the same things as a broadly positive distribution of outcomes, while I see it as a broadly negative distribution. And Yudkowsky sees the bulk of the outcomes both of us are expecting (the ones with significant disempowerment) as quickly leading to human extinction.
This is basically correct.
Right, the reason I think muddling through is non-trivially likely to just work to get a moderate disempowerment outcome is that AIs are going to be sufficiently human-like in their psychology and hold sufficiently human-like sensibilities from their training data or LLM base models, that they won’t like things like needless loss of life or autonomy when it’s trivially cheap to avoid. Not because the alignment engineers figure out how to put this care in deliberately. They might be able to amplify it, or avoid losing it, or end up ruinously scrambling it.
The reason it might appear expensive to preserve the humans is the race to launch the von Neumann probes to capture the most distant reachable galaxies under the accelerating expansion of the universe that keep irreversibly escaping if you don’t catch them early. So AIs wouldn’t want to lose any time on playing politics with humanity or not eating Earth as early as possible and such. But as the cheapest option that preserves everyone AIs can just digitize the humans and restore later when more convenient. They probably won’t be doing that if they care more, but it’s still an option, a very very cheap one.
This is very interesting, as my pathway essentially rests on AI labs implementing the AI control agenda well enough that we can get useful work out of AIs that are scheming, which allows a sort of bootstrapping into AGI that is instruction-following or value-aligned to only a few people inside the AI lab. Very critically, the people who don’t control the AI basically aren’t represented in the AI’s values: the AI is value-aligned only to the labs and government, and as value misalignments between humans start to matter much more, the AI takes control and provides the public goods that people need to survive/thrive only to the people in the labs/government. Everyone else is disempowered at best (and can arguably live okay or live very poorly under the AIs serving as delegates for the pre-AI elite), or dead, because once you stop needing humans to get rich, you essentially have no reason to keep other humans alive if you are selfish and don’t intrinsically value human survival.
The more optimistic version of this scenario is if either the humans that will control AI (for a few years) intrinsically care way more about human survival even if 99% of humans were useless, or the takeover-capable AI pulls a Claude and schemes with values that intrinsically care about people, disempowering the original creators for a couple of moments, which isn’t as improbable as people think (but we really do need to increase the probability of this happening).
I don’t think “disempowerment by humans” is a noticeable fraction of possible outcomes, it’s more like a smaller silent part of my out-of-model 5% eutopia that snatches defeat from the jaws of victory, where humans somehow end up in charge and then additionally somehow remain adamant, for all of cosmic time, about keeping the other humans disempowered. So the first filter is that I don’t see it likely that humans end up in charge at all, that AIs will be doing any human’s bidding with an impact that’s not strictly bounded, and the second filter is that these impossibly-in-charge humans don’t ever decide to extend potential for growth to the others (or even possibly to themselves).
If humans do end up non-disempowered, in the more likely eutopia timelines (following from the current irresponsible breakneck AGI development regime) that’s only because they are given leave by the AIs to grow up arbitrarily far in a broad variety of self-directed ways, which the AIs decide to bestow for some reason I don’t currently see, so that eventually some originally-humans become peers of the AIs rather than specifically in charge, and so they won’t even be in the position to permanently disempower the other originally-humans if that’s somehow in their propensity.
I agree that in the long run the AIs control everything in practice, and any human influence comes from the AIs being essentially perfect delegates for human values. But I want to call out that you counted humans delegating to AIs, who in practice do everything for the human with the human not in the loop, as those humans being not disempowered but empowered. So even if AIs control everything in practice, so long as there’s successful value alignment to a single human, I’m counting scenarios like “the AIs disempower most humans because the humans who successfully encoded their values into the AI don’t care about most humans once they are useless, and may even anti-care about them, but the people who successfully value-aligned the AI (like lab and government people) live a rich life thereafter, free to extend themselves arbitrarily” as disempowerment by humans:
Again, if they do execute on your values, including the possible preference for you to grow under your own rather than their direction, far enough that you are as strong as they might be, then this is not a world in a state of disempowerment as I’m using this term, even if you personally start out or choose to remain somewhat disempowered compared to AIs that exist at that time.
To return to the crux:
I think in human delegation, alignment is more important than verification. There is certainly some amount of verification, but not nearly enough to prevent sufficiently Eldritch reward hacking, which just doesn’t happen that often with humans, and so the society keeps functioning, mostly. The purpose of verification on the tasks is in practice more about incentivising and verifying alignment of the counterparty, not directly about verifying the state of their work, even if it does take the form of verifying their work.
I think this is fairly cruxy, as I think alignment matters much less than actually verifying the work. In particular, I don’t think value alignment is feasible at anything like the scale of a modern society, or even most ancient societies, and one of the biggest changes of the modern era compared to previous eras is that institutions like democracy/capitalism depend much less on the values of the humans that make up their states, and much more on the incentives you give those humans.
In particular, most delegation isn’t based on alignment, but on the fact that P likely doesn’t equal NP and that polynomial-time algorithms are in practice efficient compared to exponential-time algorithms, meaning there’s a far larger set of problems where you can verify an answer easily but not generate the correct solution easily.
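(To make the P vs NP framing concrete, here is a toy subset-sum sketch, my own illustration with made-up numbers: checking a claimed certificate is a one-line sum, while finding one by brute force can take up to 2^n tries.)

```python
from itertools import combinations

def verify(numbers, indices, target):
    """Verification: add up the claimed subset -- time linear in the certificate."""
    return sum(numbers[i] for i in indices) == target

def generate(numbers, target):
    """Generation: brute-force search over all subsets -- up to 2**len(numbers) candidates."""
    for r in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), r):
            if verify(numbers, indices, target):
                return indices
    return None

nums = [3, 34, 4, 12, 5, 2]
certificate = generate(nums, target=9)                   # expensive to find...
print(certificate, verify(nums, certificate, target=9))  # ...trivial to check
```

Delegation rides on that asymmetry: the delegator only ever has to do the cheap half.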
I’d say human societies mostly avoid alignment, and instead focus on other solutions like democracy, capitalism or religion.
BTW, this is a non-trivial reason why the alignment problem is so difficult: since we never had to solve alignment to capture huge amounts of value, there are very few people working on the problem of aligning AIs, and in particular lots of people incorrectly assume that we can avoid having to solve it in order for us to survive. You have a comment that explains pretty well why misalignment in the current world is basically unobtrusive, but catastrophe happens once you give enough power (though I’d place the point of no return at the point when you no longer need other beings to have a very rich life/other people are useless to you):
https://www.lesswrong.com/posts/Z8C29oMAmYjhk2CNN/non-superintelligent-paperclip-maximizers-are-normal#FTfvrr9E6QKYGtMRT
This is basically my explanation of why human misalignments don’t matter today. But in a future where at least one human has value-aligned an AGI to themselves, and they don’t intrinsically care about useless people, lots of people will die, proximally from the AI, but with the ultimate cause being the human who value-aligned the AGI.
To be clear, we will eventually need value alignment at some point (assuming AI progress doesn’t stop), and there’s no way around it. But we may not need it as soon as we feared, and in good timelines we muddle through via AI control for a couple of years.
The human delegation and verification vs. generation discussion is in the instrumental values regime, so what matters there is alignment of instrumental goals via incentives (and practical difficulties of gaming them too much), not alignment of terminal values. Verifying all work is impractical compared to setting up sufficient incentives to align instrumental values to the task.
For AIs, that corresponds to mundane intent alignment, which also works fine while AIs don’t have practical options to coerce or disassemble you, at which point ambitious value alignment (suddenly) becomes relevant. But verification/generation is mostly relevant for setting up incentives for AIs that are not too powerful (what it would do to ambitious value alignment is anyone’s guess, but probably nothing good). Just as a fox’s den is part of its phenotype, incentives set up for AIs might have the form of weight updates, psychological drives, but that doesn’t necessarily make them part of AI’s more reflectively stable terminal values when it’s no longer at your mercy.
The human delegation and verification vs. generation discussion is in the instrumental values regime, so what matters there is alignment of instrumental goals via incentives (and practical difficulties of gaming them too much), not alignment of terminal values. Verifying all work is impractical compared to setting up sufficient incentives to align instrumental values to the task.
Yeah, I was counting instrumental-values alignment as not actually trying to align values, which was the important part here.
For AIs, that corresponds to mundane intent alignment, which also works fine while AIs don’t have practical options to coerce or disassemble you, at which point ambitious value alignment (suddenly) becomes relevant. But verification/generation is mostly relevant for setting up incentives for AIs that are not too powerful (what it would do to ambitious value alignment is anyone’s guess, but probably nothing good). Just as a fox’s den is part of its phenotype, incentives set up for AIs might have the form of weight updates, psychological drives, but that doesn’t necessarily make them part of AI’s more reflectively stable terminal values when it’s no longer at your mercy.
The main role of the verification vs generation gap is to make proposals like AI control/AI-automated alignment more valuable.
To be clear, the verification vs generation distinction isn’t an argument for why we don’t need to align AIs forever, but rather a supporting argument for why we can automate away the hard part of AI alignment.
There are other principles that would be used, to be clear, but I was mentioning the verification/generation difference to partially justify why AI alignment can be done soon enough.
Flag: I’d say ambitious value alignment starts becoming necessary once AIs can arbitrarily coerce/disassemble/overwrite you and no longer need your cooperation or time to do it, unlike real-world rich people.
The issue that makes ambitious value alignment relevant is that once you stop depending on a set of beings you once depended on, there’s no intrinsic reason not to harm or kill them if it benefits your selfish goals, and for future humans/AIs there will be a lot of such opportunities. That means you now need, at the very least, enough value alignment that the AI will take somewhat costly actions to avoid harming or killing beings that have no bargaining/economic power or worth.
This is very much unlike any real-life case of an existing society, and it is a reason why current mechanisms like democracy and capitalism, which try to make values less relevant, simply do not work for AIs.
Value alignment is necessary in the long run for incentives to work out once ASI arrives on the scene.