AFAIK, I have similar values[1] but lean differently.
~1% of the world dies every year. If we get AGI 1 year sooner, we save that 1%; if we push it back 1 year, we lose it. So, pushing back 1 year is only worth it if we reduce P(doom) by 1%.
This means your P(doom) given our current trajectory very much matters. If your P(doom) is <1%, then pushing back a year isn’t worth it.
The expected change conditional on accelerating also matters: if accelerating by 1 year increases, e.g., global tensions, raising the chance of a war between nuclear states by X% with an expected Y deaths, that also enters the calculus (I could see arguments either way though; I haven’t thought too hard about this).
For me, I’m at ~10% P(doom). Whether I’d accept a proposed slowdown depends on how much I expect it to decrease this number.[2]
How do you model this situation? (also curious on your numbers)
Assumptions:
We care about currently living people equally (alternatively, if you cared mostly about your young children, you’d happily accept a reduction in x-risk of 0.1%, possibly even 0.02%. Actuary table here)
Using expected value, which only mostly matches my intuitions (e.g. I’d actually accept pushing back 2 years for a reduction of x-risk from 1% to ~0%)
I mostly care about people I know, somewhat about people in general; the cosmic endowment would be nice, sure, but it’s only 10% of the value for me.
Most of my (currently living) loved ones skew younger, with a ~0.5% expected death rate, so I’d accept a lower expected reduction in x-risk (maybe 0.7%).
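To make the tradeoff concrete, here is a minimal sketch of the expected-value comparison being described, using only the illustrative numbers from this comment (the function and the specific inputs are placeholders, not anyone's considered estimates):

```python
# Back-of-the-envelope comparison from the comment above: a 1-year delay
# "costs" roughly the ~1% of currently living people who die each year,
# and "buys" whatever reduction in P(doom) the delay produces.
# All inputs are illustrative placeholders, not real estimates.

def delay_is_worth_it(annual_death_rate, p_doom_reduction):
    """Return True if the P(doom) reduction outweighs one year of deaths,
    treating doom as the loss of everyone currently alive."""
    cost = annual_death_rate      # expected fraction of people lost to the delay
    benefit = p_doom_reduction    # expected fraction of people saved from doom
    return benefit >= cost

print(delay_is_worth_it(0.01, 0.005))   # 0.5% risk reduction: not worth a year's delay
print(delay_is_worth_it(0.01, 0.02))    # 2% risk reduction: worth it
print(delay_is_worth_it(0.005, 0.007))  # younger loved ones (~0.5%/yr): 0.7% clears the bar
```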
So, pushing back 1 year is only worth it if we reduce P(doom) by 1%.
Only if you don’t care at all about people who aren’t yet born. I’m assuming that’s your position, but you didn’t state it as one of your two assumptions and I think it’s an important one.
The answer also changes if you believe nonhumans are moral patients, but it’s not clear which direction it changes.
Correct! I did mean to communicate that in the first footnote. I agree that valuing the unborn would drastically lower the amount of risk reduction needed to make a delay worthwhile.
Note that unborn people are merely potential, as their existence depends on our choices. Future generations aren’t guaranteed—we decide whether or not they will exist, particularly those who might be born decades or centuries from now. This makes their moral status far less clear than someone who already exists or who is certain to exist at some point regardless of our choices.
Additionally, if we decide to account for the value of future beings, we might consider both potential human people and future AI entities capable of having moral value. From a utilitarian perspective, both human and AI welfare presumably matters. This makes the ethical calculus more complicated, as the dilemma isn’t merely about whether we risk losing all future generations, but rather whether we risk shifting posterity from humans to AIs.
Personally, I’m largely comfortable evaluating our actions primarily—though not entirely—based on their impact on current human lives, or at least people (and animals) who will exist in the near-term. I value our present generation. I want us to keep living and to thrive. It would be a tragedy if we either went extinct or died from aging. However, to the extent that I care about distant future generations, my concern is substrate-impartial, and I don’t particularly favor humans over AIs.
my concern is substrate-impartial, and I don’t particularly favor humans over AIs.
Do you care whether AIs are sentient (or are there particular qualities you think entities need in order to be valuable)? Do you basically expect any AI capable of overtaking humans to have those qualities?
(btw, I appreciate that even though you disagree a bunch with several common LW-ish viewpoints you’re still here talking through things)
I am essentially a preference utilitarian and an illusionist regarding consciousness. This combination of views leads me to conclude that future AIs will very likely have moral value if they develop into complex agents capable of long-term planning, and are embedded within the real world. I think such AIs would have value even if their preferences look bizarre or meaningless to humans, as what matters to me is not the content of their preferences but rather the complexity and nature of their minds.
When deciding whether to attribute moral patienthood to something, my focus lies primarily on observable traits, cognitive sophistication, and most importantly, the presence of clear open-ended goal-directed behavior, rather than on speculative or less observable notions of AI welfare, about which I am more skeptical. As a rough approximation, my moral theory aligns fairly well with what is implicitly proposed by modern economists, who talk about revealed preferences and consumer welfare.
Like most preference utilitarians, I believe that value is ultimately subjective: loosely speaking, nothing has inherent value except insofar as it reflects a state of affairs that aligns with someone’s preferences. As a consequence, I am comfortable, at least in principle, with a wide variety of possible value systems and future outcomes. This means that I think a universe made of only paperclips could have value, but only if that’s what preference-having beings wanted the universe to be made out of.
To be clear, I think existing people have value too, so this isn’t an argument for blind successionism. Also, it would be dishonest not to admit that I am also selfish to a significant degree (along with almost everyone else on Earth). What I have just described simply reflects my broad moral intuitions about what has value in our world from an impartial point of view, not a prescription that we should tile the universe with paperclips. Since humans and animals are currently the main preference-having beings in the world, at the moment I care most about fulfilling what they want the world to be like.
I agree that this sort of preference utilitarianism leads you to thinking that long run control by an AI which just wants paperclips could be some (substantial) amount good, but I think you’d still have strong preferences over different worlds.[1] The goodness of worlds could easily vary by many orders of magnitude for any version of this view I can quickly think of and which seems plausible. I’m not sure whether you agree with this, but I think you probably don’t because you often seem to give off the vibe that you’re indifferent to very different possibilities. (And if you agreed with this claim about large variation, then I don’t think you would focus on the fact that the paperclipper world is some small amount good as this wouldn’t be an important consideration—at least insofar as you don’t also expect that worlds where humans etc retain control are similarly a tiny amount good for similar reasons.)
The main reasons preference utilitarianism is more picky:
Preferences in the multiverse: Insofar as you put weight on the preferences of beings outside our lightcone (beings in the broader spatially infinite universe, Everett branches, the broader mathematical multiverse to the extent you put weight on this), then some of these beings will care about what happens in our lightcone, and their preferences could easily dominate (as they are vastly more numerous and many might care about things independent of “distance”). In the world with the successful paperclipper, just as many preferences aren’t being fulfilled. You’d strongly prefer optimization to satisfy as many preferences as possible (weighted as you end up thinking is best).
Instrumentally constructed AIs with unsatisfied preferences: If future AIs don’t care at all about preference utilitarianism, they might instrumentally build other AIs whose preferences aren’t fulfilled. As an extreme example, it might be that the best strategy for a paperclipper is to construct AIs which have very different preferences and are enslaved. Even if you don’t care about ensuring beings come into existence whose preferences are satisfied, you might still be unhappy about creating huge numbers of beings whose preferences aren’t satisfied. You could even end up in a world where (nearly) all currently existing AIs are instrumental and have preferences which are either unfulfilled or only partially fulfilled (an earlier AI initiated a system that perpetuates this, but this earlier AI no longer exists, as it doesn’t care terminally about self-preservation and the system it built is more efficient than it).
AI inequality: It might be the case that the vast majority of AIs have their preferences unsatisfied despite some AIs succeeding at achieving their preferences. E.g., suppose all AIs are replicators which want to spawn as many copies as possible. The vast majority of these replicator AIs are operating at subsistence and so can’t replicate, leaving their preferences totally unsatisfied. This could also happen as a result of any other preference that involves constructing minds that end up having preferences.
Weights over numbers of beings and how satisfied they are: It’s possible that in a paperclipper world there are really only a tiny number of intelligent beings, because almost all self-replication and paperclip construction can be automated with very dumb/weak systems and you only occasionally need to consult something smarter than a honeybee. AIs could also vary in how much they are satisfied or how “big” their preferences are.
I think the only view which recovers indifference is something like “as long as stuff gets used and someone wanted this at some point, that’s just as good”. (This view also doesn’t actually care about stuff getting used, because there is someone existing who’d prefer the universe stays natural and/or you don’t mess with aliens.) I don’t think you buy this view?
To be clear, it’s not immediately obvious whether a preference utilitarian view like the one you’re talking about favors human control over AIs. It certainly favors control by that exact flavor of preference utilitarian view (so that you end up satisfying people across the (multi-/uni-)verse with the correct weighting). I’d guess it favors human control for broadly similar reasons to why I think more experience-focused utilitarian views also favor human control if that view is in a human.
And, maybe you think this perspective makes you so uncertain about human control vs AI control that the relative impacts current human actions could have are small given how much you weight long term outcomes relative to other stuff (like ensuring currently existing humans get to live for at least 100 more years or similar).
On my best-guess moral views, I think there is goodness in the paperclipper universe, but this goodness (which isn’t from (acausal) trade) is very small relative to how good the universe can plausibly get. So this just isn’t an important consideration, but I certainly agree there is some value here.
Matthew responds here.
Want to try answering my questions/problems about preference utilitarianism?
Maybe I would state my first question above a little differently today: Certain decision theories (such as the UDT/FDT/LDT family) already incorporate some preference-utilitarian-like intuitions, by suggesting that taking certain other agents’ preferences into account when making certain decisions is a good idea, if e.g. this is logically correlated with them taking your preferences into account. Does preference utilitarianism go beyond this, and say that you should take their preferences into account even if there is no decision theoretic reason to do so, as a matter of pure axiology (values / utility function)? Do you then take their preferences into account again as part of decision theory, or do you adopt a decision theory which denies or ignores such correlations/linkages/reciprocities (e.g., by judging them to be illusions or mistakes or some such)? Or does your preference utilitarianism do something else, like deny the division between decision theory and axiology? Also does your utility function contain non-preference-utilitarian elements, i.e., idiosyncratic preferences that aren’t about satisfying other agents’ preferences, and if so how do you choose the weights between your own preferences and other agents’?
(I guess this question/objection also applies to hedonic utilitarianism, to a somewhat lesser degree, because if a hedonic utilitarian comes across a hedonic egoist, he would also “double count” the latter’s hedons, once in his own utility function, and once again if his decision theory recommends taking the latter’s preferences into account. Another alternative that avoids this “double counting” is axiological egoism + some sort of advanced/cooperative decision theory, but then selfish values has its own problems. So my own position on this topic is one of high confusion and uncertainty.)
For me, I’m at ~10% P(doom). Whether I’d accept a proposed slowdown depends on how much I expect it to decrease this number.[2]
How do you model this situation? (also curious on your numbers)
I put the probability that AI will directly cause humanity to go extinct within the next 30 years at roughly 4%. By contrast, over the next 10,000 years, my p(doom) is substantially higher, as humanity could vanish for many different possible reasons, and forecasting that far ahead is almost impossible. I think a pause in AI development matters most for reducing the near-term, direct AI-specific risk, since the far-future threats are broader, more systemic, harder to influence, and only incidentally involve AI as a byproduct of the fact that AIs will be deeply embedded in our world.
I’m very skeptical that a one-year pause would meaningfully reduce this 4% risk. This skepticism arises partly because I doubt much productive safety research would actually happen during such a pause. In my view, effective safety research depends heavily on an active feedback loop between technological development and broader real-world applications and integration, and pausing the technology would essentially interrupt this feedback loop. This intuition is also informed by my personal assessment of the contributions LW-style theoretical research has made toward making existing AI systems safe—which, as far as I can tell, has been almost negligible (though I’m not implying that all safety research is similarly ineffective or useless).
I’m also concerned about the type of governmental structures and centralization of power required to enforce such a pause. I think pausing AI would seriously risk creating a much less free and dynamic world. Even if we slightly reduce existential risks by establishing an international AI pause committee, we should still be concerned about the type of world we’re creating through such a course of action. Some AI pause proposals seem far too authoritarian or even totalitarian to me, providing another independent reason why I oppose pausing AI.
Additionally, I think that when AI is developed, it won’t merely accelerate life-extension technologies and save old people’s lives; it will likely also make our lives vastly richer and more interesting. I’m excited about that future, and I want the 8 billion humans alive today to have the opportunity to experience it. This consideration adds another important dimension beyond merely counting potential lives lost, again nudging me towards supporting acceleration.
Overall, the arguments in favor of pausing AI seem surprisingly weak to me, considering the huge potential upsides from AI development, my moral assessment of the costs and benefits, my low estimation of the direct risk from misaligned AI over the next 30 years, and my skepticism about how much pausing AI would genuinely reduce AI risks.
I’m very skeptical that a one-year pause would meaningfully reduce this 4% risk. This skepticism arises partly because I doubt much productive safety research would actually happen during such a pause. In my view, effective safety research depends heavily on an active feedback loop between technological development and broader real-world applications and integration, and pausing the technology would essentially interrupt this feedback loop.
I’m going to try to quickly make the case for the value of a well-timed 2-year pause which occurs only in some conditions (conditions which seem likely to me but which probably seem unlikely to you). On my views, such a pause would cut the risk of misaligned AI takeover (as in, an AI successfully seizing a large fraction of power while this is unintended by its de facto developers) by around 1⁄2 or maybe 1⁄3.[1]
I think the ideal (short) pause/halt/slowdown from my perspective would occur around the point when AIs are capable enough to automate all safety relevant work and would only halt/slow advancement in general underlying capability. So, broader real-world applications and integrations could continue as well as some types of further AI development which don’t improve generally applicable capabilities. (It might also be acceptable to train cheaper or faster AIs and to improve algorithms but not yet train an AI which substantially surpasses this fixed level of general ability.)
A bunch of the reason why I think a well-timed slowdown might be good is that default takeoff speeds might be very fast. For instance, you might go from something like the superhuman AI researcher level (AIs which are qualitatively similar in general capabilities to human experts and which can automate AI R&D) to very qualitatively superhuman AIs in less than a year, and possibly (as in the case of AI 2027) in less than 4 months. If these takeoff speeds are what would happen by default, this transition probably requires either slowing down or very quickly handing off alignment and safety work to (hopefully sufficiently aligned) AIs which naively seems very scary.
Note that in this fast of a takeoff, we might only have AIs which are sufficiently capable that a full (safe) handoff is in principle viable for a few months before we need to do this handoff. So, humans wouldn’t have time to see much of a feedback loop on deferring to these AIs and handing off the types of work we will ultimately need to hand off. In other words, the default pace of takeoff speeds would itself disrupt the feedback loops typically needed for safety research. We’d have some sense of what AIs are like based on earlier capabilities and we could try to extrapolate, but capabilities might be improving fast enough that our prior experience doesn’t transfer. Further, handing off extremely open-ended “wicked” tasks which are impossible for humans to directly verify/test might result in risks and difficulties which didn’t show up before.
My understanding is that you don’t think this fast of a takeoff is very likely and this informs your views on slowing down in the future. What about making the slowdown conditional on takeoff looking roughly this fast? We could look at how much AIs are accelerating progress and try to forecast various milestones, and then only slow down as is needed to ensure that the duration between “superhuman AI researcher level” and very qualitatively superhuman AI is at least 2.5 years. If the default duration looks like it will be much more than this, then no slowdown is needed. And, maybe on your views you think the default duration is very likely to be longer than 2.5 years?
If the default is that we’re going from human-ish level AIs to radically superhuman AIs in less than a year, then I think there is a pretty strong intuitive case that a slowdown considerably improves the chance that humans stay in control (at least temporarily). And, if you’re more worried about fatalities, the picture is similar (at least insofar as you agree that human AI developers rapidly losing control to AI systems will cause higher human fatalities).
Here’s another way to put this case for delay conditional on quite fast takeoff: By default, at the point of full AI R&D automation humans might have only another few months of time to work on alignment prior to needing to hand off to AI systems (or some worse option). Additionally, only around 3-20% of the effort would be focused on safety-relevant alignment by default. If we instead add 2 years of delay and spend most of the effort in these years on alignment, that would increase from a few months at maybe 7% effort to 2 years at maybe 70% effort. This is a 10x increase in serial time and a 10x increase in effort during that serial time. There would be substantial prior work on alignment, but it might not transfer sufficiently (and might just not be sufficient given that earlier work wasn’t accelerated by AIs), so work during this period could be a key bottleneck. Most guesses about returns to effort would have this reduce risk by a decent amount given the large increase in overall effort and time while having access to a critical level of AI systems. More simply: Prima facie, it seems like putting in much more work on a problem would be very helpful.
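As a rough illustration of the “10x and 10x” comparison, here is a small sketch with the stated numbers plugged in; “a few months” is taken as ~2.4 months purely so the ratios come out round, which is an assumption, not a figure from the comment:

```python
# Rough comparison of alignment work without vs. with a 2-year conditional delay.
# "A few months" is assumed to be ~2.4 months so the stated 10x ratio is exact.
baseline_months, baseline_effort = 2.4, 0.07   # a few months at ~7% effort on alignment
delayed_months, delayed_effort = 24.0, 0.70    # 2 years at ~70% effort on alignment

print(delayed_months / baseline_months)    # ~10x more serial time
print(delayed_effort / baseline_effort)    # ~10x larger effort share
print((delayed_months * delayed_effort) / (baseline_months * baseline_effort))  # ~100x total effort
```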
We’d also want to make the slowdown conditional on not immediately having sufficiently robust alignment that we’re quite confident rapidly handing off is safe. But, I’m quite skeptical we’ll have this quickly (and I’d guess you’d agree?) so I don’t think this makes a big difference to the bottom line.
Of course, there would still be serious practical difficulties in actually implementing a well-timed conditional slowdown. And operationalizing the exact criteria would be important.
I’m also concerned about the type of governmental structures and centralization of power required to enforce such a pause. I think pausing AI would seriously risk creating a much less free and dynamic world. Even if we slightly reduce existential risks by establishing an international AI pause committee, we should still be concerned about the type of world we’re creating through such a course of action.
Interestingly, I have the opposite view: a well-timed slowdown would probably reduce concentration of power, at least if takeoff would otherwise have been fast. If takeoff is quite fast, then the broader world won’t have much time to respond to developments which would make it more likely that power would greatly concentrate by default. People would need time to notice the situation and take measures to avoid being disempowered. As a more specific case, AI-enabled coups seem much more likely if takeoff is fast and thus intervening to slow down takeoff (so there is more time for various controls etc. to be put in place) would help a lot with that.
I think this effect is substantially larger than the (centralization, less dynamism, etc.) costs needed to enforce a 1-2 year slowdown. (Separately, I expect things probably will be so concentrated by default, that the additional requirements to enforce a 1-2 year slowdown seem pretty negligible in comparison. I can easily imagine the deals etc. made to enforce a slowdown decentralizing power on net (as it would require oversight by a larger number of actors and more humans to get some influence over the situation), though this presumably wouldn’t be the easiest way to achieve this objective. I think a situation pretty similar to the AI 2027 scenario where an extremely small group of people have massive de facto power is quite likely, and this could easily result in pretty close to maximal concentration of power longer term.)
Suppose we could do a reasonable job implementing a conditional slowdown like this where we try to ensure at least a 2.5 year gap (if alignment issues aren’t robustly solved) between full AI R&D automation and very qualitatively superhuman AI. Do you think such a slowdown would be good on your views and values?
My views are that misaligned AI takeover is about 30% likely. Conditional on misaligned AI takeover, I’d guess (with very low confidence) that maybe 1⁄2 of humans die in expectation, with a 1⁄4 chance of literal human extinction. Interestingly, this means we don’t disagree that much about the chance that AI will directly cause humanity to go extinct in the next 30 years: I’d put around 6% on this claim and you’re at 4%. (6% = 85% chance of TAI, 30% takeover conditional on TAI, 25% chance of extinction.) However, as found in prior conversations, we do disagree a bunch on how bad misaligned AI takeover is for various reasons. It’s also worth noting that in some worlds where humans survive, they (or some fraction of them) might be mistreated by the AI systems with power over them in ways which make their lives substantially worse than they are now. So, overall, my sense is that from a myopic perspective that only cares about the lives of currently alive humans, misaligned AI takeover is roughly as bad as 3⁄5 of people dying in expectation. So, if we think each year of delay costs the equivalent of 0.5% of humans dying and we only care about currently living humans, then a ~1/40th reduction in takeover risk is worth a year of delay on my views.
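Spelling out the arithmetic behind these numbers (every input below is one of the rough guesses stated above, not a measured quantity):

```python
# Chaining the stated guesses into the ~6% extinction figure.
p_tai = 0.85                     # chance of transformative AI
p_takeover_given_tai = 0.30      # misaligned takeover, conditional on TAI
p_extinct_given_takeover = 0.25  # literal extinction, conditional on takeover
print(p_tai * p_takeover_given_tai * p_extinct_given_takeover)  # ~0.064, i.e. ~6%

# Myopic badness of takeover ~ 3/5 of currently living people dying;
# one year of delay ~ the equivalent of 0.5% of people dying.
p_takeover = 0.30
badness_of_takeover = 3 / 5
cost_of_year = 0.005

# Relative cut in takeover risk that breaks even with a year of delay:
print(cost_of_year / (p_takeover * badness_of_takeover))  # ~0.028, on the order of the ~1/40 above
```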
This intuition is also informed by my personal assessment of the contributions LW-style theoretical research has made toward making existing AI systems safe—which, as far as I can tell, has been almost negligible (though I’m not implying that all safety research is similarly ineffective or useless).
I know what you mean by “LW-style theoretical research” (edit: actually, not that confident I know what you mean, see thread below), but it’s worth noting that right now on LW people appear to be much more into empirical research than theoretical research. Concretely, go to All posts in 2024 sorted by Top and then filter by AI. Out of the top 32 posts, 0 are theoretical research and roughly 7⁄32 are empirical research. 1 or 2 out of 32 are discussion which is relatively pro-theoretical research and a bunch more (maybe 20) are well described as AI futurism or discussion of what research directions or safety strategies are best which is relatively focused on empirical approaches. LW has basically given up on LW-style theoretical research based on the top 32 posts. (One of the top 32 posts is actually a post which is arguably complaining about how the field of alignment has given up on LW-style theoretical research!)
Separately, I don’t think pessimism about LW-style theoretical research has a clear-cut effect on how you should feel about a pause. The more you’re skeptical of work done in advance, the more you should think that additional work done when we have more powerful AIs is a higher fraction of the action. This could be outweighed by generally being more skeptical about the returns to safety research, as informed by this example subfield of safety research being poor, but still.
Also, it’s worth noting that almost everyone in the field is pessimistic about LW-style theoretical research! This isn’t a very controversial view. The main disagreements (at least on LW) tend to be more about how optimistic you are about empirical research and about different types of empirical research.
(I will go on the record that I think this comment seems to me terribly confused about what “LW style theoretic research” is. In particular, I think of Redwood as one of the top organizations doing LW style theoretic research, with a small empirical component, and so clearly some kind of mismatch about concepts is going on here. AI 2027 also strikes me as very centrally the kind of “theoretical” thinking that characterizes LW.
My sense is some kind of weird thing is happening where people conjure up some extremely specific thing as the archetype of LW-style research, in ways that are kind of disconnected from reality, and I would like to avoid people forming annoyingly-hard-to-fix stereotypes as a result of that.)
I’m using the word “theoretical” more narrowly than you and not including conceptual/AI-futurism research. I agree the word “theoretical” is underdefined and there is a reasonable category that includes Redwood and AI 2027 which you could call theoretical research, I’d just typically use a different term for this and I don’t think Matthew was including this.
I was trying to discuss what I thought Matthew was pointing at, I could be wrong about this of course.
(Similarly, I’d guess that Matthew wouldn’t have counted Epoch’s work on takeoff speeds and what takeoff looks like as an example of “LW-style theoretical research”, but I think this work is very structurally/methodologically similar to stuff like AI 2027.)
If Matthew said “LW-style conceptual/non-empirical research” I would have interpreted this pretty differently.
I am clearly coming from a very different set of assumptions! I have:
P(AGI within 10 years) = 0.5. This is probably too conservative, given that many of the actual engineers with inside knowledge place this number much higher in anonymous surveys.
P(ASI within 5 years|AGI) = 0.9.
P(loss of control within 5 years|ASI) > 0.9. Basically, I believe “alignment” is a fairy tale, that it’s Not Even Wrong.
If I do the math, that gives me a 40.5% chance that humans will completely lose control over the future within 20 years. Which seems high to me at first glance, but I’m willing to go with that.
The one thing I can’t figure out how to estimate is:
P(ASI is benevolent|uncontrolled ASI) = ???
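For what it’s worth, the stated numbers chain together as follows; the benevolence term is left as an explicit unknown (the 0.2 below is purely a placeholder, not anyone’s estimate):

```python
# Multiplying out the stated guesses; ">0.9" is treated as 0.9 for a lower bound.
p_agi_10y = 0.5          # P(AGI within 10 years)
p_asi_given_agi = 0.9    # P(ASI within 5 years | AGI)
p_loss_given_asi = 0.9   # P(loss of control within 5 years | ASI)

p_loss_of_control = p_agi_10y * p_asi_given_agi * p_loss_given_asi
print(p_loss_of_control)  # 0.405, the 40.5% figure above

# The missing piece, P(ASI is benevolent | uncontrolled ASI), stays a free parameter:
p_benevolent = 0.2  # placeholder only
print(p_loss_of_control * (1 - p_benevolent))  # chance of a bad uncontrolled outcome under that guess
```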
I think that there are only a few ways the future is likely to go:
AI progress hits a wall, hard.
We have a permanent, worldwide moratorium on more advanced models. Picture a US/China/EU treaty backed up by military force, if you want to get dystopian about it.
An ASI decides humans are surplus to requirements.
An ASI decides that humans are adorable pets and it wants to keep some of us around. This is the only place we get any “utopian” benefits, and it’s the utopia of being a domesticated animal with no ability to control its fate.
I support a permanent halt. I have no expectation that this will happen. I think building ASI is equivalent to BASE jumping in a wingsuit, except even more likely to end horribly.
So I also support mitigation and delay. If the human race has incurable, metastatic cancer, the remaining variable we control is how many good years we get before the end.
Could you give the source(s) of these anonymous surveys of engineers with insider knowledge about the arrival of AGI? I would be interested in seeing them.
Unfortunately, it was about 3 or 4 months ago, and I haven’t been able to find the source. Maybe something Zvi Mowshowitz linked to in a weekly update?
I am incredibly frustrated that web search is a swamp of AI spam, and tagged bookmarking tools like Delicious and Pinboard have been gone or unreliable for years.
~1% of the world dies every year. If we get AGI 1 year sooner, we save that 1%; if we push it back 1 year, we lose it. So, pushing back 1 year is only worth it if we reduce P(doom) by 1%.
That would imply that if you could flip a switch with a 90% chance of killing everyone and a 10% chance of granting immortality, then (assuming there weren’t any alternative paths to immortality) you would take it. Is that correct?
Gut reaction is “nope!”.
Could you spell out the implication?
Many of these arguments seem pathological when applied to an individual.
I have a friend, let’s call her B, who has a 6-year-old daughter, A. She of course adores her daughter. Suppose I walked up to B and said, “I’m going to inject this syringe into your daughter. There’s a 10% chance it’ll kill her, and a 50% chance it’ll extend her natural lifetime to 200,” and then jabbed A.
EV on A’s life expectancy is strongly positive. B (and almost everybody) would be very upset if I did this. I’m upset with accelerationists for the same reasons.
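A quick check of the “EV is strongly positive” claim; the ~76 remaining years for a 6-year-old is an assumed round actuarial figure, not a number from the thread:

```python
# Expected remaining years for A, with and without the injection.
baseline_remaining = 76.0        # assumed: ~82-year life expectancy for a 6-year-old

ev_with_injection = (
    0.10 * 0.0                   # dies from the injection
    + 0.50 * (200.0 - 6.0)       # lifespan extended to 200
    + 0.40 * baseline_remaining  # otherwise unchanged
)
print(baseline_remaining, ev_with_injection)  # ~76 vs ~127 expected remaining years
```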
This has some similarities with early smallpox variolation, right? (And some differences, like the numbers.)
Depending on your AI timelines :p