Given the counterarguments, I don’t see a reason to think this is more than single-digit-percent likely to be especially relevant. (I can see >9% likelihood the AIs are “nice enough that something interesting-ish happens” but not >9% likelihood that we shouldn’t think the outcome is still extremely bad. The people who think otherwise seem extremely motivatedly-cope-y to me.)
I think the arguments given in the online supplement for “AIs will literally kill every single human” fail to engage with the best counterarguments in a serious way. I get the sense that many people’s complaints are of this form: the book does a bad job engaging with the strongest counterarguments in a way that is epistemically somewhat bad. (Idk if it violates group epistemic norms, but it seems like it is probably counterproductive. I guess this is most similar to complaint #2 in your breakdown.)
Specifically:
They fail to engage with the details of “how cheap is it actually for the AI to keep humans alive” in this section. Putting aside killing humans as part of a takeover effort, avoiding boiling the oceans (or eating the biosphere, etc.) maybe delays you for something like a week to a year. Each year of delay costs you ~1/3-billionth of resources, so this is actually very cheap if you care in a scope-sensitive and patient way (rough arithmetic sketched below). Additionally, keeping humans alive through boiling the oceans might be extremely cheap, especially given very fast takeoff; this might lower costs to maybe more like 1/trillion. Regardless, this is much cheaper and much more salient than “keeping a pile of 41 stones in your house”. (I’d guess you’d have to pay most American households more than a tenth of a penny to keep a pile of stones in their house.)
They don’t talk at all about trade arguments for keeping humans alive, even though these are a substantial fraction of the case, (edit:) aside from this footnote which doesn’t really engage in a serious way[1] (and is a footnote). (This doesn’t count.)
The argument that “humans wouldn’t actually preserve the environment” misses the relevant analogy which is more like “humans come across some intelligent aliens who say they want to be left alone and leaving these aliens alone is pretty cheap from our perspective, but these aliens aren’t totally optimized from our perspective, e.g. they have more suffering than we’d like in their lives”. This situation wouldn’t result in us doing something to the aliens that they consider as bad as killing them all, so the type of kindness humans have is actually sufficient.
Note also that it’s very expensive for the AI to not boil the oceans / etc. as fast as possible, since that means losing many galaxies’ worth of resources, so it seems like it’s not enough to be “very slightly” nice – it has to be, like, pretty actively nice.
For a patient AI, it costs something like 1/billion to 1/300 billion of resources, which seems extremely cheap. E.g., for a person with a net worth of $1 million, it would require them to spare a tenth of a penny to a thousandth of a penny. I think this counts as “very slightly nice”? It seems pretty misleading to describe this as “very expensive”, though I agree the total amount of resources is large in an absolute sense.
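(For concreteness, a quick back-of-the-envelope in Python for the numbers above. The per-year cost fraction, the week-to-year delay range, and the $1 million net worth are just the figures quoted in this thread, not independently derived estimates.)

```python
# Rough back-of-the-envelope for the cost estimates above.
# Inputs are the figures quoted in this thread, not independent estimates.

cost_per_year = 1 / 3e9      # ~1/3-billionth of reachable resources lost per year of delay
delay_low_years = 1 / 52     # "a week..."
delay_high_years = 1.0       # "...to a year"

frac_low = delay_low_years * cost_per_year    # ~6e-12 of total resources
frac_high = delay_high_years * cost_per_year  # ~3e-10 of total resources
print(f"delay cost: {frac_low:.1e} to {frac_high:.1e} of total resources")

# Net-worth analogy: the same order of fractions applied to a $1 million net worth.
net_worth_usd = 1_000_000
for frac in (1e-9, 1 / 300e9):
    print(f"{frac:.2e} of $1M is ${net_worth_usd * frac:.7f}")
# 1e-9 of $1M is $0.001 (a tenth of a penny);
# 1/300 billion of $1M is ~$0.0000033 (thousandths of a penny).
```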
There are some flavors of “AI might be slightly nice” that are interesting. But they don’t seem to change any of our decisions. It just makes us a bit more hopeful about the end result.
For instance, why not think this results in a reasonable chance of the humans surviving in their normal physical bodies and being able to live the lives they want to live, rather than being in an “alien zoo”?
I don’t have much time to engage rn and probably won’t be replying much, but some quick takes:
a lot of my objection to superalignment type stuff is a combination of: (a) this sure feels like that time when people said “nobody would be dumb enough to put AIs on the internet; they’ll be kept in a box” and eliezer argued “even then it could talk its way out of the box,” and then in real life AIs are trained on servers that are connected to the internet, with evals done only post-training. the real failure is that earth doesn’t come close to that level of competence. (b) we predictably won’t learn enough to stick the transition between “if we’re wrong we’ll learn a new lesson” and “if we’re wrong it’s over.” i tried to spell these true-objections out in the book. i acknowledge it doesn’t go to the depth you might think the discussion merits. i don’t think there’s enough hope there to merit saying more about it to a lay audience. i’m somewhat willing to engage with more-spelled-out superalignment plans, if they’re concrete enough to critique. but it’s not my main crux; my main cruxes are that it’s superficially the sort of wacky scheme that doesn’t cross the gap between Before and After on the first try in real life, and separately that the real world doesn’t look like any past predictions people made when they argued it’ll all be okay because the future will handle things with dignity; the real world looks like a place that generates this headline.
my answer to “how cheap is it actually for the AI to keep humans alive” is not “it’s expensive in terms of fractions of the universe” but rather “it’d need a reason”, and my engagement with “it wouldn’t have a reason” is mostly here, rather than the page you linked.
my response to the trade arguments as I understand them is here plus in the footnotes here. If this is really the key hope held by the world’s reassuring voices, I would prefer that they just came out and said it plainly, in simple words like “I think AI will probably destroy almost everything, but I think there’s a decent chance they’ll sell backups of us to distant aliens instead of leaving us dead” rather than in obtuse words like “trade arguments”.
If humans met aliens that wanted to be left alone, it seems to me that we sure would peer in and see if they were doing any slavery, or any chewing agonizing tunnels through other sentient animals, or etc. The section you linked is trying to make an argument like: “Humans are not a mixture of a bunch of totally independent preferences; the preferences interleave. If AI cares about lots of stuff like how humans care about lots of stuff, it probably doesn’t look like humans getting a happy ending to a tiny degree, as opposed to humans getting a distorted ending.” Maybe you disagree with this argument, but I dispute that I’m not even trying to engage with the core arguments as I understand them (while also trying to mostly address a broad audience rather than what seems-to-me like a weird corner that locals have painted themselves into, in a fashion that echoes the AI box arguments of the past).
It seems pretty misleading to describe this as “very expensive”, though I agree the total amount of resources is large in an absolute sense.
Yep, “very expensive” was meant in an absolute sense (e.g., in terms of matter and energy), not in terms of universe-fractions. But the brunt of the counterargument is not “the cost is high as a fraction of the universe”, it’s “the cost is real, so the AI would need some reason to pay it, and we don’t know how to get that reason in there.” (And then in anticipation of “maybe the AI values almost everything a little, because it’s a mess just like us?”, I continue: “Messes have lots of interaction between the messy fragments, rather than a clean exactly-what-humans-really-want component that factors out at some low volume, on the order of a 1-in-a-billion part. If the AI gets preferences vaguely about us, it wouldn’t be pretty.” And then in anticipation of: “Okay, maybe the AI doesn’t wind up with much niceness per se, but aren’t there nice aliens who would buy us?”, I continue: “Sure, could happen, that merits a footnote. But also can we back up and acknowledge how crazy of a corner we’ve wandered into here?”) Again: maybe you disagree with my attempts to engage with the hard Qs, but I dispute the claim that we aren’t trying.
(ETA: Oh, and if by “trade arguments” you mean the “ask weak AIs for promises before letting them become strong” stuff rather than the “distant entities may pay the AI to be nice to us” stuff, the engagement is here plus in the extended discussion linked from there, rather than in the section you linked.)
Also: I find it surprising and sad that so many EAs/rats are responding with something like: “The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs, and it does not do enough justice to my acausal-trade theory of why AI will ruin the future and squander the cosmic endowment but maybe allow current humans to live out a short happy ending in an alien zoo. So unfortunately I cannot signal boost this book.” rather than taking the opportunity to say “Yeah holy hell the status quo is insane and the world should stop; I have some ideas that the authors call “alchemist schemes” that I think have a decent chance but Earth shouldn’t be betting on them and I’d prefer we all stop.” I’m still not quite sure what to make of it.
(tbc: some EAs/rats do seem to be taking the opportunity, and i think that’s great)
The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs
FWIW that’s not at all what I mean (and I don’t know of anyone who’s said that). What I mean is much more like what Ryan said here:
I expect that by default superintelligence is built after a point where we have access to huge amounts of non-superintelligent cognitive labor, so it’s unlikely that we’ll be using current methods and current understanding (unless humans have already lost control by this point, which seems totally plausible, but not overwhelmingly likely nor argued convincingly for by the book). Even just looking at capabilities, I think it’s pretty likely that automated AI R&D will result in us operating in a totally different paradigm by the time we build superintelligence—this isn’t to say this other paradigm will be safer, just that a narrow description of “current techniques” doesn’t include the default trajectory.
I think the online resources touch on that in the “more on making AIs solve the problem” subsection here. With the main thrust being: I’m skeptical that you can stack lots of dumb labor into an alignment solution, and skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — “the world doesn’t work like that” and “the plan won’t survive the gap between Before and After on the first try” — also applying in force, on my view.)
(Tbc, I’m not trying to insinuate that everyone should’ve read all of the online resources already; they’re long. And I’m not trying to say y’all should agree; the online resources are geared more towards newcomers than to LWers. I’m not even saying that I’m getting especially close to your latest vision; if I had more hope in your neck of the woods I’d probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren’t particularly Redwoodish in their starting assumptions. I am merely trying to evidence my attempts to reply to what I understand to be the counterarguments, subject to constraints of targeting this mostly towards newcomers.)
FWIW, I have read those parts of the online resources.
You can obviously summarize me however you like, but my favorite summary of my position is something like “A lot of things will have changed about the situation by the time that it’s possible to build ASI. It’s definitely not obvious that those changes mean that we’re okay. But I think that they are a mechanically important aspect of the situation to understand, and I think they substantially reduce AI takeover risk.”
Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you’d prefer?)
“Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation.”
(Separately: I acknowledge that if there’s one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make “things are gonna change” an argument that supports the claim that the future will have a much better chance than we’d have if ChatGPT-6 was all it took.)
I would say my summary for hope is more like:

It seems pretty likely to be doable (with lots of human-directed weak AI labor and/or controlled stronger AI labor) to use iterative and prosaic methods within roughly the current paradigm to sufficiently align AIs which are slightly superhuman. In particular, AIs which are capable enough to be better than humans at safety work (while being much faster and having other AI advantages), but not much more capable than this. This also requires doing a good job of eliciting capabilities and making the epistemics of these AIs reasonably good.
Doable doesn’t mean easy or going to happen by default.
If we succeeded in aligning these AIs and handing off to them, they would be in a decent position for further ongoing alignment work (e.g. aligning a somewhat smarter successor which itself aligns its successor, and so on, or scalably solving alignment) and also in a decent position to buy more time for solving alignment.
I don’t think this is all of my hope, but if I felt much less optimistic about these pieces, that would substantially change my perspective.
FWIW, I don’t really consider myself to be responding to the book at all (in a way that is public or salient to your relevant audience), and my reasons for not signal boosting the book aren’t really downstream of the content in the book in the way you describe. (More like, I feel sign-uncertain about making you/Eliezer more prominent as representatives of the “avoid AI takeover movement” for a wide variety of reasons and think this effect dominates. And I’m not sure I want to be in the business of signal boosting books, though this is less relevant.)
To clarify my views on “will misaligned AIs that succeed in seizing all power have a reasonable chance of keeping (most/many) humans alive”:
I think this isn’t very decision relevant and is not that important. I think AI takeover kills the majority of humans in expectation, due to both the takeover itself and killing humans afterwards (as a side effect of industrial expansion, eating the biosphere, etc.), and there is a substantial chance of literal every-single-human-is-dead extinction conditional on AI takeover (30%?). Regardless, it destroys most of the potential value of the long-run future, and I care mostly about this.
So at least for me it isn’t true that “this is really the key hope held by the world’s reassuring voices”. When I discuss how I think about AI risk, this mostly doesn’t come up, and when it does I might say something like “AI takeover would probably kill most people and seems extremely bad overall”. Have you ever seen someone prominent pushing a case for “optimism” on the basis of causal trade with aliens / acausal trade?
The reason I brought up this topic is that I think it’s bad to make incorrect or weak arguments:
I think smart people will (correctly) notice these arguments seem motivated or weak and then on the basis of this epistemic spot check dismiss the rest. In argumentation, avoiding overclaiming has a lot of rhetorical benefits. I was using “but will the AI actually kill everyone” as an example of this. I think the other main case is “before superintelligence, will we be able to get a bunch of help with alignment work?” but there are other examples.
Worse, bad arguments/content result in negative polarization of somewhat higher-context people who might otherwise have been somewhat sympathetic or at least indifferent. This is especially costly from the perspective of getting AI company employees to care. I get that you don’t care (much) about AI company employees because you think that radical change is required for there to be any hope, but I think marginal increases in caring among AI company employees substantially reduce risk (though aren’t close to sufficient for the situation being at all reasonable/safe).
Confidently asserted bad arguments, and things people strongly disagree with, make it harder for people to join a coalition. Like, from an integrity perspective, I would need to caveat saying I agree with the book, even though I do agree with large chunks of it, and the extent to which I feel the need to caveat this could be reduced. IDK how much you should care about this, but insofar as you care about people like me joining some push you’re trying to make happen, this sort of thing makes some difference.
I do think this line of argumentation would make the title literally wrong even if I thought the probability of AI takeover was much higher. I’m not sure how much to care about this, but I do think it randomly imposes a bunch of costs to brand things as “everyone dies” when a substantial fraction of the coalition you might want to work with disagrees and it isn’t a crux. Like, does the message punchiness outweigh the costs here from your perspective? IDK.
Responding to some specific points:
a lot of my objection to superalignment type stuff is a combination of:
I agree that automating alignment with AIs is pretty likely to go very poorly due to incompetence. I think this could go either way, and further effort on trying to make this go better is a pretty cost-effective way (in terms of using our labor etc.) to marginally reduce doom, though it isn’t going to result in a reasonable/safe situation.
that the real world doesn’t look like any past predictions people made when they argued it’ll all be okay because the future will handle things with dignity
To be clear, I don’t think things will be OK exactly, nor do I expect that much dignity, though I think I do expect more dignity than you do. My perspective is more like “there seem to be some pretty effective ways to reduce doom at the margin” than “we’ll be fine because XYZ”.
my response to the trade arguments as I understand them is here plus in the footnotes here
I don’t think this seriously engages with the argument, though due to this footnote, I retract “they don’t talk at all about trade arguments for keeping humans alive” (I edited my comment).
As far as this section, I agree that it’s totally fine to say “everybody dies” if it’s overwhelmingly more likely everyone dies. I don’t see how this responds to the argument that “it’s not overwhelmingly likely everyone dies because of acausal (and causal) trade”. I don’t know how important this is, but I also don’t know why you/Eliezer/MIRI feel like it’s so important to argue against this as opposed to saying something like: “AI takeover seems extremely bad and like it would at least kill billions of us. People disagree on exactly how likely vast numbers of humans dying as a result of AI takeover is, but we think it’s at least substantial due to XYZ”. Is it just because you want to use the “everybody dies” part of the title? Fair enough I guess...
If humans met aliens that wanted to be left alone, it seems to me that we sure would peer in and see if they were doing any slavery, or any chewing agonizing tunnels through living animals, or etc.
Sure, but would the outcome for the aliens be as bad or worse than killing all of them from their perspective? I’m skeptical.
Ty! For the record, my reason for thinking it’s fine to say “if anyone builds it, everyone dies” despite some chance of survival is mostly spelled out here. Relative to the beliefs you spell out above, I think the difference is a combination of (a) it sounds like I find the survival scenarios less likely than you do; (b) it sounds like I’m willing to classify more things as “death” than you are.
For examples of (b): I’m pretty happy to describe as “death” cases where the AI makes things that are to humans what dogs are to wolves, or (more likely) makes some other strange optimized thing that has some distorted relationship to humanity, or cases where digitized backups of humanity are sold to aliens, etc. I feel pretty good about describing many exotic scenarios as “we’d die” to a broad audience, especially in a setting with extreme length constraints (like a book title). If I were to caveat with “except maybe backups of us will be sold to aliens”, I expect most people to be confused and frustrated about me bringing that point up. It looks to me like most of the least-exotic scenarios are ones that route through things that lay audience members pretty squarely call “death”.
It looks to me like the even more exotic scenarios (where modern individuals get “afterlives”) are in the rough ballpark of quantum immortality / anthropic immortality arguments. AI definitely complicates things and makes some of that stuff more plausible (b/c there’s an entity around that can make trades and has a record of your mind), but it still looks like a very small factor to me (washed out e.g. by alien sales) and feels kinda weird and bad to bring it up in a lay conversation, similar to how it’d be weird and bad to bring up quantum immortality if we were trying to stop a car speeding towards a cliff.
FWIW, insofar as people feel like they can’t literally support the title because they think that backups of humans will be sold to aliens, I encourage them to say as much in plain language (whenever they’re critiquing the title). Like: insofar as folks think the title is causing lay audiences to miss important nuance, I think it’s an important second-degree nuance that the allegedly-missing nuance is “maybe we’ll be sold to aliens”, rather than something less exotic than that.
(b) it sounds like I’m willing to classify more things as “death” than you are.
I don’t think this matters much. I’m happy to consider non-consensual uploading to be death and I’m certainly happy to consider “the humans are modified in some way they would find horrifying (at least on reflection)” to be death. I think “the humans are alive in the normal sense of alive” is totally plausible and I expect some humans to be alive in the normal sense of alive in the majority of worlds where AIs take over.
Making uploads is barely cheaper than literally keeping physical humans alive after AIs have fully solidified their power, I think (maybe 0-3 OOMs more expensive or something), so I don’t think non-consensual uploads are that much of the action. (I do think rounding humans up into shelters is relevant.)
(To answer your direct Q, re: “Have you ever seen someone prominent pushing a case for “optimism” on the basis of causal trade with aliens / acaual trade?”, I have heard “well I don’t think it will actually kill everyone because of acausal trade arguments” enough times that I assumed the people discussing those cases thought the argument was substantial. I’d be a bit surprised if none of the ECLW folks thought it was a substantial reason for optimism. My impression from the discussions was that you & others of similar prominence were in that camp. I’m heartened to hear that you think it’s insubstantial. I’m a little confused why there’s been so much discussion around it if everyone agrees it’s insubstantial, but have updated towards it just being a case of people who don’t notice/buy that it’s washed out by sale to hubble-volume aliens and who are into pedantry. Sorry for falsely implying that you & others of similar prominence thought the argument was substantial; I update.)
(I mean, I think it’s a substantial reason to think that “literally everyone dies” is considerably less likely and makes me not want to say stuff like “everyone dies”, but I just don’t think it implies much optimism exactly because the chance of death still seems pretty high and the value of the future is still lost. Like I don’t consider “misaligned AIs have full control and 80% of humans survive after a violent takeover” to be a good outcome.)
Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that’ll be shipped. I agree this isn’t sufficient, of course.
(I think it would be cool if someone wrote up a “how to evaluate your model in a reasonable way during its training loop” piece, which accounted for the different types of safety evals people do. I also wish that task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging.)
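(A minimal sketch, in Python, of the kind of in-training-loop eval setup being gestured at above. The function names, eval cadence, and the task-specific fine-tuning step are hypothetical placeholders, not a description of any company’s actual pipeline.)

```python
# Hypothetical skeleton: cheap safety evals run periodically inside a training loop,
# plus fuller eval suites on late-but-not-final checkpoints. All functions below are
# placeholder stubs, not any real company's pipeline.

CHEAP_EVAL_EVERY_STEPS = 10_000                # run a cheap safety suite this often
FULL_EVAL_CHECKPOINT_STEPS = {80_000, 90_000}  # fuller suite on late checkpoints

def train_one_step(model, step):
    return model  # stub for one optimizer step

def save_checkpoint(model, step):
    return f"ckpt-{step}"  # stub: returns a checkpoint handle

def finetune_for_eval_task(checkpoint, task):
    # task-specific fine-tuning before scoring, as one way of perhaps reducing sandbagging
    return checkpoint  # stub

def run_safety_evals(checkpoint, suite):
    return {"checkpoint": checkpoint, "suite": suite, "flagged": False}  # stub scores

def training_run(model, total_steps=100_000):
    results = []
    for step in range(1, total_steps + 1):
        model = train_one_step(model, step)
        if step % CHEAP_EVAL_EVERY_STEPS == 0:
            results.append(run_safety_evals(save_checkpoint(model, step), suite="cheap"))
        if step in FULL_EVAL_CHECKPOINT_STEPS:
            ckpt = finetune_for_eval_task(save_checkpoint(model, step),
                                          task="dangerous-capability-evals")
            results.append(run_safety_evals(ckpt, suite="full"))
    return results

if __name__ == "__main__":
    print(len(training_run(model=None)))  # 12 eval results with the default settings
```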
Fwiw I do just straightforwardly agree that “they might be slightly nice, and it’s really cheap” is a fine reason to disagree with the literal title. I have some odds on this, and a lot of model uncertainty about this.
The argument that “humans wouldn’t actually preserve the environment” misses the relevant analogy which is more like “humans come across some intelligent aliens who say they want to be left alone and leaving these aliens alone is pretty cheap from our perspective, but these aliens aren’t totally optimized from our perspective, e.g. they have more suffering than we’d like in their lives”. This situation wouldn’t result in us doing something to the aliens that they consider as bad as killing them all, so the type of kindness humans have is actually sufficient.
A thing that is cruxy to me here is that the sort of thing real-life humans have done is get countries addicted to opium so they can control their economy, wipe out large swaths of a population while relocating the survivors to reservations, carve up a continent for the purposes of a technologically powerful coalition, etc.
Superintelligences would be smarter than Europeans and have an easier time doing things we’d consider moral, but I also think Europeans would be dramatically nicer than AIs.
I can imagine the “it’s just sooooo cheap, tho” argument winning out. I’m not saying these considerations add up to “it’s crazy to think they’d be slightly nice.” But, it doesn’t feel very likely to me.