In the language of ‘Superintelligent AI is necessary for an amazing future but far from sufficient’, I expect that the majority of possible s-risks are weak dystopias rather than strong dystopias. We’re unlikely to succeed at alignment enough and then sign-flip it (I expect strong dystopia to be dominated by ‘we succeed at alignment to an extreme degree’ ^ ‘our architecture is not resistant to sign flips’ ^ ‘somehow the sign flips’). So I think literal worst-case Hell and the immediately surrounding possibilities are negligible.
I expect the extrema of most AIs, even ones with attempted alignment patches, to be weird and unlikely to be of particular value to us. The way values resolve has a lot of room to maneuver early on, before the AI becomes a coherent agent, and I don’t expect the resulting values to have extrema that are best fit by humans (see various of So8res’s other posts). Thus, I think it is unlikely that we end up with a weak dystopia (at least for a long time, which is the s-risk) relative to x-risk.
If I imagine trading extreme suffering for extreme bliss personally, I end up with ratios of 1 to 300 million – e.g., that I would accept a second of extreme suffering for ten years of extreme bliss. The ratio is highly unstable as I vary the scenarios, but the point is that I disvalue suffering many orders of magnitude more than I value bliss.
I also disvalue suffering significantly more than I value happiness (I think bliss is the wrong term to use here), but not to that level. My gut wants to dispute whether those numbers are practical, but I’ll just take them as gesturing at the comparative feeling.
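As a quick sanity check on the quoted numbers (a sketch only; the 365.25-day year is my assumption, since the quoted post does not specify one), one second of suffering against ten years of bliss does come out at roughly the stated 1 to 300 million:

```python
# Sanity check: how many seconds are there in ten years?
# Assumption: a 365.25-day year; the quoted post does not specify one.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25

seconds_in_ten_years = int(10 * DAYS_PER_YEAR * SECONDS_PER_DAY)

# One second of suffering traded against this many seconds of bliss.
print(f"1 : {seconds_in_ten_years:,}")
```

So the ‘1 to 300 million’ figure is just the one-second-for-ten-years example rounded down.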
An idea I’ve seen once, though I’m not sure where: you can probably increase the amount of happiness you experience in a utopia by a large amount.
Not through wireheading, which at least for me is undesirable, but by ‘simply’ redesigning the human mind to be less prone to the hedonic treadmill (while also not just cutting out boredom).
I think the usual way of visualizing extreme dystopias as possible futures has a problem: it is easy to compare them to the current state of humanity rather than to an actual strong utopia. I expect there’s a good amount of mind-redesign work, in the vein of some of the mind-design posts in Fun Theory but ramped up to superintelligence-level design and consideration capabilities, that would vastly increase the amount of possible happiness/Fun and make the tradeoff more balanced.
I find it plausible that suffering is just easier to cause and more impactful, even relative to strong-utopia-level enhanced minds, but I believe this does change the calculus significantly. I might not take a 50/50 coin for strong dystopia/strong utopia, but I’d maybe take a 10/90 coin. Thankfully we aren’t in that scenario, and have better odds.
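To put rough numbers on the coin comparison, here is a minimal sketch under assumptions that are mine rather than anything established in this thread: declining the coin is worth 0, strong utopia is worth +1, strong dystopia is worth minus some disvalue ratio, and I simply maximize expected value. The function name `break_even_disvalue_ratio` is made up for illustration.

```python
from fractions import Fraction


def break_even_disvalue_ratio(p_dystopia: Fraction) -> Fraction:
    """Largest ratio (badness of dystopia / goodness of utopia) at which
    flipping the coin still has non-negative expected value.

    Assumptions (illustrative only): declining the coin is worth 0,
    utopia is worth +1, dystopia is worth -ratio, and the decision is
    made by plain expected value.
    """
    if not 0 < p_dystopia < 1:
        raise ValueError("p_dystopia must be strictly between 0 and 1")
    p_utopia = 1 - p_dystopia
    # EV = p_utopia * 1 - p_dystopia * ratio >= 0
    #   <=> ratio <= p_utopia / p_dystopia
    return p_utopia / p_dystopia


# The two coins mentioned above: 50/50 and 10/90 (dystopia/utopia).
for p in (Fraction(1, 2), Fraction(1, 10)):
    ratio = break_even_disvalue_ratio(p)
    print(f"P(dystopia) = {p}: accept only if disvalue ratio <= {ratio}")
```

Under these assumptions, refusing the 50/50 coin while being tempted by the 10/90 coin corresponds to a disvalue ratio somewhere between 1 and 9, which is very far from the 1 to 300 million ratio in the quote.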
Thanks for linking that interesting post! (Haven’t finished it yet though.) Your claim is a weak one though, right? Only that you don’t expect the entire lightcone of the future to be filled with worst-case hell, or less than 95% of it? There are a bunch of different definitions of s-risk, but what I’m worried about definitely starts at a much smaller-scale level. Going by the definitions in that paper (p. 3 or 391), maybe the “astronomical suffering outcome” or the “net suffering outcome.”
I primarily mentioned it because I think people base their picture of ‘the s-risk outcome’ on basically antialigned AGI. The post has ‘AI hell’ in the title, uses comparisons between extreme suffering and extreme bliss, and calls s-risks more important than alignment. I think that makes sense to a reasonable degree if antialigned s-risk is likely or a sizable portion of weaker dystopias are likely, but not if antialigned AGI is very unlikely and weak dystopias are also unlikely overall, which is my view.
The extrema argument is why I don’t think weak dystopias are likely: unless we succeed at alignment to a notable degree, the extremes of whatever values shake out are not something that keeps humans around for very long. So I don’t expect weaker dystopias to occur either.
I expect that most AIs aren’t going to value deliberately making a notable AI hell, whether out of the whole lightcone, 5% of it, or 0.01% of it. If we make an aligned AGI and then some other AGI says ‘I will simulate a bunch of humans in torment unless you give me a planet’, then I expect our aligned AGI to use a decision theory that doesn’t give in to decision-theoretic threats, so it doesn’t give in (and thus isn’t threatened in the first place, because the other AGI gains nothing from actually simulating humans in torment).
So, while I do think weak dystopias have a noticeable chance of occurring, I still think they are quite unlikely. It grows more likely that we’ll end up in a weak dystopia as alignment progresses. If we manage to instill enough ‘caring about humans specifically’ (though I expect a lot of attempts like that to fall apart and have weird extremes when they’re optimized over!), then that raises the chances of a weak dystopia.
However, I also believe that alignment is roughly the way to solve these. Making notable progress on getting AGIs to avoid specific areas requires, I believe, more alignment progress than we currently have.
There is the class of problems where the unaligned AGI decides to simulate us to get more insight into humans, into evolved species, and into various related questions. That would most likely be bad, but I expect it not to be a significant portion of computation, nor to be run continually for a really long time. So I don’t consider that to be a notable s-risk.
I’m also not sure that I consider the astronomical suffering outcome (as it’s described in the paper) to be bad by itself.
If you have an absurd number of people and they each have some amount of suffering (e.g., it shakes out that humans prefer to keep some degree of negative reinforcement among their possible outcomes, so it remains), then that can be more suffering in terms of total magnitude, but it has the benefits of being more diffuse (people aren’t broken by a large amount of suffering in a short time) and of having less extreme individual suffering.
Obviously it would be bad to have a world with astronomical suffering concentrated on a large number of people. But that’s why I think a naive application of ‘astronomical suffering’ is incorrect: it ignores diffuse experiences, relative experiences (if 50% of people have notably bad suffering today, then a large future civilization with only 0.01% of people with notably bad suffering can still swamp that number in absolute terms, though I believe the article mentions this), and more minor suffering adding up over long periods of time.
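To make the ‘swamping’ point concrete, here is a rough sketch. The 8 billion figure for today’s population is an illustrative assumption of mine, and the 50% and 0.01% rates are the hypothetical ones from the example, not real data. The helper `break_even_population` is a made-up name.

```python
from fractions import Fraction


def break_even_population(today_population: int,
                          today_fraction: Fraction,
                          future_fraction: Fraction) -> Fraction:
    """Future population at which the absolute number of badly-off people
    equals today's number, given the fraction badly off in each case."""
    return today_population * today_fraction / future_fraction


# Hypothetical inputs: ~8 billion people today, 50% badly off now,
# 0.01% badly off in the future civilization.
TODAY_POPULATION = 8_000_000_000
threshold = break_even_population(
    TODAY_POPULATION, Fraction(1, 2), Fraction(1, 10_000)
)
print(f"Break-even future population: {int(threshold):,}")
```

Any future civilization larger than that threshold has more people with notably bad suffering in absolute terms, even though the rate is 5,000 times lower, which is the sense in which a headcount-based measure of astronomical suffering can mislead.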
(I think some of this comes from talking about things in terms of suffering versus happiness rather than negative utility versus positive utility, where zero is defined as ‘universe filled with things we don’t care about’. You can have astronomical suffering that isn’t that much negative utility because it is diffuse, lower in a relative sense, or less extreme, whereas ‘everyone is having a terrible time in this dystopia’ has both astronomical suffering and high negative utility.)