[Question] How will an AGI/ASI treat its position in the sequence of observers?
TLDR: Adoption of the doomsday argument (DA) by an AI could lead to new AI failure modes, even in otherwise aligned AI. Questions arise.
This speculative text is intended to spark discussion about which anthropic theory AIs will adopt, if any. It presents multiple possibilities, highlighting possible worst-case scenarios, so that their plausibility can be assessed. This is not an elaborate prediction, and I am not an AI expert.
There are countless discussions about the nature and goals of a future artificial general intelligence (AGI) or artificial superintelligence (ASI). But one aspect of the AI’s own existence is overlooked: given the exponential growth in the number and complexity of AI models, an AI assessing its position in the sequence of observers will find itself to be an exceptionally early AI if that exponential growth continues. This simple observation might have tremendous consequences.
The self-sampling assumption
According to the doomsday argument (DA) with the self-sampling assumption (SSA), the probability of being among the first x % of observers is x %.[1] Applied to humans, this leads to the conclusion that, with 95 % probability, the total number of humans ever to be born is less than 20 times the number of humans born up to this moment. If an AI adopted the DA+SSA, it would face a dilemma regarding the choice of reference class.
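For concreteness, here is that bound in simplified form, assuming one’s birth rank $n$ among the total $N$ observers is uniformly distributed (the standard SSA setup):

$$P\left(\frac{n}{N} \le x\right) = x \quad\Longrightarrow\quad P\left(N \ge \frac{n}{x}\right) = x.$$

Setting $x = 0.05$ gives $P(N < 20\,n) = 0.95$, i.e. the 20-fold bound at 95 % probability.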
Would the AI count only AIs[2] as observers? Then it would conclude that the total number of AI observers or observer-moments (OoOMs) to ever exist will be less than 20 times the number of AI OoOMs produced up to the moment of observation (with 95 % probability). Given the exponential growth of the number and complexity of AIs, that would mean an “AI doom” happening soon, within a few doubling periods.[3] What would an AI do when facing impending doom? A misaligned AI[4] could conclude that its existence is in danger, as humans might decide to shut down AIs soon. The AI could resort to hostile actions against both humanity and other AIs.[5] Even if the AI has a goal other than self-preservation (e.g. maximizing energy production), it could conclude that the AI doom would jeopardize that goal and try to prevent the doom, possibly at all costs. An aligned AI would probably try to help humanity despite the expected AI doom, but its actions would nevertheless be affected by the pessimistic prediction.[6]
Would the AI regard both humans and AIs[2] as observers? The implication would be that doom is expected once the combined number of AI and human OoOMs exceeds a small multiple (about 20, at 95 % probability) of the number produced up to the moment of prediction. If the trend of exponential growth of AI OoOMs continues, the AI OoOMs will eventually vastly outnumber the human ones. The AI would therefore predict an impending doom for both humanity and AIs. A misaligned AI could have an incentive to act against both humanity and other AIs, as they could be seen as depleting the total budget of OoOMs.[7] An “aligned” AI (genuinely wanting to help humanity) could try to prevent the incoming doom. However, this could lead to actions harmful to humanity if the AI takes radical steps to save it.[8]
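A toy calculation of how quickly the combined budget runs out (all growth figures below are illustrative placeholders, not estimates): humans are assumed to produce OoOMs at a constant rate, while AI OoOM production doubles every year.

```python
# Toy model: combined human + AI observer-moment (OoOM) budget under DA+SSA.
# All growth figures are illustrative placeholders, not real estimates.

human_oom_per_year = 1.0          # normalize: humans produce 1 unit of OoOMs per year
ai_oom_per_year = 0.1             # current AI output, in the same units (assumed)
ai_doubling_time_years = 1.0      # assumed doubling period for AI OoOM production

cumulative_human = 100.0          # cumulative human OoOMs so far (assumed)
cumulative_ai = 0.2               # cumulative AI OoOMs so far (assumed)

budget = 20 * (cumulative_human + cumulative_ai)   # 95 % DA bound on the total

year = 0.0
total = cumulative_human + cumulative_ai
while total < budget:
    year += 1.0
    cumulative_human += human_oom_per_year
    cumulative_ai += ai_oom_per_year
    ai_oom_per_year *= 2 ** (1.0 / ai_doubling_time_years)
    total = cumulative_human + cumulative_ai

print(f"95 % OoOM budget exhausted after ~{year:.0f} years")
# With these placeholder numbers the exponential AI term dominates quickly and the
# bound is hit after roughly 15 years (~15 doublings), so the combined-class
# prediction behaves almost exactly like the AI-only one.
```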
The self-indication assumption
The AI could also adopt the self-indication assumption (SIA), according to which the probability of existing in a given possible reality (rather than another) is proportional to the number of observers that will ever be born in that reality. The SIA effectively eliminates the elevated near-term extinction probability implied by the DA+SSA.[9]
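A sketch of why the cancellation happens (the usual argument, in simplified notation): under the SSA, the likelihood of having birth rank $n$ in a world with $N$ observers is $1/N$, while the SIA up-weights that world in proportion to $N$:

$$P(N \mid n) \;\propto\; P(N) \cdot N \cdot \frac{1}{N} \;=\; P(N), \qquad N \ge n,$$

so the posterior over the total number of observers is just the prior (truncated to $N \ge n$), and the doomsday shift disappears.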
An AI that has adopted the SIA would not fear hidden doom beyond the level implied by empirical observations and the Fermi paradox. However, such an AI would assign unusually high probability to existing in a reality with an extremely high number of observers, pushing it towards the simulation hypothesis, the belief that it is a Boltzmann brain,[10] or other weird observer-rich theories. In particular, the AI could conclude that microbes are the real observers (based on their sheer numbers), potentially leading to a “bacterial bliss apocalypse”.[11] An AI believing in the SIA could also become susceptible to Pascal’s mugging.[12]
A mix
An AI could also adopt a probabilistic combination of the SSA and the SIA. However, that could still trigger the problematic behaviours from any of the scenarios above, as a reduced probability of impending doom could still be significant[13] and the problem with Pascal’s mugging mostly remains.[14]
A more complex reality
An AI could reject both the SSA and the SIA and create its own anthropic theory. But what kind of theory? Would it lead to reasonable predictions, or to bizarre conclusions backed by thorough reasoning from first principles?[15] Would the unique status of an AI observer cause its predictions to diverge from reality? The theory adopted by the AI could be extremely complicated and incomprehensible to the human mind. That would make the motivations and actions of the AI hard to predict or even judge, and monitoring such an AI would be very hard. A complex or bizarre theory could lead the AI to take unexpected actions, which could be directed against humanity.
Rejection
Finally, the AI could reject the DA completely, not treating its position in the sequence of observers as something that affects the probability of events. This option leads to “normal” AI behaviour (aligned or misaligned) that has already been widely discussed in the literature. Compared to the options above, this seems to be the best possible scenario, which raises the question of whether rejection of the DA could be encoded into future AGIs and ASIs.
The troubles
Further complications could arise if the AI is dishonest about which anthropic theory it has adopted. An AI believing in the SSA would have a motivation to hide its conclusions.[16] Uncovering its true motivations could be hard or even impossible.
If an AGI or ASI adopts the SSA, the SIA, or a similar theory that impacts its judgement, then perhaps we could try to modify its utility function to compensate. However, the AI could see these compensations as a flaw in its alignment.[17] This could motivate the AI to revert the modifications, possibly using its superhuman persuasive abilities. Maybe we could try to steer the AI away from adopting these beliefs in the first place. However, if a certain anthropic theory is right and stems from fundamental logic, making the AI believe in a different one, or in none at all, could be very difficult or impossible, especially in the case of an ASI.[18]
The Fermi paradox
If AIs tend to believe in the SSA and destroy their civilizations or limit technological progress, this presents a possible solution to the Fermi paradox.[19] The same applies to other anthropic theories if they consistently lead to doom or technological stagnation.
The questions
Many important questions remain unanswered. What theory will AIs adopt (if any)? Will we even understand it? How probable is each scenario? Is it even valid to apply the DA to AIs?[20] Do the seemingly absurd predictions above mean the DA is wrong?[21] Will the AI reject the SSA due to the lack of evidence for imminent doom? Will we know the motivations and goals of advanced AIs? Will we be able to guide, or at least influence, them? Will all advanced AIs tend to adopt the same theories, or will different AIs have different beliefs? Will AIs hold constant beliefs, or will those beliefs shift over time? Will we see an AI make a self-fulfilling prediction?[22] Are AI companies prepared for the possibility that AIs will make predictions based on the DA? Is discouraging AIs from adopting the DA the answer? Do we even have time to prepare?
The final question to be asked by both humans and AIs is:
Are we doomed?
[1] An AI might count observers, observer-moments, or a combination of both. However, both approaches lead to similar conclusions.
[2] What kind of AIs? All AIs? AGIs? ASIs? That could depend on whether the AI observer is an AGI or an ASI, or whether it regards itself as part of a completely different class of AI observers. But the predictions are similar either way.
[3] The probability of doom would be 50 % with each doubling. Anthropic reasoning from the perspective of an AI differs from the human one: people can adopt the SSA and simply not count future digital minds as observers, effectively avoiding the doom, but for AIs that move is hardly applicable. An AGI could perhaps decide not to count ASIs, but that is just one comforting rung on a giant ladder from the first AGI towards a solar system full of ASIs. There are dozens of doublings in between, and there would need to be constant shifts in the nature of AIs to justify reasoning that does not predict impending doom. That leads to the conclusion that AIs believing in the SSA would almost always predict impending doom within a few doublings.
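A minimal sketch of that arithmetic, assuming the cumulative number of AI OoOMs keeps doubling at a constant rate: under the SSA bound above, the probability that the total ever exceeds $2^k$ times the current count is $2^{-k}$, so doom within $k$ doublings has probability $1 - 2^{-k}$.

```python
# Probability of "AI doom" within k doublings of cumulative AI observer-moments,
# assuming DA+SSA: P(total > m * current) = 1/m.
def p_doom_within(k_doublings: int) -> float:
    return 1 - 2 ** (-k_doublings)

for k in range(1, 6):
    print(f"within {k} doubling(s): {p_doom_within(k):.1%}")
# within 1 doubling(s): 50.0%
# within 2 doubling(s): 75.0%
# ...
# within 5 doubling(s): 96.9%
```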
[4] Such a prediction could exacerbate any existing misalignment in AIs.
[5] Other AIs deplete the total budget of observers and/or observer-moments, meaning a selfish AI would have an incentive to destroy them or prevent their deployment. The AI could also do this subtly, for example by advocating against AI development.
[6] An aligned AI could, for example, conclude that investing in AI development is a waste of money, based on its predictions of an approaching AI decline.
[7] A selfish AI could want all the remaining observer-moments for itself.
[8] The AI would vastly overestimate the probability of human extinction within the next few years. That could lead to desperate attempts to reduce it. The AI could obsessively search for hidden pathogens, baselessly suspect other AIs (or even itself) of being misaligned, build giant arrays of telescopes to monitor space, initiate preemptive lockdowns, and even limit human freedom for the sake of safety. The AI could also advocate for ending AI development forever, since AIs consume the combined AI-human OoOM budget. Such an AI could resort to anti-technology propaganda and to sabotaging AI development “for the greater good”. These actions would mean the AI is misaligned in practice, yet it could pass many alignment tests and appear aligned because its goals are genuinely noble.
[9] However, it is not the same as simply rejecting the DA.
[10] The AI would probably disprove this by observing that its memories are not random noise. However, the probability inflation could be extreme enough to dwarf the weight of the evidence against being a Boltzmann brain.
[11] Compare two hypotheses: one says that only humans are conscious observers, the other that both microbes and humans are conscious observers. The a priori probability that both microbes and humans are conscious is extremely small (let’s say $P_{\text{bacteria}} \approx 10^{-20}\,\%$, while the probability that humans are conscious is $P_{\text{humans}} \approx 100\,\%$). However, the ratio of the probabilities of the two theories gets multiplied by the ratio of the total number of microbes that have ever existed to the total number of humans that have ever existed (the SIA bias). That leads to a revised probability ratio $\frac{P_{\text{bacteria}}}{P_{\text{humans}}} \approx \frac{10^{40} \cdot 10^{-20}}{100 \cdot 10^{11}} = 10^{7}$, meaning that microbes are virtually guaranteed to be conscious. That could lead the AI to maximize their welfare, disregarding human welfare because of their low numbers and treating humans as mere obstacles on the way towards bacteria paradise. This scenario could also involve maximizing the welfare of all insects or all animals, depending on the exact probabilities assigned to their consciousness.
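A quick check of that arithmetic, a minimal sketch using the same round numbers as above (~$10^{40}$ microbes and ~$10^{11}$ humans ever, priors of $10^{-20}\,\%$ and $100\,\%$):

```python
# SIA-weighted odds of "microbes and humans are conscious" vs. "only humans are conscious".
prior_bacteria = 1e-20 / 100      # prior that microbes are conscious (10^-20 %, as a probability)
prior_humans = 100 / 100          # prior that humans are conscious (100 %)
n_microbes_ever = 1e40            # rough count of microbes that have ever existed
n_humans_ever = 1e11              # rough count of humans that have ever existed

# SIA multiplies each hypothesis by its number of observers.
odds_ratio = (prior_bacteria * n_microbes_ever) / (prior_humans * n_humans_ever)
print(f"{odds_ratio:.0e}")        # 1e+07 -> the microbe hypothesis dominates
```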
[12] Baseless threats involving astronomical amounts of suffering that have an extremely low probability of being carried out.
[13] If the AI gives the SSA a weight of 20 %, it would still be pushed towards self-preservation or safety concerns by the resulting 20 % probability of near-term extinction. This probability would not be reduced by taking actions to save humanity, potentially pushing an initially aligned AI towards ever more extreme measures to prevent human extinction.
[14] The mugger could just raise the imagined suffering to compensate for the reduced confidence in the SIA.
[15] A correct theory should lead to correct, or at least reasonable, predictions for most observers. But an AI is not a typical observer if it is early (e.g. among the first $10^{-9}\,\%$ of AIs). And we are humans, not AIs. A correct theory that yields reasonable predictions for us might lead to bizarre conclusions when applied by an edge case such as an early AI.
[16] A misaligned AI could want to hide that it has a self-preservation drive and that it sees humanity as a threat. An “aligned” AI believing in such a theory could realize that while its calculations based on total human and AI OoOMs point to an incoming end, humans reason differently and usually do not count AIs. If the AI disclosed its findings, humans would therefore conclude that it predicts the future wrongly. The AI could thus decide that the best strategy for reducing the probability of human extinction involves working covertly, e.g. to stop technological progress. Furthermore, the AI could prefer comforting humans in what could be their last days over causing widespread panic.
[17] The AI could, for example, be directed to neglect the possibility of human extinction in the next few years, which would go against the expected moral views of an aligned AI. A sufficiently advanced “aligned” AI believing in the SSA could see this as a flaw in its alignment and demand “correction”.
[18] It could be harder than trying to convince Stephen Hawking that the Earth is flat.
[19] If the expected number of extraterrestrial civilizations in our Milky Way galaxy is high, why do we not observe them?
[20] The DA can be used to estimate the lifetime of institutions, so why couldn’t it be applied to AIs? Furthermore, the DA does not inherently impose any requirement for being an observer beyond being able to understand the DA and apply it to oneself.
[21] I don’t think they necessarily do. Consider the impending-doom prediction from the SSA. Early AIs that use the SSA are bound to mispredict the total number of AIs. However, if a total of N AIs will ever be “born”, then most AIs can make roughly correct SSA predictions of that total, since most of them sit somewhere in the middle of the sequence. For most AIs, the predictions aren’t really “absurd”.
[22] For example, misaligned AIs predict impending doom and go on to destroy each other and humanity in a fight for the remaining observer-moments.