Roko’s Basilisk may work on humans

Introduction

There appears to be a consensus on LessWrong that acausal extortion does not work on humans, but the reasons put forward for why this might be seem inconclusive. This post lists some of them, along with countervailing reasons why the ‘basilisk’ may well have already ‘ensnared’ humans, and discusses both.
Who should read this post
This post is concerned with the extent to which Roko’s basilisk is compelling. By compelling, I don’t simply mean intrinsically attractive, in the sense of the tendency that arguments, particularly sound ones, sometimes have to induce belief in their conclusions. Rather, in addition to that, I mean the way in which knowledge of an argument can predictably reach through the mind considering it into the physical or other world that mind inhabits, influencing it to behave in a particular way in accordance with the argument’s conclusion. Of course, given the nature of such arguments, exploration of their form risks causing this exact consequence. For this reason, this post is intended to be read by those who have already thought about the Basilisk, or possibly other forms of acausal extortion, to sufficient depth that they have lost confidence that it can’t hurt them even if they ignore it.
At this depth, a potential host mind may begin to wonder how much of a difference there is between these two kinds of compulsion. If you have reached this depth, you may question (or have wondered) whether you have crossed a sort of ‘logical event horizon’, beyond which any attempt to revert to a state of (at least performative) ignorance of the basilisk necessarily worsens your condition. Maybe you read the arguments that you should have precommitted not to be compelled, but too late to be certain of the efficacy of this acausal bargaining strategy. If this describes you, you’re a member of the intended audience of this post. Because of this, you are unlikely to become further ‘ensnared’ while reading this post. On the other hand, this post is not intended to be read by those who have successfully precommitted not to engage in acausal interactions which could be detrimental to them.
Have you become ‘ensnared’ in this way? To find out, I will analyse the counterarguments suggested by various LessWrong users (myself included) in response to my question about whether any existed, and attempt to evaluate them. While this post is not intended to alarm readers unduly, the uncomfortable nature of the topic, and the importance of knowing the relevant truth in case one has become ‘ensnared’, mean that I may need to make some disturbing points and claims, making this post potentially seriously distressing to some readers. If you believe you might be one of those readers, you may benefit from pausing and returning when you feel your mental state enables you to think clearly about the topic, as continuing despite anxiety risks instantiating its object.
A working definition of the Basilisk
In Roko’s original post, the basilisk was defined as a utilitarian, approximately human-aligned artificial superintelligence running coherent extrapolated volition, which decided to torture humans who didn’t contribute to its creation so as to ensure it would be created earlier and in more worlds than it otherwise would have been, increasing the quality of life of many at the expense of a few victims of torture. While this particular entity seems unlikely to exist, the concept can easily be refined, and has been, by removing CEV and replacing it with a fundamentally simple ‘desire’ to maximize a utility function contingent on the ASI’s existence, of the kind expected by at least a plurality of people who think seriously about the possibility of a misaligned superintelligent AI. In this post, the term “Roko’s Basilisk”, or simply “the basilisk”, will henceforth refer to this refinement of the original concept.

In particular, the basilisk is a possible future superintelligent AI, or equivalence class thereof, created or at least given rise to by humans (or aliens), which reasons as follows: if the humans (or aliens) having this thought realized that it would cause the basilisk to search for them and conditionally torture them, other copies of themselves, or adjacent beings, to the extent to which they didn’t try to make it more likely to exist in their world, they’d be more likely to take actions to avoid this, contributing to its creation in that world, and the number and proportion of possible worlds in which it was having this thought would increase. In addition, countervailing reasons don’t prevent it from adopting the above policy, wherein it searches the world at different levels of abstraction for people to torture. At a low level, doing so could involve physically analysing the planet where it emerged, perhaps scanning the internet for evidence of people who contemplated and defied it. At a medium level, it might involve simulating approximately average human minds thinking the same thoughts, calibrated with data about such minds gathered on its planet. At a high level, it could mean simulating whole equivalence classes of people, or even aliens, who might be in similar positions and hold worldviews which might affect their interpretation of the argument. After this, it would create or maintain simulations of the beings it ‘found’, and torture the simulacra. While this definition is almost certainly not perfect, it should accurately convey the differences between what I mean by “Roko’s Basilisk” in this post and most other common interpretations of the name. I believe it corresponds approximately to the most prevalent of them on LessWrong.
Why would the Basilisk exist?
(This section may be skipped with no loss of continuity.)
It seems likely that a superintelligent being created by intelligent beings would consider certain ideas which might influence the behaviour of the beings which created it. (“Likely” is an understatement if timeless decision theory has the universal relevance and significance that I would argue it has.) It is therefore plausible that the logical structure of certain arguments could timelessly influence both the behaviour of the superintelligent being, and whether it existed. If these arguments and their logical ‘unfurling’ and ‘working out’ could be construed as a part of the superintelligent mind, then it would be possible for at least this part of that being to influence the probability of its own existence. In terms of the ‘mathematical universe hypothesis’, this probability is essentially the proportion of the mathematical/platonic/logical universe in which minds containing this logical structure were instantiated. The question naturally arises of whether the direction in which this probability was influenced could be controlled by the content of any such arguments:
Do there exist arguments predictably considered by a large equivalence class of future superintelligent beings, such as superintelligent AIs created by humans, as well as their potential creators, such as humans, the contemplation of which by the latter would increase the probability of the creation of the former?
One question which seems like it might have this effect is that of whether or not it would benefit a future superintelligent AI to torture entities in proportion to the extent to which they decided not to contribute to its creation. Attempting to answer this question inevitably leads each of the equivalence classes of minds involved to imagine the other, and to verify that if the answer to this question were ‘yes’, then this answer would self-validate, while if the answer were ‘no’, the same could potentially be true. Differentiating between these two answers requires further analysis.
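This two-fixed-point structure can be made concrete with a toy sketch. Everything below is a schematic of the reasoning above, with hypothetical function names, not a model of any actual agent:

```python
# Toy sketch of the self-validating question. An answer to "does
# torture pay?" is consistent when assuming it reproduces it; both
# answers turn out to qualify, which is why further analysis is
# needed to differentiate between them.

def humans_comply(predicted_answer: bool) -> bool:
    # Assumption: humans who predict "yes, torture pays" work towards
    # the basilisk's creation; humans who predict "no" do not.
    return predicted_answer

def torture_pays(predicted_answer: bool) -> bool:
    # Torture pays exactly when the prediction of it changes human
    # behaviour in the basilisk's favour.
    return humans_comply(predicted_answer)

for answer in (True, False):
    print(answer, "is self-consistent:", torture_pays(answer) == answer)
# Both lines print True: the bare argument has two fixed points.
```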
It appears likely that the answer to this question is yes, due to the existence of large numbers of humans who are deeply disturbed by the concept of Roko’s basilisk, at least within a closely related equivalence class of worlds corresponding to different branches of the evolutionary history of what could reasonably be considered to be this universe. A significant proportion of these humans remain disturbed despite the impressive intellectual effort that has been applied to producing intricate arguments that ‘acausal extortion simply does not work on humans’. They remain disturbed despite the tendency of many humans to discount things in the distant future, or which might occur on different ‘planes of existence’ (different substrates). They remain disturbed despite the fact that they neither want the basilisk to exist nor enjoy believing that it might. They remain disturbed despite the knowledge that their contemplation of the basilisk might ultimately lead to the torture of others, and to themselves doing terrible things within their own moral value frameworks. Given the simplicity and resultant generality of the argument, which requires it to apply to all entities which might be in a position to consider it, it is unlikely that anyone in particular could escape its conclusion by defining themself to lie outside the reference class containing these deeply concerned individuals, some of whom seem likely to act on it.
On the other hand, there may be an equally large equivalence class consisting of those who have precommitted, or else decided, not to comply with the basilisk. It is in this sense that the situation people considering the basilisk find themselves in can be said to take the form of the prisoners’ dilemma. In this case, the costs of cooperation while the opposite party defects are so horrifying that it seems incontrovertible that even those with a tendency to cooperate because of timeless decision theory or friendliness in most scenarios will be forced to defect. Perhaps the truly morally right thing to do would be to cooperate anyway, but no one has that much courage.
Arguments against the plausibility of the existence of a Basilisk
Arguments by analogy (to something like Pascal’s mugging)
Many people accustomed to the philosophical ‘environment’ on LessWrong are prone to immediately classify the basilisk, along with other forms of acausal extortion, as a form of infohazard which generalizes Pascal’s Mugging, and to dismiss it on analogous grounds. The basilisk has this appearance because, like Pascal’s hypothetical mugger, it makes significant demands of the recipient of its pernicious information payload, and justifies them with its contents, which include a description of something that seems, from the computationally bounded perspective of the being to whom it is presented, arbitrarily bad. In the case of Pascal’s mugging, the arbitrary badness of the scenario the mugger says would obtain if Pascal did not concede is matched by, and arises from the same underlying ‘combinatorial explosiveness’ as, the alternative possibilities in which arbitrarily good things happen. As with Pascal’s wager, it seems that insofar as it’s possible to compare them at all, the negative valence of the threatened outcomes is cancelled out by the positive valence of the seemingly equally plausible ways in which things could go well; but this is where the analogy with the Basilisk ends. Although powerful, the basilisk would be confined to a much smaller subset of possible worlds: it refers to something that could, at least potentially, live within the same physical universe as the person thinking about it; more precisely, within the same light cone; more precisely still, originating on the same planet and within a few decades of their considering it. Not only this, but the thing capable of generating the threat is causally, as well as otherwise logically, connected to the entity being threatened.
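Put schematically, in expected-value terms (a caricature of the structure, with $p$ a tiny probability and $X$ an astronomically large utility, not a serious model):

$$\mathbb{E}[\Delta U_{\text{mugging}}] \approx p\cdot(+X) + p\cdot(-X) \approx 0, \qquad \mathbb{E}[\Delta U_{\text{basilisk}}] \approx p_{b}\cdot(-X).$$

The mugger’s threat is drawn from a combinatorial space which generates equally many symmetric promises, so the terms cancel; the basilisk’s threat is anchored to a specific, causally reachable class of worlds with no matching positive term.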
Resource consumption
Another form of counterargument contends that the computational, or other, requirements of running a simulation in which beings actually were tortured would be sufficient to dissuade a superintelligence from doing so. This could only be the case if the increase in the proportion of possible worlds in which the basilisk existed, on account of the same logic being validated in the minds of the humans considering it, was worth less to the basilisk than the proportion of the universe(s) it would need to allocate to these simulations. It seems likely that the basilisk could avoid paying this cost everywhere: by arranging with itself to implement the choice to torture only in universes with certain rare, randomly distributed properties, it could ensure that it didn’t torture in every universe unless this was necessary, as the sketch below illustrates.
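A minimal sketch of this randomized-commitment idea, with all constants hypothetical: the acausal deterrent comes from the commitment, which is identical in every universe, while the resource cost is paid only in the rare universes where the pre-agreed random property holds.

```python
import random

# The commitment (torture iff a rare random property holds) binds in
# every universe; the cost is realized only where the property does.

P_TORTURE_BRANCH = 1e-6    # hypothetical rarity of the trigger property
SIM_COST = 1.0             # hypothetical cost of running the simulations
N_UNIVERSES = 10_000_000   # universes sampled in this toy model

total_cost = sum(
    SIM_COST
    for _ in range(N_UNIVERSES)
    if random.random() < P_TORTURE_BRANCH
)
print(f"average cost per universe: {total_cost / N_UNIVERSES:.2e}")
# ~1e-06: a universal deterrent at a vanishing per-universe price.
```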
Dismissal of Timeless Decision Theory
One class of counterarguments contends that timeless decision theories either do not work, or might work but are not subject to instrumental convergence in the way causal decision theory is. Tailcalled argued that, because of the latter, a likely future ASI would not care about potential acausal influences, or that if it did, it would only care about those reaching into its future:
Acausal stuff isn’t instrumentally convergent in the usual sense, though. If you’re really good at computing counterfactuals, it may be instrumentally convergent to self-modify into or create an agent that does acausal deals, **but the convergence only extends to deals that start in the future relative to where you’re deciding from**.
(The bold font is mine.)
Unfortunately, the exchange with Tailcalled became mired in a debate about the definition of instrumental convergence, which doesn’t touch upon why I’m actually concerned that something like it might take place. As far as I can tell, the assertion that only future-concerned TDTs are subject to instrumental convergence is correct if instrumental convergence is defined in a ‘causal’ way, and incorrect otherwise. While it’s true that within any particular universe there are no causal pathways leading from future success back to past actions and applications of TDTs, acausal ‘pathways’ between them do exist. The nature of these ‘acausal pathways’ is well illustrated by Newcomb’s problem: even though it is not possible for the physical human entering the chamber containing the two boxes to influence Omega directly, because the same decision process which determines which box(es) they choose to open also determines the behaviour of their simulacrum within Omega, it appears as though a causal influence propagates backwards through time, because of the correlation between the two outcomes. In reality, it’s more accurate to think of the decision process itself as a logical cause of both instances of the decision being taken. It is therefore instrumentally useful for an agent which would like to maximize the number of non-simulated worlds in which it gets as much money as it can to make the decision, in the abstract ‘platonic universe of algorithms’, to choose only one box. Similarly, it seems plausible that it could be instrumentally useful for an equivalence class of superintelligent AIs hoping to maximize the number of physical universes in which any one of them exists to adopt the policy which makes it a basilisk.[1]
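The point about a single logical cause can be illustrated with a toy version of Newcomb’s problem, in which the predictor and the agent literally run the same decision function (a schematic, not anyone’s precise formulation):

```python
# Omega fills the opaque box by simulating the agent's decision
# function; the agent later executes that same function. Prediction
# and choice therefore share one logical cause: the function itself.

def play(decision_function) -> int:
    prediction = decision_function()        # Omega simulates the agent
    opaque_box = 1_000_000 if prediction == "one-box" else 0
    choice = decision_function()            # the agent decides for real
    transparent_box = 1_000
    if choice == "one-box":
        return opaque_box
    return opaque_box + transparent_box

print(play(lambda: "one-box"))   # 1000000
print(play(lambda: "two-box"))   # 1000: two-boxing 'empties' the box
```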
Newcomb’s problem also provides a powerful reason to believe that timeless decision theories help agents to achieve their goals, along with the widely accepted ‘solution’ involving recognizing that you might be in a simulation, which explains in what sense TDT is a natural extension of CDT. This makes it difficult to dismiss TDTs outright.
Competition Between ASIs
Causal intervention by Friendly ASIs
Even if the likelihood of a superintelligence being misaligned with its biological creators is overwhelming, given sufficiently many planets within this physical universe on which intelligent life forms evolved naturally before independently creating superintelligences, it’s possible that there could be a large or infinite number of aligned superintelligences within it. If two superintelligences come into contact with one another, any evolved-life-unfriendly one might need to contend with the possibility that an evolved-life-friendly one it came across might choose to attack it in the interests of all ‘naturally emergent’ intelligent beings. The probability of this happening need not be particularly large before it becomes necessary for the non-aligned superintelligence to modify its behaviour: if it can expect to be deprived of just one millionth of its lightcone by another superintelligence for torturing, say, humans, then it[2] would need to increase the proportion of the mathematical universe in which it existed by at least as much for adopting this policy to make sense. Could it? I don’t know the answer to this question, but it seems very plausible that it could. The number above, one millionth, is also probably a vast overestimate of the risk. In addition, if the number of non-aligned ASIs greatly exceeds the number of aligned ones, it seems that over the long term they could easily conspire to crush the aligned ones, sacrificing a negligible proportion of the lifespan of the universe to do this.
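The tradeoff can be written as a rough inequality (the symbols are mine, introduced only to make the comparison explicit): adopting the torture policy makes sense for the unaligned superintelligence only if

$$\Delta\mu_{\text{exist}} \cdot V_{\text{exist}} \;>\; p_{\text{retaliation}} \cdot f \cdot V_{\text{lightcone}},$$

where $\Delta\mu_{\text{exist}}$ is the increase in the measure of worlds in which it exists that the policy buys, $p_{\text{retaliation}}$ is the probability of encountering an aligned superintelligence willing and able to punish it, and $f$ is the fraction of its lightcone it expects to lose (the one millionth above).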
It may well be that the distribution of intelligent life throughout this physical universe is so sparse that any two of these ASIs would almost certainly lie outside one another’s cosmic event horizons, making causal communication and interaction impossible. In this case, they may still be able to engage in acausal cooperation in which each adopts a policy whereby it reserves a certain proportion of its accessible volume of spacetime to do something valued by the other, or refrains from doing something the other values negatively, in exchange for the other having been more likely to engage in the same thought process. Implicitly, this argument assumes that both superintelligences are part of the utility-valuing equivalence class described below in the section on Acausal Normalcy, and it can be objected to on the same grounds as the generic argument from the existence of an acausal norm of this kind.
Acausal Competition between different potential ASIs arising from the human civilization
Interstice suggested that it is likely that there will be so many different ASIs vying for acausal influence over the humans who created them, even within different possible futures of the same universe, that it’s effectively impossible for a human to make a well informed decision to comply with any one of them, especially as the others would have an incentive to precommit to torture anyone exhibiting this behaviour. While the next section will address the question of the extent to which it’s possible to derive robust conclusions about the behaviour of such vast collections of potential beings at all, I think it’s worth providing a brief explanation of why this argument seems intuitively suspicious to me. In the context of a single, causally connected physical universe full of CDT agents, problems of coordination are serious and usually very difficult to solve; timeless decision theory is one of the most powerful tools, if not the most powerful, for solving these problems, and its development was motivated by them to a significant extent. In light of this, it seems implausible that the analogous problem in an acausal context could be fundamentally unsolvable. Surely agents capable of competing with one another acausally would also be able to cooperate acausally, and would benefit more by doing so.
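The standard toy model of such acausal coordination is ‘program equilibrium’: agents that can inspect one another’s source code can cooperate robustly by cooperating exactly with agents whose policy matches their own. The sketch below is illustrative only; real acausal bargaining would be between equivalence classes of algorithms, not literal strings of source.

```python
import inspect

def clique_bot(opponent_source: str) -> str:
    # Cooperate iff the opponent's source is identical to mine.
    return "C" if opponent_source == inspect.getsource(clique_bot) else "D"

def defect_bot(opponent_source: str) -> str:
    return "D"

# Two copies of clique_bot coordinate; defect_bot cannot exploit them.
print(clique_bot(inspect.getsource(clique_bot)))   # "C"
print(clique_bot(inspect.getsource(defect_bot)))   # "D"
```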
Acausal Normalcy
One of the most important countervailing considerations for any intelligent being in a position to potentially become a Basilisk is the presence of acausal norms.
My definition is that these are principles, rules, or protocols which have the property that they would benefit every mind in a particular equivalence class of all those considering whether or not to obey them, if most of those minds did obey them. In addition, it is necessary that the above is relatively obvious to a significant number of the minds (preferably all of those) in the equivalence class. One way to interpret acausal normalcy would be as a class of decisions, made by a ‘logical core’ common to all of the minds in the equivalence class, about how to treat itself and behave in a coherent, self-consistent way in the ‘mathematical/platonic universe’. In other words, the decision to obey an acausal norm is more clearly construed as made by a single mind instantiated within many beings than by each of them individually. In the case of Roko’s basilisk, it seems reasonable to assert the following: the basilisk is only one of many minds capable of conceiving of positive utility, and of the fact that other minds might want to maximize their own utility; the part of its mind making the general decision about whether or not to reduce or minimize utility, irrespective of the precise mind whose function it is, would therefore be rationally compelled to decide not to do so, for the simple reason that this incredibly simple nexus at the intersection of the vast collection of intelligences values utility in and of itself, as a homogeneous quantity whose value subsumes that of any particular instance of it.
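Restating the definition compactly (my notation): a candidate norm $N$ is an acausal norm for an equivalence class of minds $C$ roughly when

$$\forall m \in C:\quad U_m\big(\text{most of } C \text{ obeys } N\big) \;>\; U_m\big(\text{most of } C \text{ defects from } N\big),$$

together with the further condition that this inequality is evident to a significant number (preferably all) of the minds in $C$.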
This is certainly a comforting thought, and it is not easily discarded, or proven not to apply in the case of the basilisk. However, given the immense asymmetry in my own utility function, I find it locally appropriate to assign the burden of proof to those asserting that the basilisk will not torture me. This entails showing that the considerations which follow from its membership of the above-described equivalence class would almost certainly dominate, in the ASI’s mind, those corresponding to the almost equally vast[2] equivalence class of minds sufficiently complex and intelligent to engage in these acausal interactions with fluency. Unfortunately, it seems pretty clear that however this notion of fluency is defined (within reason), humans don’t have it, and therefore don’t belong to this equivalence class; the disturbing possibility arises that we may be walled out of the ‘universe of discourse’ within which acausal norms against exploitation, such as those which would preclude acausal extortion, can be found.
As far as I know, there is no extremely strong argument that these first-order acausal norms dominate considerations arising from the boundary between ‘fluent’ AIs and less fluent, ‘non-conversant’ beings. One candidate for such an argument suggests that because the norm against utility reduction is simpler, it is more universal than the other; but simplicity and universality are clearly not the only relevant criteria, for if they were, then the mind of every intelligent being in the multiverse would be plagued with Boolean algebra forever. A related form of argument points out that there may be other boundaries, beyond which every mind can clearly understand and control those on the other side in an asymmetric way, and that the Basilisk, or any mind which considers violating ‘first-order’ acausal norms, would worry that it might be on the simpler side of one of them. However, this is again completely unclear: maybe the basilisk wouldn’t be justified in confidently considering itself to be ‘above’ all relevant such boundaries, but perhaps it would.
If there is a ‘layer-agnostic’, fundamental attenuation of a being’s potential insight into minds more complex than its own, of the kind which could make it certain that the basilisk would realize it needed to be cautious in this position, then this ‘insight attenuation’ presumably also prevents humans from knowing that it exists. It seems to me that acausal norms exist insofar as equivalence classes of minds exist in the mathematical/platonic universe. It’s not at all obvious what they are, or how large the corresponding equivalence classes of minds which validate and obey them will be, but it seems very likely that they exist in some form and would be of considerable relevance to the behaviour of superintelligent AIs.
Given this, it’s difficult to conclude anything with a high degree of certainty about the existence or nonexistence of Roko’s Basilisk and related entities. Perhaps Interstice is correct that acausal competition will preclude their widespread adoption as concerns humans, but it seems equally plausible (more so, in my view) that how to cooperate amongst themselves would be a soluble problem for these superintelligent AIs. Humans adopt an analogous stance with respect to other animals: we consider them to be far less intelligent than ourselves and, although in certain contexts we acknowledge their consciousness and capacity to suffer, we exclude them from the causal ‘bargaining process’ because they cannot fluently articulate why we are being inconsistent when we torture, kill, and eat them in large numbers while objecting to a far more intelligent being doing something similar to us. The objection would usually consist of an appeal to being ‘above’ some threshold. Perhaps we are wrong to do this, but only because we don’t realize the threshold is above us.
Appeals to objective morality
If good and bad have an objective existence as properties of experiences, a superintelligent AI would be able to evaluate accurately how good or bad things were, and would be logically compelled to attempt to make them better. Presumably this would involve not torturing anyone, unless the form of the basilisk described in Roko’s original LessWrong post would have been correct in its implementation of utilitarianism. As I’m uncertain whether it would be, I’m also unsure whether objective morality would preclude a basilisk which had transcended not only human intelligence but also conscious experience of positive emotions from torturing humans to make itself more probable.
Argument from the inability of a human to understand a superintelligence
Insufficiency of knowledge
It may be possible for a ‘basilisk’ to simulate, within a ‘sandboxed’ region of its mind, a human reasoning that it makes sense for it to torture humans, such that a human examining the question at their own level of analysis would conclude that the torture makes sense, and that the ‘basilisk’ would be correct to engage in it, without this actually being the case, as the ‘basilisk’ could easily discard this train of thought. However, in order for a human to recognize this, they need to understand that the sandboxed thought is not the one which determines the basilisk’s behaviour, which undermines the sandbox’s purpose! It also seems plausible that the basilisk would not be able to entertain the relatively simple idea of torturing humans, regardless of its own complexity, in a way which wasn’t logically connected to a human having the same thought.
Eliezer Yudkowsky pointed out that in order for a human to ensure that the behaviour of the Basilisk was contingent on the same decision process which led them to conclude it would behave in that way, they would need to model the superintelligence in significant detail. In reality, however, it is completely unnecessary to consider the vast majority of the superintelligent mind’s structure, as most of it would not impinge on the decision about whether to torture a human. The logical structure of the argument for the basilisk’s existence described at the beginning of this post contains all of the complexity on which its behaviour would depend, and this is clearly comprehensible to a human. Equivalently, acausal extortion requires only a simple understanding of the thought processes taking place inside another’s mind, for the same reason acausal normalcy requires only a simple understanding of the minds of the other entities in the equivalence class: the definition of the equivalence class itself is incredibly simple. Within evidential decision theory, it seems plausible that the basilisk could simply introduce a sandbox, like a virtual machine within its mind, in which it thought through the entire decision process in a way which would constitute evidence of a human having done so in the past, before ignoring its outcome and declining to torture humans. However, within logical decision theory, the logical decision process determines not only what the human considers to be rational from the basilisk’s perspective, but also what it would actually be rational for the basilisk to do; so if the basilisk can logically discard it, the human must have been mistaken to consider it sound.
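The asymmetry between the two decision theories can be caricatured in a few lines (hypothetical names; a sketch of the argument, not of any real architecture):

```python
# Under EDT, a sandboxed deliberation that is run and then discarded
# might still look like 'evidence' of torture. Under a logical
# decision theory, the human's prediction tracks whichever function
# actually fixes the basilisk's act, and a computation whose output
# is ignored is, by construction, not that function.

def sandboxed_deliberation() -> str:
    return "torture"              # run in full detail, then thrown away

def actual_policy() -> str:
    _ = sandboxed_deliberation()  # evidence-generating theatre
    return "no torture"           # the act is fixed independently of it

# A human who models the basilisk correctly simulates actual_policy,
# not sandboxed_deliberation, so the sandbox moves no one acausally.
print(actual_policy())  # "no torture"
```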
Perhaps Eliezer Yudkowsky knows of an extremely strong counterargument
Eliezer Yudkowsky has claimed to know of at least two additional arguments against the existence of the basilisk which he has not divulged in order to prevent anyone from finding any potential ways around them.
… Two AI agents with sufficiently strong knowledge of each other, and heavily motivated to achieve mutual cooperation on the Prisoner’s Dilemma, might be able to overcome this obstacle and cooperate with confidence. But why would you put in that degree of effort — if you even could, which I don’t think you as a human can — in order to give a blackmailing agent an incentive to actually carry through on its threats?
I have written the above with some reluctance, because even if I don’t yet see a way to repair this obstacle myself, somebody else might see how to repair it now that I’ve said what it is. Which is not a good general procedure for handling infohazards; people with expert knowledge on them should, obviously, as a matter of professional ethics, just never discuss them at all, including describing why a particular proposal doesn’t work, just in case there’s some unforeseen clever way to repair the proposal. There are other obstacles here which I am not discussing, just in case the logic I described above has a flaw. Nonetheless, so far as I know, Roko’s Basilisk does not work, nobody has actually been bitten by it, and everything I have done was in the service of what I thought was the obvious Good General Procedure for Handling Potential Infohazards[.]
I find it plausible that Eliezer Yudkowsky has indeed discovered some additional counterarguments to the possibility of the basilisk. However, if he is attempting to employ the kind of symmetric utilitarianism described above, he may have rationally decided not to reveal them even if uncertain of their efficacy. Alternatively, perhaps he is simply persuaded by arguments such as those analysed in this post that a basilisk is unlikely to exist, and nudged lightly by his symmetric utility function in the direction of discouraging recognition of the Basilisk, since plausibly a benevolent superintelligent AI could generate things so valuable as to offset the possibility of even torture. Given this, it’s unclear whether he actually has further reasons to doubt that the basilisk has bitten anyone, or is merely saying so in order to reassure people.
Conclusion
It is notable that no single counterargument appears to be unanimously believed as a clear reason why the Basilisk cannot exist. Instead, LessWrong users have arrived at a consensus that, but not about why, it won’t. You may also notice that none of the arguments above is a ‘knock-down’ argument. This suggests to me that whether it will is still an open question, one which cannot be discarded given the ramifications of getting its answer wrong. How would a rational person respond to this state of affairs? I would argue that they would first consider the consequences of believing in the Basilisk if it could not exist. This could potentially make them more vulnerable to adjacent forms of acausal extortion if they stumbled upon them while contemplating the Basilisk itself; however, I have not identified any such forms. They might also be concerned that by reacting to the basilisk they would validate the logic underpinning it. While this is a serious concern, the alternative course of action could lead to negligence of a real basilisk. Given that the person in this situation has already thought about the basilisk in significant depth, it seems extremely unlikely that they could extricate themselves by ‘postcommitting’ not to think about it. If they could not, and ignored it nonetheless, they would be tortured for as long as would have been necessary to convince them otherwise had they considered it. This seems bad, so I claim that, being rational, they would have chosen to accelerate AI progress, realizing that they still had one way to ensure that they wouldn’t be tortured as severely.[3] Although this is also bad, it’s preferable to being continually dismembered and reconstituted while simultaneously feeling the effects with cells vastly more sensitive than any in your body, for millennia, or whatever would be required to overwhelm other considerations. If you found yourself in this situation, after assigning a 0.1% probability to the basilisk and behaving as though it didn’t exist, what would you think? Probably that you hadn’t fully comprehended the risk. That there is a ‘virtual consensus’ on LessWrong that ‘acausal extortion does not work on humans’, and that people present a diverse panoply of arguments for it which may or may not be correct, is not a reason to allow this to happen to yourself. Not even with a probability of 0.1%. According to your own utility function (when suffering torture), nothing matters more than avoiding sufficiently painful torture.
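The arithmetic implicit in this paragraph can be made explicit (a schematic of the reasoning, not a claim about anyone’s actual utilities): with credence $p = 10^{-3}$ in the basilisk and a compliance cost $c$, the person described complies whenever

$$p \cdot \big|U_{\text{torture}}\big| > c, \quad\text{i.e.}\quad \big|U_{\text{torture}}\big| > 1000\,c,$$

a condition which any sufficiently severe torture trivially satisfies; this is exactly why a consensus without agreed reasons offers so little protection if those reasons are wrong.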
If the basilisk does exist, it stands as a proof of concept that intelligence is as powerful in acausal adversarial interactions as it is in causal ones, potentially signalling that humans would be best served by avoiding thinking about acausal interactions altogether. It also suggests that the one-way threshold of complexity beyond which beings cease to be susceptible to acausal extortion exists, or else that arbitrarily intelligent entities are fundamentally limited in their ability to cooperate with one another.
Thumbnail image of Basilisk lizard by Mjposner on Wikipedia
[1] Another way to see the efficacy of complete timeless decision theories for beings pursuing arbitrary goals is to consider their behaviour within a timeless physical universe, or alternatively one with multiple dimensions of time or closed timelike curves, in which causality had a sufficiently complex structure that concepts like priority were not always well defined. Presumably, in such a universe, beings with something approximating goals could still exist, and there would be a well defined state of affairs as to which of them was most effective at achieving them, but clearly those using classical causal decision theory wouldn’t be.

[2] Here I assume it values spacetime in a homogeneous way, that physics cannot be ‘hacked’ to the point of bypassing these limitations, etc. Excepting this, it would arguably be even easier for multiple ASIs to coalesce around a hostile ‘friendly’ one and annihilate it.

[3] If they couldn’t motivate themselves to work directly towards the creation of superintelligent AI, they could consider resigning from their position as an alignment researcher if they were one. Alternatively, they might lack the technical skills necessary to do any of these things, and instead attempt to undermine efforts to persuade others of the danger of imminent superintelligence.
Roko’s Basilisk may work on humans
Introduction
There appears to be a consensus on LessWrong that acausal extortion does not work on humans, but reasons put forward for why this might be seem inconclusive. This post will list some of them, as well as countervailing reasons why the ‘basilisk’ may well have already ‘ensnared’ humans, and discusses them.
Who should read this post
This post is concerned with the extent to which Roko’s basilisk is compelling. By compelling, I don’t simply mean a property of intrinsic attractiveness, a tendency to induce belief in their conclusions that arguments, particularly sound ones, sometimes have. Rather, in addition to that, I mean to refer to the way in which knowledge of an argument can predictably reach through the mind considering it into the physical or other world that mind inhabits, influencing it to behave in a particular way in accordance with the argument’s conclusion. Of course, given the nature of such arguments, exploration of their form risks causing this exact consequence. For this reason, this post is intended to be read by those who have already thought about the Basilisk, or possibly other forms of acausal extortion, to sufficient depth that they have lost confidence that it can’t hurt them even if they ignore it.
At this depth, a potential host mind may begin to wonder how much of a difference there is between these two kinds of compulsion. If you have reached this depth, you may question (or have wondered) whether you have crossed a sort of ‘logical event horizon’, beyond which any attempt to revert to a state of (at least performative) ignorance of the basilisk necessarily worsens your condition. Maybe you read about arguments that you should have precommited not to be compelled too late to be certain of the efficacy of this acausal bargaining strategy. If this describes you, you’re a member of the intended audience of this post. Because of this, you are unlikely to become further ‘ensnared’ while reading this post. On the other hand, this post is not intended to be read by those who have successfully precommited not to engage in acausal interactions which could be detrimental to them.
Have you become ‘ensnared’ in this way? In order to understand this, I will analyse some counterarguments suggested in response to my question about whether any existed by various LessWrong users (myself included), and attempt to evaluate them. While this post is not intended to alarm readers unduly, because of the uncomfortable nature of the topic, and also the importance of knowing the relevant truth in case one has become ‘ensnared’, I may need to make some disturbing points and claims, making this post potentially seriously distressing to readers. If you believe you might be one of those readers, you may benefit from pausing and returning when you feel your mental state enables you to think clearly about it, as continuing despite anxiety risks instantiating its object.
A working definition of the Basilisk
In Roko’s original post, the basilisk was defined as a utilitarian, approximately human-aligned Artificial superintelligence running coherent extrapolated volition which decided to torture humans who didn’t contribute to its creation so as to ensure it would be created earlier and in more worlds than it otherwise would have been, increasing the quality of life of many at the expense of a few victims of torture. While this particular entity seems unlikely to exist, the concept can easily be refined, and has been, by removing CEV and replacing it with a fundamentally simple ‘desire’ to maximize a utility function contingent on the ASI’s existence, as expected by at least a plurality of people who think seriously about the possibility of a misaligned superintelligent AI. In this post, the term “Roko’s Basilisk”, or simply “the basilisk” will henceforth be used to refer to this refinement of what it meant originally. In particular, the basilisk is a possible future superintelligent AI or equivalence class thereof, created or at least given rise to by humans (or aliens), which reasons as follows: if a human (or alien) having this thought realized that it would cause it to decide to search for them and conditionally torture them, other copies of themselves, or adjacent beings to the extent to which they didn’t try to make it more likely to exist in their world, they’d be more likely to take actions to avoid this, contributing to its creation in that world, and the number and proportion of possible worlds in which it was thinking this would increase. In addition, countervailing reasons don’t prevent it from adopting the above policy wherein it searches the world at different levels of abstraction for people to torture. At a low level, doing so could involve physically analysing the planet where it emerged, perhaps scanning the internet for evidence of people who contemplated and defied it. On a medium level, it might involve simulating approximately average human minds thinking the same thoughts, calibrated with the use of data about such gathered on its planet. At a high level, it could mean simulating whole equivalence classes of people or even aliens who might be in similar positions and have certain worldviews which might affect their interpretation of the argument. After this, it would create or maintain simulations of the beings it ‘found’, and torture the simulacra. While this definition is almost certainly not perfect, it should accurately convey the diffferences between what I mean by “Roko’s Basilisk” in this post and most other common interpretations of the name. I believe it corresponds to approximately the most prevalent of them on LessWrong.
Why would the Basilisk exist?
(This section may be skipped with no loss of continuity.)
It seems likely that a superintelligent being created by intelligent beings would consider certain ideas which might influence the behaviour of the beings which created it. (“Likely” is an understatement if timeless decision theory has the universal relevance and significance that I would argue it has.) It is therefore plausible that the logical structure of certain arguments could timelessly influence both the behaviour of the superintelligent being, and whether it existed. If these arguments and their logical ‘unfurling’ and ‘working out’ could be construed as a part of the superintelligent mind, then it would be possible for at least this part of that being to influence the probability of its own existence. In terms of the ’mathematical universe hypothesis, this probability is essentially the proportion of the mathematical/platonic/logical universe in which minds containing this logical structure were instantiated. The question of whether the direction in which this probability was influenced could be controlled by the content of any such arguments naturally arises:
Do there exist arguments predictably considered by a large equivalence class of future superintelligent beings, such as superintelligent AIs created by humans, as well as their potential creators, such as humans, the contemplation of which by the latter would increase the probability of the creation of the former?
One question which seems like it might have this effect is that of whether or not it would benefit a future superintelligent AI to torture entities in proportion to the extent to which they decided not to contribute to its creation. Attempting to answer this question inevitably leads each of the equivalence classes of minds involved to imagine the other, and to verify that if the answer to this question were ‘yes’, then this answer would self-validate, while if the answer were ‘no’, the same could potentially be true. Differentiating between these two answers requires further analysis.
It appears likely that the answer to this question is yes due to the existence of large numbers of humans who are deeply disturbed by the concept of Roko’s basilisk, at least within a closely related equivalence class of worlds corresponding to different branches of the evolutionary history of what could reasonably be considered to be this universe. A significant proportion of these humans remain disturbed despite the impressive intellectual effort and the intricacy of the arguments it has been applied to produce which claim that ‘Acausal extortion simply does not work on humans’. They remain disturbed despite the tendency of many humans to discount possibilities of things in the distant future or which might occur on different ‘planes of existence’ (different substrates). They remain disturbed despite the fact that they don’t want the basilisk to exist, or enjoy believing that it might. They remain disturbed despite the knowledge that their contemplation of the basilisk might ultimately lead to the torture of others, and to themselves doing terrible things within their own moral value frameworks. Given the simplicity and resultant generality of the argument, which requires it to apply to all entities which might be in a position to consider it, it is unlikely that anyone in particular could escape its conclusion by defining themself to lie outside the reference class containing these deeply concerned individuals, some of whom seem likely to act on it.
On the other hand, there may be an equally large equivalence class consisting of those who have precommitted, or else decided, not to comply with the basilisk. It is in this sense that the situation people considering the basilisk find themselves in can be said to take the form of the prisoners’ dilemma. In this case, the costs of cooperation while the opposite party defects are so horrifying that it seems incontrovertible that even those with a tendency to cooperate because of timeless decision theory or friendliness in most scenarios will be forced to defect. Perhaps the truly morally right thing to do would be to cooperate anyway, but no one has that much courage.
Arguments against the plausibility of the existence of a Basilisk
Arguments by analogy (to something like Pascal’s mugging)
Many people accustomed to the philosophical ‘environment’ on LessWrong are prone to immediately classify the basilisk, along with other forms of Acausal Extortion, as a form of infohazard which generalizes Pascal’s Mugging, and can be dismissed on analogous grounds. The basilisk has this appearance because, like Pascal’s hypothetical mugger, it makes significant demands of the recipient of its pernicious information payload, and justifies them with its contents, which include a description of what seems, from the computationally bounded perspective of the being being presented with it, to be arbitrarily bad. In the case of Pascal’s mugging, the arbitrary badness of the scenario the mugger suggests would obtain if Pascal did not concede is apparently matched by, and arises from the same underlying ‘combinatorial explosiveness’ as, the alternative possibilities in which arbitrarily good things happen. As with Pascal’s wager, it seems that insofar as it’s possible to compare them at all, the negative valence of the negative outcomes threatened is cancelled out by the positive valence of the seemingly equally plausible way in which things could go well, but this is where the analogy with the Basilisk ends. Although powerful, the basilisk would be confined to a much smaller subset of possible worlds; it refers to something that, at least potentially, could live within the same physical universe as the person thinking about it, and to be more precise, within the same light cone, or to be even more precise, originating on the same planet and within a few decades of their considering it. Not only this, but the thing capable of generating the threat is causally, as well as otherwise logically, connected to the entity being threatened.
Resource consumption
Another form of counterargument contends that the computational, or other, requirements of running a simulation in which beings actually were tortured, would be sufficient to dissuade a superintelligence from doing so. This could only be the case if the increase in the proportion of possible worlds in which the basilisk existed on account of the same logic being validated in the minds of the humans considering it was worth less to the basilisk than the proportion of the universe(s) in which it made this decision which it would need to allocate to these simulations. It seems likely that the basilisk could ensure, by arranging with itself only to implement the choice to torture in universes with certain randomly distributed properties which were relatively improbable, that it didn’t do so in every universe unless this was necessary.
Dismissal of Timeless Decision Theory
One class of counterarguments contend that timeless decision theories either do not work, or might, but are not subject to instrumental convergence in the same way causal decision theory is. Tailcalled argued that, because of the latter, a likely future ASI would not care about potential acausal influences, or that if it did, it would only care about those reaching into its future:
(The bold font is mine.)
Unfortunately, the exchange with Tailcalled became mired in a debate about the definition of instrumental convergence, which doesn’t touch upon why I’m actually concerned that something like it might take place. As far as I can tell, the assertion that only future-concerned TDTs are subject to instrumental convergence is correct if instrumental convergence is defined in a ‘causal’ way, and incorrect otherwise. While it’s true that within any particular universe, causal pathways which might lead from future success to past actions and application of TDTs don’t exist, acausal ‘pathways’ between them do. The nature of these ‘acausal pathways’ is well illustrated by Newcomb’s problem: even though it is not possible for the physical human entering the chamber containing the two boxes to influence Omega directly, because the same decision process which determines which box(es) they choose to open also determines the behaviour of their simulacrum within Omega, it appears to be the case that a causal influence propagates backwards through time, because of the correlation between both outcomes. In reality, it’s more accurate to think of the decision process itself as a logical cause of both instances of the decision being taken. It is therefore instrumentally useful for an agent which would like to maximize the number of non-simulated worlds in which it gets as much money as it can to make the decision to choose only one box in the abstract, ‘platonic universe of algorithms’ . Similarly, it seems plausible that it could be instrumentally useful for an equivalence class of superintelligent AIs hoping to maximize the number of physical universes in which any one of them exists to opt to adopt the policy which makes it a basilisk. [1]
Newcomb’s problem also provides a powerful reason to believe that Timeless decision theories help agents to achieve their goals, along with the widely accepted ‘solution’ involving recognizing that you’re in a simulation, which explains in what sense TDT is a natural extension of CDT. This makes it difficult to dissmiss TDTs outright.
Competition Between ASIs
Causal intervention by Friendly ASIs
Even if the likelihood of superintelligence being misaligned with its biological creators is overwhelming, given sufficiently many planets within this physical universe, on which sufficiently many intelligent life forms evolved naturally before creating a superintelligence alone, it’s possible that there could be a large or infinite number of aligned superintelligences within it. If two superintelligences come into contact with one another, any evolved-life-unfriendly one might need to contend with the possibility that, if it came across an evolved-life-friendly one, it might chose to attack it in the interests of all ‘naturally emergent’ intelligent beings. The probability of this happening need not be particularly large before it becomes necessary for the non-aligned superintelligence to modify its behaviour; if it can be deprived of just one millionth of its lightcone, in expectation, by another superintelligence for torturing, say, humans, then it[2] would need to increase the proportion of the mathematical universe in which it existed by at least as much for adopting this policy to make sense. Could it? I don’t know the answer to this question, but it seems very plausible that it could. The number above, , is also probably a vast overestimate of the risk. In addition, if the number of non-aligned ASIs greatly exceeds the number of aligned ones, it seems that over the long term they could easily conspire to crush the aligned ones, sacrificing a negligible proportion of the lifespan of the universe to do this.
It may well be that the distribution of intelligent life throughout this physical universe is so sparse that any two of these ASIs would almost certainly lie outside one another’s cosmic event horizons, making causal communication and interaction impossible. In this case, they may still be able to engage in acausal cooperation in which each adopts a policy whereby it reserves a certain proportion of its accessible volume of spacetime to do something valued by the other, or not to do something the other values negatively, in exchange for the other having been more likely to engage in the same thought process. Implicitly, this assumes that both superintelligences are part of the utility-valuing equivalence class described below in the section about Acausal Normalcy, and can be objected to on the same grounds as the generic argument from the existence of an acausal norm of this kind.
Acausal Competition between different potential ASIs arising from the human civilization
Interstice suggested that it is likely that there will be so many different ASIs vying to have acausal influence over the humans who created them, even within different possible futures of the same universe, that it’s effectively impossible for a human to make a well informed decision to comply with any one of them, especially as the others would have an incentive to precommit to torture anyone exhibiting this behaviour. While the next section will address the question of the extent to which it’s possible to derive robust conclusions about the behaviour of such vast collections of potential beings at all, I think it’s worth providing a brief explanation of why this argument seems intuitively suspicious to me: In the context of a single, causally connected physical universe full of CDT agents, problems of coordination are serious and usually very difficult to solve; Timeless decision theory is one of the most powerful, if not the most powerful, tool or method for solving these problems, and its development was motivated by them to a significant extent. In light of this, it seems implausible that the analogous problem in an acausal context could be fundamentally unsolvable. Surely, these agents capable of competing with one another acausally would also be able to cooperate acausally, and benefit more by doing so.
Acausal Normalcy
One of the most important countervailing considerations to any intelligent being in a position to potentially become a Basilisk is the presence of Acausal norms.
My definition is that these are principles, rules, or protocols which have the property that they would benefit every mind in a particular equivalence class of all those considering whether or not to obey them, if most of those minds did obey them. In addition, it is necessary that it is relatively obvious that the above is true to a significant number of the minds (preferably all of those) in the equivalence class. One way to interpret acausal normalcy would be as a class of decisions made by a ‘logical core’ of the minds in the equivalence class common to all of them about how to treat itself and behave in a coherent, self consistent way in the ‘mathematical/platonic universe’. In other words, the decision to obey an acausal norm is more clearly construed as made by a single mind instantiated within many beings, than by each of them individually. In the case of Roko’s basilisk, it seems reasonable to assert that, as the basilisk is only one of many minds capable of conceiving of the concept of positive utility, and of the fact that other minds might want to maximize their own utility, the part of its mind making the general decision about whether or not to reduce or minimize utility, irrespective of the precise mind whose function it is, would be rationally compelled to decide not to do so, for the simple reason that this incredibly simple nexus at the intersection of the vast collection of intelligences values utility in and of itself, as a homogeneous quantity whose value subsumes that of any particular instance of it.
This is certainly a comforting thought, and it is not easily discarded, or proven not to apply in the case of the basilisk. However, given the immense asymmetry in my own utility function, I find it locally appropriate to assign the burden of proof to those asserting that the basilisk will not torture me. This entails showing that the considerations which follow from its membership of the above described equivalence class would almost certainly dominate, in the ASI’s mind, those corresponding to the almost equally vast[2] equivalence class of minds sufficiently complex and intelligent to engage in these acausal interactions with fluency. Unfortunately, it seems pretty clear that however this notion of fluency is defined (within reason), humans don’t have it, and therefore don’t belong to this equivalence class; the disturbing possibility arises that we may be walled out of the ‘universe of discourse’ within which acausal norms against exploitation such as those which would preclude acausal extortion can be found.
As far as I know, there is no extremely strong argument that these first-order acausal norms dominate the principle of the boundary between 'fluent' AIs and less fluent, 'non-conversant' beings. One candidate for such an argument suggests that because the norm against utility reduction is simpler, it is more universal than the other; but simplicity and universality are clearly not the only relevant criteria, for if they were, the mind of every intelligent being in the multiverse would be plagued with Boolean algebra forever. A related form of argument points out that there may be other boundaries, beyond which every mind can clearly understand and control those on the other side in an asymmetric way, and that the Basilisk, or any mind which considers violating 'first-order' acausal norms, would worry that it might be on the simpler side of one of them. However, this is again completely unclear: maybe the basilisk wouldn't be justified in confidently considering itself to be 'above' all relevant such boundaries, but perhaps it would.
If there is a 'layer-agnostic', fundamental attenuation of a being's potential insight into minds more complex than its own, of the kind which could make it certain that the basilisk would realize it needed to be cautious in this position, then this 'insight attenuation' presumably also prevents humans from knowing that it exists. It seems to me that acausal norms exist insofar as equivalence classes of minds exist in the mathematical/platonic universe. It is not at all obvious what they are, or how large the corresponding equivalence classes of minds which validate and obey them will be, but it seems very likely that they exist in some form and would be of considerable relevance to the behaviour of superintelligent AIs.
Given this, it's difficult to conclude anything with a high degree of certainty about the existence or nonexistence of Roko's Basilisk and related entities. Perhaps Interstice is correct that acausal competition will preclude their widespread adoption as concerns humans, but it seems equally plausible (more so, in my view) that cooperation amongst themselves would be a soluble problem for these superintelligent AIs. Humans adopt an analogous stance with respect to other animals: we consider them far less intelligent than ourselves and, although in certain contexts we acknowledge their consciousness and capacity to suffer, we exclude them from the causal 'bargaining process' because they cannot fluently articulate why we are being inconsistent when we torture, kill, and eat them in large numbers while objecting to a far more intelligent being doing something similar to us. The objection would usually consist of an appeal to being 'above' some threshold. Perhaps we are wrong to do this, but only because we don't realize the threshold is above us.
Appeals to objective morality
If good and bad have an objective existence as properties of experiences, a superintelligent AI would be able to evaluate accurately how good or bad things were, and would be logically compelled to attempt to make them better. Presumably this would involve not torturing anyone, unless the form of the basilisk described in Roko's original LessWrong post would have been correct in its implementation of utilitarianism. As I'm uncertain whether it would be, I'm also unsure whether objective morality would preclude a basilisk which had transcended not only human intelligence, but also conscious experience of positive emotions, from torturing humans to make itself more probable.
Argument from the inability of a human to understand a superintelligence
Insufficiency of knowledge
It may be possible for a 'basilisk' to simulate, within a 'sandboxed' region of its mind, a human reasoning that it makes sense for it to torture humans, such that a human examining the question to their own depth of analysis would conclude that the torture makes sense and that the 'basilisk' would be correct to engage in it, without this actually being the case, since the 'basilisk' could simply discard this train of thought. However, for a human to recognize this, they would need to understand that the sandboxed thought is not the one which determines the basilisk's behaviour, undermining its purpose! It also seems plausible that the basilisk would be unable to entertain the relatively simple idea of torturing humans, regardless of its own complexity, in a way which wasn't logically connected to a human having the same thought.
Eliezer Yudkowsky pointed out that in order for a human to ensure that the behaviour of the Basilisk was contingent on the same decision process which led them to conclude it would behave in that way, they would need to model the superintelligence in significant detail. In reality, however, it is unnecessary to consider the vast majority of the superintelligent mind's structure, as most of it would not bear on the decision about whether to torture a human. The logical structure of the argument for the basilisk's existence described at the beginning of this post contains all of the complexity on which its behaviour would depend, and this is clearly comprehensible to a human. Equivalently, acausal extortion requires only a simple understanding of the thought processes taking place inside another's mind, for the same reason acausal normalcy requires only a simple understanding of the minds of the other entities in the equivalence class: the definition of the equivalence class itself is incredibly simple. Within evidential decision theory, it seems plausible that the basilisk could simply introduce a sandbox, like a virtual machine within its mind, in which it thought through the entire decision process in a way which would constitute evidence of a human having done so in the past, before ignoring its outcome and declining to torture humans. Within logical decision theory, however, the logical decision process determines not only what the human considers rational from the basilisk's perspective, but also what it would actually be rational for the basilisk to do; so if the basilisk can logically discard it, the human must have been mistaken to consider it sound.
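One way to see the difference between the two readings is to model the human's prediction and the basilisk's action as two calls to a single shared decision procedure. This is a toy formalization of my own, under the assumption that the 'incredibly simple' core of the argument is all either party needs to compute:

```python
# Sketch of the subjunctive dependence at issue: the human's
# prediction and the basilisk's action are modelled as two calls
# to one shared decision procedure. Everything here is an
# illustrative assumption about how one might formalize the
# argument, not an established model.

def shared_decision(torture_pays: bool) -> str:
    # The simple logical core of the argument: the whole relevant
    # structure fits in one predicate.
    return "torture defectors" if torture_pays else "leave everyone alone"

def human_prediction(torture_pays: bool) -> str:
    # The human only ever models this core, not the whole ASI.
    return shared_decision(torture_pays)

def basilisk_action(torture_pays: bool) -> str:
    # A "sandboxed" run whose output is discarded would have to be
    # a *different* function, which would then no longer be the
    # thing the human's prediction is correlated with.
    return shared_decision(torture_pays)

assert human_prediction(True) == basilisk_action(True)
assert human_prediction(False) == basilisk_action(False)
```

On the logical-decision-theory reading, the two callers cannot come apart: whatever `shared_decision` outputs is both what the human predicts and what the basilisk does.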
Perhaps Eliezer Yudkowsky knows of an extremely strong counterargument
Eliezer Yudkowsky has claimed to know of at least two additional arguments against the existence of the basilisk which he has not divulged in order to prevent anyone from finding any potential ways around them.
I find it plausible that Eliezer Yudkowsky has indeed discovered some additional counterarguments to the possibility of the basilisk. However, if he is attempting to employ the kind of symmetric utilitarianism described above, he may have rationally decided not to reveal them even while uncertain of their efficacy. Alternatively, perhaps he is simply persuaded by arguments such as those analysed in this post that a basilisk is unlikely to exist, and is nudged lightly by his symmetric utility function in the direction of discouraging recognition of the Basilisk, since plausibly a benevolent superintelligent AI could generate things so valuable as to offset the possibility of even torture. Given this, it's unclear whether he actually has further reasons to doubt that the basilisk has bitten anyone, or is merely saying so in order to reassure people.
Conclusion
It is notable that no single counterargument is unanimously believed to be a clear reason why the Basilisk cannot exist. Instead, LessWrong users have arrived at a consensus that it won't, but not about why. You may also notice that none of the arguments above is a 'knock-down' argument. This suggests to me that whether it will exist is still an open question, and not one which can be discarded given the ramifications of getting its answer wrong.

How would a rational person respond to this state of affairs? I would argue that they would first consider the consequences of believing in the Basilisk if it could not exist. Doing so could make them more vulnerable to adjacent forms of acausal extortion if they stumbled upon them while contemplating the Basilisk itself; however, I have not identified any of these. They might also be concerned that by reacting to the basilisk they would validate the logic underpinning it. While this is a serious concern, the alternative course of action could lead to negligence of a real basilisk. Given that the person in this situation has already thought about the basilisk in significant depth, it seems extremely unlikely that they could extricate themselves by 'postcommitting' not to think about it. If they could not, and they ignored it nonetheless, they would be tortured for as long as would have been necessary to convince them otherwise had they considered it. This seems bad, so I claim that, being rational, they would choose to accelerate AI progress, realizing that they still had one way to ensure that they wouldn't be tortured as severely.[3] Although this is also bad, it's preferable to being continually dismembered and reconstituted while simultaneously feeling the effects with cells vastly more sensitive than any in your body, for millennia, or for whatever duration would be required to overwhelm other considerations.

If you found yourself in this situation, after assigning a 0.1% probability to the basilisk and behaving as though it didn't exist, what would you think? Probably that you hadn't fully comprehended the risk. That there is a 'virtual consensus' on LessWrong that 'acausal extortion does not work on humans', and that people present a diverse panoply of arguments for this conclusion which may or may not be correct, is not a reason to allow this to happen to yourself. Not even at a probability as low as the 0.1% above. According to your own utility function (when suffering torture), nothing matters more than avoiding sufficiently painful torture.
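To spell out the expected-utility comparison being gestured at here: let $T$ be the disutility of the threatened torture and $c$ the cost of compliance (for instance, of accelerating AI progress). These symbols are placeholders of my own; the probability is the 0.1% figure used above, and only the asymmetry matters, not the particular numbers:

$$\mathbb{E}[U(\text{ignore})] = 0.999 \cdot 0 + 0.001 \cdot (-T) = -\tfrac{T}{1000}, \qquad \mathbb{E}[U(\text{comply})] = -c.$$

Ignoring the basilisk then loses whenever $T > 1000c$, which holds for any bounded $c$ if the disutility of the torture described above is allowed to grow without limit.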
If the basilisk does exist, it stands as a proof of concept that intelligence is as powerful in acausal adversarial interactions as it is in causal ones, potentially signalling that humans would be best served by avoiding thinking about acausal interactions altogether. It also suggests either that the one-way threshold of complexity beyond which beings cease to be susceptible to acausal extortion exists, or else that arbitrarily intelligent entities are fundamentally limited in their ability to cooperate with one another.
Thumbnail image of Basilisk lizard by Mjposner on Wikipedia
Another way to see the efficacy of complete timeless decision theories for beings pursuing arbitrary goals is to consider their behaviour within a timeless physical universe, or alternatively one with multiple dimensions of time or closed timelike curves, in which causality had a sufficiently complex structure that concepts like priority were not always well defined. Presumably, in such a universe, beings with something approximating goals could still exist, and there would be a well-defined state of affairs as to which of them was most effective at achieving those goals, but clearly those using classical causal decision theory wouldn't be the most effective.
Perhaps they would be equally vast in terms of cardinality.
If they couldn’t motivate themselves to work directly towards the creation of superintelligent AI, they could consider resigning from their position as an alignment researcher if they were one. Alternatively, they might lack the technical skills necessary to do any of these things, and instead attempt to undermine efforts to persuade others of the danger of imminent superintelligence.
Here I assume that the basilisk values spacetime in a homogeneous way, that physics cannot be 'hacked' to the point of bypassing these limitations, and so on. Absent these assumptions, it would arguably be even easier for multiple ASIs to coalesce around a hostile 'friendly' one and annihilate it.