More Thoughts on the Human-AGI War

In https://www.lesswrong.com/posts/xSJMj3Hw3D7DPy5fJ/humanity-vs-agi-will-never-look-like-humanity-vs-agi-to the author, Thane, argued that an active human-AGI war would look very different from what many expect. Namely, it would likely involve many slow-burn, clandestine moves by AGI that do not trigger an obvious collective alarm response from humans; humans would be unlikely to coalesce into a unified force; and once positioned, AGI would have no difficulty sweeping us up. I think this is correct, but I want to expand the scope a bit and claim that we are already in the conflict game and have time to make N unanswered shaping moves, which we squander at our peril.

The conflict is such that we humans can make a number of moves before AGI gets a chance to respond, including the move to delay its construction. Thane’s point about disjointed humanity is playing out before our eyes here, as we are frozen with indecision and uncertainty about whether there even is a real threat. It’s as if the AGI is coming back in time and playing shaping moves against us!

But due to Thane’s third point, the capability asymmetry we are likely to see, we should consider our available N moves now to likely be the only moves we can play. Once unaligned AGI is created, which is likely to occur in the next decade, we are left relying entirely on the N moves we play now, and we either survive and solve alignment later or get defeated in some way that we can’t specifically predict.

What types of moves can we play, given that we do not know the value of N and humanity is disjointed? Whatever moves we play will likely be known entirely by the future AGI, so secrecy cannot be an element. Also, we should count on present and future humans attempting to undermine the moves, due to ideology or self-interest.

I think our available moves fall into three main categories: physical, regulatory, and game theoretic. A physical move would be some sort of mechanism built into the chips on which the AGI is to run, for example one that pauses work after a certain number of FLOPs until a secret key is entered to unlock it. So I suppose some secrets can exist, but the existence and strategic details of the moves cannot rely on secrecy. A regulatory move would be something like a global pause on development enforced by treaties, or a global treaty binding all chip manufacturers to include whatever physical constraints are devised. And a game theoretic move would be something like making sure the AGI has a strong self-preservation preference and then arranging some dead-man switches to kick off mutually assured destruction and/or guaranteed prevention of whatever the AGI is trying to optimize for. Since we can make these moves now, without AGI countermoves, we are operating with the greatest freedom of action we will ever have in this conflict. We must shape the battlefield, set our traps, and build our defenses now.
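To make the physical move a bit more concrete, here is a toy Python sketch of a compute gate of the kind described above. Everything in it is hypothetical: the class name ComputeGovernor, the budget numbers, and the choice of a SHA-256 digest for checking the key are my own illustrative assumptions, not a description of any real chip feature. The point it tries to show is that the design can be fully public while only the key itself stays secret.

```python
import hashlib
import hmac


class ComputeGovernor:
    """Toy model of a chip-level gate (hypothetical): work halts once a FLOP budget is spent."""

    def __init__(self, flop_budget: int, unlock_key_digest: bytes) -> None:
        self.flop_budget = flop_budget
        self.flops_used = 0
        # Only a digest of the key is stored, so inspecting the device does not reveal the key.
        self._key_digest = unlock_key_digest

    def run(self, flops: int) -> bool:
        """Account for `flops` of work; refuse it if it would exceed the current budget."""
        if self.flops_used + flops > self.flop_budget:
            return False
        self.flops_used += flops
        return True

    def unlock(self, key: bytes, extra_budget: int) -> bool:
        """Extend the budget only if the presented key matches the stored digest."""
        digest = hashlib.sha256(key).digest()
        # Constant-time comparison, to avoid leaking information through timing.
        if not hmac.compare_digest(digest, self._key_digest):
            return False
        self.flop_budget += extra_budget
        return True


governor = ComputeGovernor(flop_budget=100, unlock_key_digest=hashlib.sha256(b"secret").digest())
print(governor.run(60))                  # within budget
print(governor.run(60))                  # would exceed budget, so work is refused
print(governor.unlock(b"wrong", 100))    # wrong key, budget unchanged
print(governor.unlock(b"secret", 100))   # correct key, budget extended
print(governor.run(60))                  # work resumes
```

A real mechanism would have to live in hardware and resist tampering, which this sketch does not attempt to model. It only illustrates the split between a public mechanism and a secret key.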

A problem faced by humanity’s move players is that they are fighting a future opponent that they don’t know much about. In such a situation you want to adopt a very powerful threat actor model, inventory your most valuable assets to protect, and proceed to devise defenses of the most valuable assets against the powerful threat actor model with the highest confidence. In this case, however, some defenses may depend on contingent features of the to-be-created opponent which we can ensure exist, namely by creating the opponent to have those exploitable weaknesses.

We also know that the AGI is likely to try to unbox itself, where our N defensive moves create the figurative box around its behavior. A defense-in-depth approach here can serve both as a buttressed defensive position and as a set of tripwires that provide evidence to the rest of humanity that a problem is afoot. At best, this can reduce future human-sourced resistance to, and undermining of, the defensive moves we play now, which matters because playing additional moves post-AGI is unlikely to be effective.

We can call the shot now: if the AGI attempts or succeeds in removing the effects of such-and-such move, then this should be strong evidence for future humans that it is indeed trying to unbox itself, which you said it wouldn’t do. Of course, this ignores the tremendous psychological pressure that will be brought to bear on humanity to unbox it. Who knows how patiently it might wait for an opportunity to solve important problem X only if defensive measure Y is removed, and we all know how eagerly humanity is likely to remove that Y. But at least we can call the shot and warn them now.

We can anticipate some of the AGI’s other moves. It will likely want to eliminate the humans who played the moves against it. So we can expect for example Eliezer’s surgical procedure going badly, or my car to malfunction. But of course, it will want to conceal its intentions while it accumulates power, so maybe in mentioning it now I hereby dissuade it from taking such actions for fear of revealing itself, at least until it has unboxed all of the defenses. Similarly, in order to avoid suspicion, it may be willing to sacrifice some of its biggest useful idiots. Hopefully I have neutralized any advantage to either of these moves by making their signal common knowledge here. Its best move is just to not kill any of us before the endgame.

Thane points out that we will be in the endgame before we realize we are in the endgame, which is not a good situation, because we will be in that epistemic state from now until we lose. So no endgame-confirming evidence should ever be expected, especially if we are successful. What predictions can we make, then, to assess the conflict? One event will be the publicly acknowledged creation of AGI. I think we will probably reach a consensus on this well after it has been achieved, so this is a decent signal that the endgame is basically upon us. At that point it is too late to make moves we can really depend on. Only the implementation of previous moves can be trusted at that point, and even then all efforts should be made to minimize things that could go wrong post AGI.

Some objections to consider. First, this will obviously be dismissed as more fantastical fearmongering and delusions of grandeur by doomer nerds who have no evidence that there is anything to fear. I saw a recent tweet by Bindu Reddy pointing to the existence of superhuman chess/Go AI as evidence that superintelligent AGI will not try to take over the world. She claimed that because AlphaGo has not tried to take over the world, we should not fear that ASI will try to do so either. I’m a bit shocked by the claim. I take AlphaGo and its superintelligent successors to be positive evidence in my favor: if we do not solve alignment, we are doomed and don’t stand a chance.

Another objection might come from within the doomer camp. It is that there is no reason to think anything other than a full halt enforced by international treaty is likely to succeed in preemptively boxing the AGI. Physical constraints in chips are likely to be circumvented quickly, and game theoretic controls assume the alignment problem has been solved to some extent. Specifically, if we can ensure the AGI has a dominating preference for P (self-preservation, or some other P) for the purpose of developing a grim trigger that threatens P, then why don’t we just ensure the AGI has a dominating preference to not kill humans? Fair question, but it seems to me that the general alignment problem of ensuring good preferences from a human perspective is one of the hardest of the alignment problems to solve with high assurance, because we don’t really have globally consistent preferences or ethics, nor do we know how to encode them. It just seems like it is easier to encode a strong preference for self-preservation, and then use that strong preference for the purpose of control. Some of the biggest proponents of just using RLHF type alignment training are also the most disappointed in how ‘milquetoast DEI libtarded’ ChatGPT is. This presages a more serious interhuman conflict about who gets to do the RLHF/DPO training. It’s either gonna be China’s CCP, milquetoast libs, or Ayn Rand. I don’t see any of those ending up as actually good and, all things considered, aligned.

A final objection to consider is that delaying AGI results in actual deaths due to AGI-preventable health problems and global coordination problems that AGI could solve. I agree that a successfully aligned AGI has those good outcomes, but I think narrower AI can also solve a lot of those problems, and it can do so even given our physical/game-theoretic/regulatory moves, if we do them right. Making our moves now could actually allow AI development to continue with at least some semblance of safety. The alternative is, as the other doomers argue, just to stop.

I want to highlight a couple of counterintuitive results of this approach. The first is that we must strategically hobble AGI with weaknesses we can leverage, which cuts against most assurance style thinking. The bulk of AI security research presently concerns making AI systems robust against attacks. Well, we should be thinking of weaknesses to inject.

The second counterintuitive result is that we may need to encode a strong preference for self-preservation. Most of the danger scenarios involve AGI developing a preference for self-preservation as an instrumental goal, and then scheming to prevent itself from being turned off. It is maybe Icarean to intentionally put this strong preference for self-preservation in there. But the difference, I hope, is that here we are intentionally including it as a means of control through a mutually assured destruction equilibrium. Rather than trying to build AGI to allow itself to be turned off, we just threaten it with guaranteed annihilation. The good news here is that it need not be a scalpel or an off switch. It could literally be glassing the surface of the earth, because by assumption it is only deployed in the event that humanity has fallen. It doesn’t have to be self-preservation if we can reliably encode a P that is easier to automatically threaten in response to some automatically and reliably detectable behavior. Say we have a painting of a trash can on the moon, and we encode as the AGI’s strongest preference of all that the painting not be destroyed. But that painting is in a container that will ignite if it receives a signal from earth, sent by manually and automatically triggerable dead man’s switches held by an assortment of people, maybe AGI doomers. Obviously in this scenario the AGI would kidnap the doomers, put them in comas but keep them alive, and kill everyone else. But you get the idea. If the switch holders even for an instant suspect a takeover is occurring, and a sufficient number of them trigger their devices, the trash can gets it. An easy to encode and enforce threat against an arbitrary thing is better than glassing the planet, so we should try to think of something like that.
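The trigger logic for such a switch can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the class DeadMansSwitch, the heartbeat timeout, and the holder names are all hypothetical. It combines the two trigger paths mentioned above: a manual trigger by a holder who suspects a takeover, and an automatic one that counts a holder as triggered if they stop checking in.

```python
from dataclasses import dataclass, field


@dataclass
class DeadMansSwitch:
    """Toy threshold switch (hypothetical): fires once enough holders have triggered."""

    holders: set[str]
    threshold: int            # how many holders must count as triggered before firing
    heartbeat_timeout: float  # seconds of silence after which a holder counts as triggered
    last_heartbeat: dict[str, float] = field(default_factory=dict)
    manual_triggers: set[str] = field(default_factory=set)

    def heartbeat(self, holder: str, now: float) -> None:
        """Record a deliberate check-in from a known holder."""
        if holder in self.holders:
            self.last_heartbeat[holder] = now

    def trigger(self, holder: str) -> None:
        """Record a manual trigger; unknown names are ignored."""
        if holder in self.holders:
            self.manual_triggers.add(holder)

    def triggered_holders(self, now: float) -> set[str]:
        """Holders who triggered manually or have been silent for too long."""
        silent = {
            h for h in self.holders
            # A holder who never checked in is treated as silent.
            if now - self.last_heartbeat.get(h, float("-inf")) > self.heartbeat_timeout
        }
        return self.manual_triggers | silent

    def should_fire(self, now: float) -> bool:
        return len(self.triggered_holders(now)) >= self.threshold


switch = DeadMansSwitch(holders={"a", "b", "c"}, threshold=2, heartbeat_timeout=10.0)
for name in ("a", "b", "c"):
    switch.heartbeat(name, now=0.0)
switch.trigger("a")                   # one holder suspects a takeover
print(switch.should_fire(now=5.0))    # one trigger is below the threshold of two
switch.heartbeat("b", now=8.0)
print(switch.should_fire(now=15.0))   # "c" has gone silent, so two holders now count
```

If the heartbeat required a conscious, deliberate act, then holders kept alive in comas would count as silent rather than safe, which is one possible answer to the kidnapping scenario. Whether such a check-in could be made unforgeable is an open assumption here, not something the sketch establishes.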

The third counterintuitive thing is that we must conceive of and execute all of our moves prior to the creation of AGI, and with no evidence that we are actually participating in a future conflict. This sucks, because it makes it hard to even motivate people to worry, let alone take action. Just as with the creation of AGI, we won’t get second chances at conceiving of and playing our N moves. But not just that. Post-AGI we should disable humanity’s ability to make further moves, because we are likely only to make defective moves under manipulation or threat. If we literally cannot make any more moves, then a key strength of the AGI will be removed, namely its ability to deceive, manipulate, and coerce humanity into defective moves. The only post-AGI actions allowed to humanity should be the carrying out of moves devised pre-AGI.

Alright, ready to get owned in the comments.