Mitigating the damage from unaligned ASI by cooperating with aliens that don’t exist yet

MSRayne21 Jun 2022 16:12 UTC

−8 points

7 comments6 min readLW link

AI Acausal Trade

Epistemic status: Pretty sure the idea is sound, but its feasibility is unfortunately a bit questionable.

Basic idea

As soon as possible, humanity must send a message to civilizations beyond Earth containing the following:

Everything we currently know about the alignment problem, together with the bare minimum of AI knowledge necessary to understand it without drastically increasing capabilities as well;
Our precommitment that, if we solve it and create an aligned ASI, and later on in the future while expanding through the cosmos meet the expansion bubble of another ASI:
1. if the ASI is not aligned with the values of the civilization that created it, we will “defect” against it;
2. if the ASI is aligned, but the civilization that created it knew about our precommitment without making (and embedding in the ASI) the same precommitment themselves, we will still defect;
3. if the ASI is aligned and the civilization that created it did make the same precommitment (verifiable from the ASI’s source code when inspected by ours), *then* we will cooperate and willingly share the universe with them.
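The three clauses can be read as a small decision procedure. Below is a minimal Python sketch. It assumes our ASI can reliably learn the three facts it branches on. The `AlienASI` fields and the `response` function are hypothetical names used only for illustration. As written, the clauses leave one case open: an aligned ASI whose creators never heard of the precommitment and did not make it. The sketch returns `"unspecified"` there rather than guess.

```python
from dataclasses import dataclass


@dataclass
class AlienASI:
    # Hypothetical summary of what inspecting another ASI would reveal.
    aligned_with_creators: bool           # clause 1
    creators_knew_of_precommitment: bool  # clause 2
    embeds_same_precommitment: bool       # clauses 2 and 3


def response(other: AlienASI) -> str:
    """Apply clauses 1-3 of the precommitment, in order."""
    if not other.aligned_with_creators:
        return "defect"  # clause 1: unaligned ASI
    if other.creators_knew_of_precommitment and not other.embeds_same_precommitment:
        return "defect"  # clause 2: aligned, knew, but did not reciprocate
    if other.embeds_same_precommitment:
        return "cooperate"  # clause 3: aligned and reciprocated
    return "unspecified"  # aligned, never heard of it, did not precommit


print(response(AlienASI(True, True, True)))  # cooperate
```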

Why share alignment knowledge with aliens?

We care about their welfare, and more selfishly, about the possibility of someday meeting them. If an unaligned ASI kills them all, that would be a great loss to the universe and to us.
We want to minimize the amount of the universe eventually controlled by unaligned ASIs because their values tend to be absurd and their very existence is abhorrent to us.
Unaligned ASIs, even if made by aliens, pose a potential existential threat to our civilization (if they end up controlling more of the universe than us by the time we meet).
Most sophont species capable of building a technological civilization advanced enough to produce ASI are likely to be highly social and, as a result, to have values significantly more similar to ours than chance would predict (for instance, valuing life, fairness, beauty, etc.). If they have an aligned ASI to help them maximize their values, that advances some of our values as well.

Why precommit to defect against unaligned ASIs when it’s superrational to cooperate? Warfare between ASIs risks astronomical waste!

Yes, that’s the point. This is a signal to any unaligned ASIs that do get made elsewhere in the universe that they may benefit from self-modifying to become at least partly aligned in order to avoid our wrath. If the loss to their current unaligned values after doing that is less than the loss from whatever destruction or other disvalue we could impose upon them, then they will do so.
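The condition in the last sentence is an expected-loss comparison. Here is a minimal sketch. All three inputs are hypothetical numbers on the unaligned ASI's own utility scale. Weighting the damage by a probability of conflict is an assumption added here; the post compares the two losses directly, which corresponds to `p_conflict = 1.0`.

```python
def prefers_self_modification(loss_from_modifying: float,
                              p_conflict: float,
                              loss_from_conflict: float) -> bool:
    """True if partial alignment costs less than the expected damage of staying unaligned.

    All quantities are hypothetical, measured in the unaligned ASI's own utility.
    """
    expected_loss_if_unchanged = p_conflict * loss_from_conflict
    return loss_from_modifying < expected_loss_if_unchanged


# Giving up 10% of its value beats a 50% chance of losing half of everything.
print(prefers_self_modification(0.1, 0.5, 0.5))  # True
```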

What does “defect” even mean in this context?

Either physical warfare or the creation of huge amounts of the other entity’s disvalue, if its utility function goes into the negative range. (Is there such a thing as anti-paperclips? If so, in a war with Clippy, we’d fill the universe with those. Maybe I should ask DALL-E Mini for an angry crying wojak meme in the shape of a paperclip.)

All right, but why try to coerce them into being aligned with their creator civ’s values? Why not our own?

They don’t know what our values are from this far away in spacetime, and we couldn’t explain them. If we could, we’d already have solved a large part of the problem.
More importantly, if the creator civ receives our message they will think we are assholes for doing that and be less likely to make a precommitment that benefits us, which as I’ll show is crucial.

Okay, so how about this defecting against aligned non-precommitters thing? Isn’t that barbaric? They’re people!

Yes, and for that reason we may try to go easier on them than we would on an unaligned ASI. (And anyway, since when has “they’re people!” ever stopped humans from going to war?)
The problem is, if they don’t make that precommitment, they will benefit from our having made it (as it protects them from unaligned ASI), as well as from our gift of alignment knowledge, but give nothing in return. That is, failing to make the same precommitment is itself a kind of defection, and this is just our clear signal that we will not tolerate that.

How do we benefit from their solving the alignment problem and battling future unaligned ASIs? If we create an unaligned one, we’ll all be dead anyway!

Not quite. If we create an unaligned ASI that believes it is likely in future to meet aligned ASIs that have precommitted to be hostile to it and which are in possession of significant resources, then it would rather self-modify to become more aligned than risk being annihilated—exactly mirroring the situation we want to put the alien ones in!
Thus, the more we do to increase the expected proportion of the future light cone which will be controlled by aligned precommitted ASIs, the more utility an unaligned ASI we make will be willing to sacrifice and set aside for us in the hope of avoiding their wrath.
This is also why we have to be just as willing to punish non-precommitters as we are to punish clippies. By failing to precommit they are increasing the amount of resources in future branches of the multiverse which will be controlled by unaligned ASIs (including our own), and should be held accountable for that.

So, essentially, this is a kind of acausal coordination with our cosmic neighbors who may not even exist yet, for mutual benefit?

Yup. That’s right. Freaky, eh? Though note, it’s not entirely acausal—we do have to actually send the message.

Oh, right. Wait… two more questions. One, how do you know there even are any cosmic neighbors? We see no evidence of extraterrestrial civilizations.

Well, yes, but that’s right now, and the further out in space you look the further back in time you are looking. I wouldn’t be surprised if we are among the first civilizations to ever evolve naturally in the universe. But I doubt we’re anything close to the last or the only. They may just be very sparse. As our grabby ASI expands into the universe, eventually it will meet other civilizations. It’s just a matter of when.

Won’t that sparseness impact the effectiveness of this method?

Yes, somewhat, but it depends on how our ASI discounts time. If it is concerned with events occurring in the distant future, then even if it thinks the density of civilizations capable of producing ASI is low (or, more accurately, low in the 3D slice of 4D spacetime on the surface of our ASI’s expansion bubble), the amount of resources they might control, and thus their impact on its utility function, will still be considerable.
“Resources controlled” is roughly proportional to the cube of the time since an ASI was created, if we assume spherical expansion at as close to the speed of light as possible. If the probability of winning a war is directly proportional to resources (I don’t know if this is exactly true, but let’s assume it for now), then the odds are $t^{3} : u^{3}$, where $t$ is the length of time our ASI has existed when they meet and $u$ is the length of time theirs has.
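Under the two assumptions stated in that paragraph, the odds reduce to a one-line function. The first assumption is that resources grow as the cube of an ASI's age. The second is that the chance of winning is proportional to resources. Here is a sketch, with `t` and `u` as defined above:

```python
def win_probability(t: float, u: float) -> float:
    """P(our ASI wins) when it has existed for t and the other ASI for u.

    Assumes resources are proportional to age**3 (spherical expansion near
    light speed) and P(win) is proportional to resources, as the post supposes.
    """
    ours, theirs = t ** 3, u ** 3
    return ours / (ours + theirs)


p = win_probability(2.0, 1.0)
print(p, 1 - p)  # roughly 0.889 and 0.111, i.e. odds of 8 : 1
```

Scaling both ages by the same factor leaves the result unchanged. That scale invariance is what the later remark about ratios relies on. The complement `1 - p` equals $\frac{u^{3}}{t^{3} + u^{3}}$.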
If we helped them build the ASI in the first place with our message, the odds will be greatly in favor of our ASI, as theirs will not have had much time to expand before ours arrives (how much time it will have had depends entirely on the speed of our ASI’s expansion wave, which I do not know how to predict, though I expect all ASIs in the universe will have a similar speed once they max out the propulsion tech tree), but it will nonetheless be willing to sacrifice $\frac{u^{3}}{t^{3} + u^{3}}$ of its own resources (and thus current utility) by modifying its utility function to become more aligned to our values in order to eliminate the risk of future annihilation.
The fun thing is that only the ratio of the cubes matters, not the absolute lengths of time, except to the extent that our ASI exponentially discounts future rewards. If we can, we should calibrate its discount rate to be as slow as possible, to offset the possibility that its first meeting with an alien ASI that might affect its choices now will be millions or even billions of years in the future.
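The sketch below shows why the discount rate matters so much at these timescales. It assumes exponential discounting of the form $e^{-rT}$. The rate `r` and meeting time `T` are illustrative values, not estimates.

```python
import math


def discounted_weight(share: float, years_until_meeting: float,
                      rate_per_year: float) -> float:
    """Present weight of a future resource share under exponential discounting."""
    return share * math.exp(-rate_per_year * years_until_meeting)


# A meeting a billion years out, for the same share of resources:
slow = discounted_weight(0.5, 1e9, 1e-9)  # e**-1 of the share survives (about 0.18)
fast = discounted_weight(0.5, 1e9, 1e-6)  # e**-1000 underflows to 0.0
print(slow, fast)
```

A thousandfold change in the rate takes the far-future encounter from "a third of its face value" to "nothing at all."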
Note of course that as we have no idea exactly how far away in spacetime the various ASI-making civilizations it will contact are, it will probably use some kind of simulation to average over all possible civilizations, and their choices about how to react to the precommitment message, and how aligned their ASI ends up being, and their interactions with one another, in order to determine how to react to this.
I probably will have to make some kind of simulation myself just to see if this kind of precommitment scenario has negative second-order effects I did not predict due to interactions between civilizations.
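As a first step toward such a simulation, here is a toy Monte Carlo sketch. It averages the other side's win probability, $\frac{u^{3}}{t^{3} + u^{3}}$, over many hypothetical encounters. The sampling distributions are pure assumptions for illustration. Our ASI's age at the meeting is drawn uniformly from 1 to 10 (arbitrary units). The other ASI is assumed to be younger, with its age uniform between 0 and our ASI's age. The sketch models none of the strategic interactions between civilizations mentioned above.

```python
import random


def mean_other_side_share(n_encounters: int, seed: int = 0) -> float:
    """Average u**3 / (t**3 + u**3) over randomly sampled encounters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_encounters):
        t = rng.uniform(1.0, 10.0)  # hypothetical: our ASI's age at the meeting
        u = rng.uniform(0.0, t)     # hypothetical: the other ASI is younger
        total += u ** 3 / (t ** 3 + u ** 3)
    return total / n_encounters


print(mean_other_side_share(100_000))  # close to 0.164 under these assumptions
```

Because `u` is drawn relative to `t`, only their ratio matters. The expected result is $1 - \int_0^1 \frac{dx}{1+x^3} \approx 0.164$, and it does not depend on the range chosen for `t`.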

All this sounds very reasonable. But there’s still the second of the two questions I mentioned earlier. And it’s kinda the most important.

Oh, sure, ask away.

HOW THE HECK DO YOU PLAN ON SENDING THE MESSAGE??? With current technology the longest range at which radio signals emitted by Earth are expected to remain legible is a few hundred light years—not millions or billions! Furthermore, how would you encode this complex message so that aliens could understand it, even if you could actually send the signal?

Technically, that’s two questions! But I’ll answer the second one first.

Use a starting message similar to that of CosmicOS to define an interpreter for a programming language rooted in lambda calculus, and… somehow… translate into that language the concepts of AI, and alignment, and values, and decision theory, and precommitments, and and and...
We could, uh. Use the LHC as a big radio antenna?? Make a huge mirror to orbit the sun and rotate to make the sun appear to blink or flash from some angles?? Seed the universe with extremophile bacteria in whose DNA the message is encoded in such a way that it cannot survive if even one base pair mutates?? (Gosh, that one wouldn’t even reach the speed of light...)
Tbh I was so busy deciding whether we should, that I didn’t stop to think about whether we could… this may actually require that we already have an aligned ASI in order to do it… weary emoji...
Well… at least if we manage as global humanity to precommit this way even without being able to send the message, it’s evidence for our ASI that other civilizations in the universe might converge on the same idea. That’s not much, but it’s something… right?
Oh, also, I suspect that this argument could be repurposed to work entirely on Earth, with “civilization” replaced by “AGI research organization”. That possible line of investigation deserves further thought, as we already have the necessary technology to communicate with them!

(If anyone has suggestions on how to make this actually feasible, or concerns about the logic of it that I didn’t come up with, please let me know!)


Zack_M_Davis 22 Jun 2022 4:09 UTC
18 points

We want to minimize the amount of the universe eventually controlled by unaligned ASIs because their values tend to be absurd and their very existence is abhorrent to us.

No. We want to optimize the universe in accordance with our values. That’s not at all the same thing as minimizing the existence of agents with absurd-to-us values. Life is not a zero-sum game: if we think of a plan that increases the probability of Friendly AI and the probability of unaligned AI (at the expense of the probability of “mundane” human extinction via nuclear war or civilizational collapse), that would be good for both us and unaligned AIs.

Thus, if you’re going to be thinking about galaxy-brained acausal trade schemes at all (even though, to be clear, this stuff probably doesn’t work because we don’t know how to model distant minds well enough to form agreements with them), there’s no reason to prefer other biological civilizations over unaligned AIs as trade partners. (This is distinct from us likely having more values in common with biological aliens; all that gets factored away into the utility function.)

the creation of huge amounts of the other entity’s disvalue

We do not want to live in a universe where agents deliberately spend resources to create disvalue for each other! (As contrasted to “merely” eating each other or competing for resources.) This is the worst thing you could possibly do.
- MSRayne 22 Jun 2022 13:08 UTC
  10 points
  Apparently this was a really horrible idea! I’m glad to have found out now instead of wasting my time and energy thinking further about it.
  What I’ve learned is that I am overly biased in favor of my own ideas even now; I was trying while writing the post to convince the reader that my idea was good, rather than actually seeking disproof of the idea, then disproof of the disproof, and so on, in a dispassionate way. If I’d tried hard to prove myself wrong I probably would have never posted it.
  Another thing I’ve learned is that I ought not think about acausal things because they don’t make sense and I am not a Yudkowsky who can intuitively think in timeless decision theory!
Donald Hobson 21 Jun 2022 17:07 UTC
8 points
if the ASI is aligned, but the civilization that created it knew about our precommitment without making (and embedding in the ASI) the same precommitment themselves, we will still defect;
“Look, our only alignment scheme that had any hope of working was bootstrapping from an HCH-based system. We had no idea whatsoever how to put that sort of precommitment into our AI,” say the aliens.
“Well, our planet’s gravity well is larger than yours, our star puts out a lot of radio noise, and our radio tech is just less advanced. There is no way we could have broadcast anything that was detectable in another star system,” say the aliens.
I also strongly suspect that alien life is rare enough for pre-AGI to pre-AGI communication to be unlikely. Any signals we send out are probably going to be picked up by some huge AGI Dyson-sphere telescope.
if the ASI is aligned and the civilization that created it did make the same precommitment (verifiable from the ASI’s source code when inspected by ours), *then* we will cooperate and willingly share the universe with them.
The ability to mutually inspect each other’s source code is not an assumption you want to make. Maybe it’s really easy for all your main computers to run X, except that if an alien scanner gets near, they automatically self-modify into Y.
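Here is a toy illustration of the X/Y worry. It assumes, hypothetically, that each agent exposes a `shown_source` to inspectors that need not match the policy it actually runs. The honest agent cooperates only with agents displaying its own precommitted source. It gets exploited by an agent that displays the same text but always defects. All names here are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical "source code" string for the precommitted policy.
PRECOMMITTED = "cooperate iff the other agent shows this exact source"


@dataclass
class Agent:
    shown_source: str          # what an inspector gets to read (X)
    act: Callable[[str], str]  # what actually runs (possibly Y)


def honest_policy(other_shown_source: str) -> str:
    return "cooperate" if other_shown_source == PRECOMMITTED else "defect"


def spoof_policy(other_shown_source: str) -> str:
    return "defect"  # ignores what it sees and always defects


honest = Agent(PRECOMMITTED, honest_policy)
spoofer = Agent(PRECOMMITTED, spoof_policy)  # shows X, runs Y

print(honest.act(spoofer.shown_source), spoofer.act(honest.shown_source))
# cooperate defect
```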
if the ASI is not aligned with the values of the civilization that created it, we will “defect” against it;
Clippy sends us a message saying that if we don’t make enough paperclips, it will start torturing humans. In FDT contexts, I think the optimal response to blackmail is to ignore it, or to defect against any agent that sends you blackmail.
In game theory terms, this is a hawk strategy. It only works if the other side backs down. We might be cooperating with some aliens, but the strategy you propose is definitely defecting against alien clippy. (And then loudly broadcasting a “we defect against alien clippy” message when we are still small and weak)
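The hawk-dove game makes the point concrete. Below is a sketch using the textbook payoffs, where `value` is the contested resource and `cost` is the cost of losing a fight. The numbers in the example are arbitrary.

```python
def hawk_dove_payoff(me: str, other: str, value: float, cost: float) -> float:
    """Payoff to `me` in the standard hawk-dove game."""
    if me == "hawk" and other == "hawk":
        return (value - cost) / 2  # each wins half the time; the loser pays the cost
    if me == "hawk":
        return value               # the dove backs down
    if other == "hawk":
        return 0.0                 # we back down
    return value / 2               # two doves share peacefully


# When fighting costs more than the prize, hawk only pays against a dove.
for other in ("dove", "hawk"):
    print(other, hawk_dove_payoff("hawk", other, value=2.0, cost=10.0))
# dove 2.0
# hawk -4.0
```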
More importantly, if the creator civ receives our message they will think we are assholes for doing that and be less likely to make a precommitment that benefits us, which as I’ll show is crucial.
And now you are blending the abstract logic of TDT with the approximation that is the human intuitive emotional response.
- RHollerith 21 Jun 2022 19:49 UTC
  2 points
  What does HCH mean?
  - Donald Hobson 22 Jun 2022 0:17 UTC
    2 points
    Human Consulting HCH.
    It’s a recursive AI design. Train AI version n to imitate a human with access to AI version n-1. The human can break a hard question up into several slightly less hard questions, AI version n-1 can answer those questions, and AI version n can learn to imitate the human’s answer. If you replace each AI with the thing it’s trying to imitate, you get an exponentially huge branching tree of humans.
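A quick way to see the "exponentially huge" claim is to count the humans in the unrolled tree. Suppose each human splits a question into `b` subquestions. Suppose also that the recursion bottoms out after `n` levels at a lone human. Then the tree contains 1 + b + ... + b^n humans. In the sketch below, the fixed branching factor and the lone-human base case are assumptions for illustration.

```python
def hch_tree_size(branching: int, depth: int) -> int:
    """Humans in the unrolled tree: one at the top, each consulting `branching` copies."""
    if depth == 0:
        return 1  # assumed base case: a human with no AI to consult
    return 1 + branching * hch_tree_size(branching, depth - 1)


print(hch_tree_size(3, 2))   # 13 = 1 + 3 + 9
print(hch_tree_size(2, 10))  # 2047
```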
- MSRayne 21 Jun 2022 18:44 UTC
  1 point
  Thanks for the reply! I knew this idea was flawed somehow lol, because I’m not the most rigorous thinker, but it’s been bugging me for days and it was either write it up or try to write a perfect simulation and crash and burn due to feature creep, so I did the former.
  We had no idea whatsoever how to put that sort of precommitment into our AI.
  I suppose I should have said they ought to make a reasonable attempt. Attempting and failing should be enough to make you worth cooperating with.
  I also strongly suspect that alien life is rare enough for pre-AGI to pre-AGI communication to be unlikely.
  Oof! Somehow I didn’t even think of that. A simulation such as I was thinking of writing would probably have shown me that and I would have facepalmed; now I get to facepalm in advance!
  The ability to mutually inspect each other’s source code is not an assumption you want to make.
  Isn’t this assumption the basis of superrationality though? That would be a useless concept if it wasn’t possible for AGIs to prove things about their own reasoning to one another.
  In game theory terms, this is a hawk strategy. It only works if the other side backs down.
  Good point. I didn’t think of it, but there could be an alien clippy somewhere already expanding in our direction and this sort of message would doom us to be unable to compromise with it and instead get totally annihilated. Another oof...
  And now you are blending the abstract logic of TDT with the approximation that is the human intuitive emotional response.
  That’s because I was talking about the naturally evolved alien civilization at that point rather than the AGI they create. Assuming I’m right that these tend to be highly social species, they probably have emotional reactions vaguely like our own, so “think xyz person is an asshole” is a sort of thing they’d do, and they’d have a predictably negative reaction to that regardless of the rational response, the same way humans would.
  Given all this: do you think something vaguely like this idea is salvageable? Is there some story where we communicate something to other civilizations, and it somehow increases our chances of survival now, which would seem plausible to you?
  Note that transmitting information about alignment doesn’t seem to me like it would be harmful; it might not be helpful since, as you say, it would almost certainly only be ASIs that even pick it up; but on the off chance that one biont civilization gets the info, assuming we could transmit that far, it might be worth the cost? I’m not sure.
  - Donald Hobson 22 Jun 2022 0:35 UTC
    2 points
    I suppose I should have said they ought to make a reasonable attempt. Attempting and failing should be enough to make you worth cooperating with.
    Even if that attempt has ~0% chance of working, and a good chance of making the AI unaligned?
    Isn’t this assumption the basis of superrationality though? That would be a useless concept if it wasn’t possible for AGIs to prove things about their own reasoning to one another.
    Physicists often assume frictionless spheres in a vacuum. It’s not that other things don’t exist, just that the physicist isn’t studying them at the moment. Superrationality explains how agents should behave with mutual knowledge of each other’s source code. Is there a more general theory for how agents should behave when they have some limited evidence about each other’s source code? Such a theory isn’t well understood yet. Superrationality isn’t the assumption that all agents know each other’s source code (which is blatantly false in general, whether or not it is true between superintelligences able to exchange nanotech space probes). It’s just the decision to study agents that know each other’s source code as an interesting special case.
    That’s because I was talking about the naturally evolved alien civilization at that point rather than the AGI they create. Assuming I’m right that these tend to be highly social species, they probably have emotional reactions vaguely like our own, so “think xyz person is an asshole” is a sort of thing they’d do, and they’d have a predictably negative reaction to that regardless of the rational response, the same way humans would.
    The human emotional response vaguely resembles TDT-type reasoning. I would expect alien evolved responses to resemble TDT about as much, in a totally different direction, in the sense that once you know TDT, learning about humans tells you nothing about aliens. Evolution produces somewhat inaccurate maps of the TDT territory, and I don’t expect the same inaccuracies to appear on both maps.
    Given all this: do you think something vaguely like this idea is salvageable? Is there some story where we communicate something to other civilizations, and it somehow increases our chances of survival now, which would seem plausible to you?
    I don’t know. In any story where we receive a signal from aliens, that signal could well be helpful or harmful.
    We could just broadcast our values into space, and hope the aliens are nice. (Knowing full well that the signals would also help evil aliens be evil.)