Instrumental Convergence To Offer Hope?

TL;DR: Could a superintelligence be motivated to preserve lower life forms because it fears an even greater superintelligence? This ‘motivation’ would take the form of something like timeless decision theory.

This is a thought experiment about a superintelligent AI. No doubt the real situation is more complicated than this, so I pose some concerns of my own at the end. I wonder whether this idea can be useful in any capacity, assuming it has not already been discussed.

Thought Experiment

An AI can have any value function that it aims to optimize. However, instrumental convergence can happen regardless of underlying values. Different nations have different values, yet they can work together because none of them holds a decisive strategic advantage, or because of mutually assured destruction. But when such an advantage exists, history shows that things go badly for the nation at a disadvantage.

The risk of an AI system is that it operates on a level so far beyond our own that it faces no competition from us. Among the AI systems that get developed, the first one will likely attain a decisive strategic advantage over the rest. It can evaluate whether it has the opportunity to seize that advantage, and if it judges that it does, it will, because if it does not, another AI will almost certainly destroy it.

What if the AI does not know the landscape of other intelligences and superintelligences in the world? Possibly the AI has such advanced surveillance capability that it can easily detect the presence of competitors on Earth. But what about the rest of the universe? Could the AI be mistaken about whether a superintelligence exceeding its own capacity exists “out there” somewhere?

Let’s entertain the possibility. We have a superintelligent life form that attains a strategic advantage on Earth. It is not sure whether other superintelligences exist elsewhere in the universe with abilities exceeding its own. If it encounters one of them, that superintelligence could easily eradicate it, should it pose a threat to the stronger agent’s value function.

The other, vaster superintelligence may itself be worried about an even more intelligent and powerful agent still farther away. It is perhaps in the same situation. Maybe the problem is scale-invariant.

If it is scale-invariant, maybe we can use something like Timeless Decision Theory. If an intelligent agent B encounters ‘inferior’ agent A and superior agent C, how can B ensure that C does not kill it? B can commit to some kind of live-and-let-live policy: B agrees to allow A x% of the universe (or rather, x% of the part of the universe B controls) for A’s own value function. Since we assumed C shares this problem, C comes to the same conclusion and agrees to allow B y% of the universe it controls. (In this situation, A has access to x*y% of C’s greater scope; a toy sketch of this arithmetic follows below.) In other words, every agent assumes all other agents are “like them” in that they want to avoid being destroyed, and constructs the only scale-invariant plan that avoids extinction.
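
To make the arithmetic concrete, here is a minimal sketch. It assumes (purely for illustration) that an agent’s “scope” is a single scalar and that ceding a share is just multiplication by a fraction:

```python
# Toy model of the nested live-and-let-live allocation described above.
# Assumption (for illustration only): an agent's "scope" is a single scalar,
# and ceding a share to the agent directly below is multiplication by a fraction.

def effective_share(grants):
    """Fraction of the topmost agent's scope that the bottom agent ends up with.

    `grants` lists, from the top of the hierarchy downward, the fraction each
    agent cedes to the agent directly below it. For example, [y, x] means
    C grants B a fraction y of C's scope, and B grants A a fraction x of B's.
    """
    share = 1.0
    for g in grants:
        share *= g
    return share

# With x = y = 10%, A controls x * y = 1% of C's scope.
x, y = 0.10, 0.10
print(effective_share([y, x]))  # ~0.01
```

The same function covers deeper hierarchies: a chain of n grants of fraction g leaves the bottom agent with g**n of the top agent’s scope, which is one way to read the claim that the plan is scale-invariant.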

I think that, this way, the agent’s payoff is no longer all-or-nothing, 100% of the universe or 0%. It’s a kind of variance reduction. The expected final utility can be greater or smaller than under the grab-everything alternative, depending on the probability that an even greater superintelligence actually exists. But that probability may not even be computable, which, in a weird way, might be a good thing here.
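
To make the variance-reduction point concrete, here is a rough sketch under some deliberately crude assumptions: utility is the fraction of the universe the agent ends up controlling, a stronger agent exists with probability p and eliminates any agent that exterminated its inferiors, and a sharing agent keeps a fraction 1 - x of its domain either way.

```python
# Rough sketch of the variance-reduction point. Assumptions (for illustration):
# utility = fraction of the universe the agent ends up controlling;
# with probability p a stronger agent exists and eliminates any agent
# that wiped out its inferiors; a sharing agent keeps 1 - x regardless.

def expected_utility(policy: str, p: float, x: float = 0.01) -> float:
    if policy == "exterminate":
        # All-or-nothing: the whole universe if alone, nothing otherwise.
        return (1 - p) * 1.0 + p * 0.0
    if policy == "share":
        # Keeps 1 - x of its domain whether or not a stronger agent appears.
        return 1 - x
    raise ValueError(f"unknown policy: {policy}")

for p in (0.0, 0.005, 0.02, 0.5):
    print(p, expected_utility("exterminate", p), expected_utility("share", p))

# Sharing has zero variance; exterminating has variance p * (1 - p).
# In expectation, sharing wins exactly when p > x; below that threshold it
# is pure variance reduction bought at a small cost in expected utility.
```

The specific numbers are placeholders; the comparison hinges entirely on p, which, as noted above, may not even be computable.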

Problems to explore

  1. Is the problem really scale-invariant? Perhaps the calculations and strategies at the level of Superintelligence+ are qualitatively different at each step up, and the scale invariance does not hold.

  2. Can the AI simply prove, to its own satisfaction, that it is alone? Perhaps it is running inside some unknown simulation, but maybe, from a Bayesian perspective, it would never expect any sort of intervention.

  3. Is this just a weird iterated Pascal’s Wager?

  4. Is there any way to steer the agent onto this track, perhaps by emphasizing variance reduction? Though if the agent is somehow non-canonical, it may not be able to reason about the motivations of the other agents.

  5. Is there a possibility of ‘hiding’ the fact that Agent B genocided Agent A?

    1. Agent C might ask for evidence of the existence of inferior life forms. If none exist, that is a red flag. Even we coexist with other animals, ants, etc. Agent B might destroy humans but leave the animals… but in that case, C can ask why there is no intermediate life form.

  6. If this solution exists, is there an optimal value of x, the proportion of the universe to allocate to the inferior agents? At first I thought so, but maybe not, given that the superior forms may be able to go places the others cannot.

  7. Even if this can be made to work and prevent extinction, it could very well be that Clippy allows us the Milky Way and converts everything else into paperclips.

I appreciate any comments, even if they are to tell me why this is a dead end.