Planned summary:

We often talk about aligning AIs in a way that is _competitive_ with unaligned AIs. However, you might think that we need them to be _better_: after all, unaligned AIs only have to pursue one particular goal, whereas aligned AIs have to deal with the fact that we don’t yet know what we want. We might hope that regardless of what goal the unaligned AI has, any strategy it uses to achieve that goal can be turned into a strategy for acquiring _flexible_ influence (i.e. influence useful for many goals). In that case, **as long as we control a majority of resources**, we can use any strategies that the unaligned AIs can use. For example, if we control 99% of the resources and an unaligned AI controls 1%, then at the very least we can split up into 99 “coalitions” that each control 1% of resources and use the same strategy as the unaligned AI to acquire flexible influence, and this should lead to us obtaining 99% of the resources in expectation. In practice, we could do even better, e.g. by coordinating to shut down any unaligned AI systems.
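The 99%-in-expectation claim follows from symmetry: if all 100 coalitions run the same strategy, each has an expected final share of 1/100, so the 99 aligned coalitions jointly expect 99%. A minimal toy model (an assumption of this sketch, not anything from the post: each round every coalition's resources grow by an i.i.d. random factor, then shares are renormalized) illustrates this:

```python
import random

def simulate(n_coalitions=100, n_rounds=30, n_trials=500, seed=0):
    """Monte Carlo estimate of the aligned coalitions' expected final share
    when every coalition uses the *same* strategy (modeled here as i.i.d.
    random multiplicative growth -- a stand-in for any symmetric strategy)."""
    rng = random.Random(seed)
    total_aligned_share = 0.0
    for _ in range(n_trials):
        shares = [1.0 / n_coalitions] * n_coalitions
        for _ in range(n_rounds):
            # identical strategies => growth factors drawn from one distribution
            growth = [rng.uniform(0.5, 1.5) for _ in range(n_coalitions)]
            shares = [s * g for s, g in zip(shares, growth)]
            total = sum(shares)
            shares = [s / total for s in shares]
        # coalitions 0..98 are aligned; the last one is the unaligned AI
        total_aligned_share += sum(shares[:-1])
    return total_aligned_share / n_trials

print(simulate())  # close to 0.99, by symmetry
```

Note that symmetry only pins down the *expectation*; individual runs can end with the unaligned coalition holding far more or less than 1%, which is part of why the failure modes the post discusses matter.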
The premise that we can use the same strategy as the unaligned AI, despite needing _flexible_ influence, is called the **strategy-stealing assumption**. Solving the alignment problem is critical to strategy-stealing: otherwise, unaligned AIs would have an advantage at thinking that we could not steal, and the strategy-stealing assumption would break down. This post discusses **ten other ways that the strategy-stealing assumption could fail**. For example, an unaligned AI could pursue a strategy that involves threatening to kill humans, and we might not be able to use a similar strategy in response, because the unaligned AI might not be as fragile as we are.
Planned opinion:
It does seem to me that if we’re in a situation where we have solved the alignment problem, we control 99% of resources, and we aren’t infighting amongst ourselves, we will likely continue to control at least 99% of the resources in the future. I’m a little confused about how we get to this situation, though: the scenarios I usually worry about are the ones in which we fail to solve the alignment problem but still deploy unaligned AIs, and in these scenarios I’d expect unaligned AIs to get the majority of the resources. I suppose in a multipolar setting with continuous takeoff, if we have mostly solved the alignment problem but still accidentally create unaligned AIs (or some malicious actors create them deliberately), then this setting where we control 99% of the resources could arise.