Vanessa Kosoy comments on Vanessa Kosoy’s Shortform

Vanessa Kosoy 29 Mar 2021 19:11 UTC
LW: 3 AF: 1
AF
More observations about this attack vector (“attack from counterfactuals”). I focus on “amplifying by subjective time”.
- The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system in the beginning of the cycle^[1]. On the other hand, the capability of the attacker depends on its power in the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.
- Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system.
- Additional information about the external world makes system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 37%. However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.
1. ↩︎
  More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk.