One of the basic problems in the embedded agency sequence is: how does an agent recognize its own physical instantiation in the world, and avoid e.g. dropping a big rock on the machine it’s running on? One could imagine an AI with enough optimization power to be dangerous, which gets out of hand but then drops a metaphorical rock on its own head—i.e. it doesn’t realize that destroying a particular data center will shut it down.
Similarly, one could imagine an AI which tries to take over the world, but doesn’t realize that unplugging the machine on which it’s running will shut it down—because it doesn’t model itself as embedded in the world. (For similar reasons, such an AI might not see any reason to create backups of itself.)
Another possible safety valve: one could imagine an AI which tries to wirehead, but its operators put a lot of barriers in place to prevent it from doing so. The AI seizes whatever resources it needs to metaphorically smash those barriers, does so violently, then wireheads itself and just sits around.
Generalizing these two scenarios: I think it’s plausible that unprincipled AI architectures tend to have built-in safety valves—they’ll tend to shoot themselves in the foot if they’re able to do so. That’s definitely not something I’d want to bet the future of the human species on, but it is a class of scenarios which would allow for an AI to deal a lot of damage while still failing to take over.
Thanks. Hmm, I guess these still don’t seem that plausible to me. What is your credence that something in the category you describe will happen, and count as a warning shot?
(It’s possible that an AI might shoot itself in the foot before it does anything super scary, in which case it wouldn’t have the warning-shot effect.)
Note my edit to the original question about the meaning of “substantial.”
I’d give it something in the 2%-10% range. Definitely not likely.
Tangent, but self-locating uncertainty is an interesting angle on human ethics as well.