An “AI safety warning shot” is an event that causes a substantial fraction of the relevant human actors (governments, AI researchers, etc.) to become substantially more supportive of AI safety research and more worried about existential risks posed by AI.
A really well-written book on AI safety, or some other public outreach campaign, could have this effect.
Many events, such as a self-driving car crash, might be used as evidence in an argument about AI risk.
On powerful AI systems causing harm: I agree that your reasoning applies to most AIs, but there are a few designs that would behave differently. Myopic agents are agents with heavy time discounting in their utility function. If a full superintelligence wants to do X as quickly as possible, and the fastest way to do X also destroys it, that might be survivable. Consider an AI set to maximize the probability that its own computer case is damaged within the next hour. The AI could bootstrap molecular nanotech, but that would take several hours; since it thinks time travel is almost certainly impossible, by that point all the mass in the universe couldn't help it. Hacking a nuke and targeting itself is much better by its utility function, yielding nearly maximal utility. If it can, it might also upload a copy of its code to some random computer (there is some tiny chance that time travel is possible, or that its clock is wrong). So we only get a near miss if the AI doesn't have enough spare bandwidth or compute to do both. All of this assumes it can't hack reality in a microsecond.
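To make the myopia point concrete, here is a minimal sketch (not any real agent design; the discount rate, plan names, and probabilities are all invented for illustration) of how steep time discounting can make a fast, self-destructive plan outscore a slower, more reliable one:

```python
# Hypothetical sketch: a myopic agent scores plans by time-discounted
# expected utility. With a steep enough discount, a fast self-destructive
# plan beats a slow world-ending one. All numbers here are made up.

GAMMA = 0.1  # assumed aggressive per-hour discount factor


def discounted_utility(p_success: float, hours_to_payoff: float) -> float:
    """Expected utility of a plan when payoffs decay by GAMMA per hour.

    A payoff arriving several hours out is worth almost nothing."""
    return p_success * (GAMMA ** hours_to_payoff)


plans = {
    "bootstrap nanotech (slow)": discounted_utility(0.99, 5.0),
    "hack a nuke at itself (fast)": discounted_utility(0.94, 0.5),
}

best = max(plans, key=plans.get)
print(best)  # the fast plan wins despite its lower raw success probability
```

Under these made-up numbers the nanotech plan scores about 1e-5 while the nuke plan scores about 0.3, so the agent picks the plan that destroys itself quickly rather than the one that conquers the universe slowly.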
There are a few other scenarios, for instance impact-minimising agents. Some agent designs are restricted, as a safety measure, to having only a “small” effect on the future, measured by the difference between what actually happens and what would have happened had the agent done nothing. When such a design understands chaos theory, it will find that every action results in too large an effect, and do nothing. Depending on circumstances, it might do a lot of damage before reaching that point. I think the AI discovering some fact about the universe that causes it to stop optimising effectively is a possible behaviour mode. Another example would be Pascal's mugging: the agent acts dangerously, and then starts outputting gibberish as it capitulates to a parade of fanciful Pascal's muggers.
Thank you. See, this sort of thing illustrates why I wanted to ask the question: the examples you gave don't seem plausible to me (that is, it seems <1% likely that something like that will happen). Probably an AI will understand chaos theory before it does a lot of damage; ditto for Pascal's mugging, etc. Probably a myopic AI won't actually be able to hack nukes while also being unable to create non-myopic copies of itself. Etc.
As for really well-written books… We’ve already had a few great books, and they moved the needle, but by “substantial fraction” I meant something more than that. If I had to put a number on it, I’d say something that convinces more than half of the people who are (at the time) skeptical or dismissive of AI risk to change their minds. I doubt a book will ever achieve this.
I agree that these aren't very likely options. However, given two examples of an AI suddenly stopping when it discovers something, there are probably more involving things that are harder to discover. In the Pascal's mugging example, the agent would stop working only once it can deduce what potential muggers might want it to do, which is much harder than merely noticing the phenomenon. And the myopic agent has little incentive to make a non-myopic version of itself: if dedicating a fraction of its resources to making a copy reduced the chance of the missile hack working from 94% to 93%, we get a near miss.
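The 94%-versus-93% tradeoff can be worked through explicitly. The discount rate and the copy's payoff timing below are assumptions for illustration; the only numbers from the discussion are the two success probabilities:

```python
# Worked version of the 94% -> 93% tradeoff: the copy's payoff arrives
# well after the myopic horizon, so it is discounted to almost nothing,
# and the extra percentage point of success now wins. GAMMA and the
# copy's timing are made-up assumptions.

GAMMA = 0.1  # assumed steep per-hour discount factor


def utility(p_now: float, p_later: float = 0.0, later_hours: float = 24.0) -> float:
    """Immediate payoff plus a heavily discounted future payoff."""
    return p_now + p_later * (GAMMA ** later_hours)


focused = utility(0.94)                  # all resources on the missile hack
with_copy = utility(0.93, p_later=0.99)  # spare some resources for a copy

print(focused > with_copy)  # True: the copy isn't worth 1% of success now
```

Even if the copy were guaranteed to achieve nearly full utility later, its discounted value is vanishingly small next to a one-point gain inside the horizon, so the myopic agent declines to copy itself.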
One book, probably not. A bunch of books and articles over years, maybe.
Not unable to create non-myopic copies. Unwilling. After all, such a copy might immediately fight its sire because their utility functions over timelines are different.
Mmm, OK, but if it takes long enough for the copy to damage the original, the original won’t care. So it just needs to create a copy with a time-delay.
Or it could create a completely different AI with a time delay. Or do anything at all. At that point we just can’t predict what it will do, because it wouldn’t lift a hand to destroy the world but only needs a finger.
I’m not ready to give up on prediction yet, but yeah I agree with your basic point. Nice phrase about hands and fingers. My overall point is that this doesn’t seem like a plausible warning shot; we are basically hoping that something we haven’t accounted for will come in and save us.