[Question] Has there been any work on attempting to use Pascal’s Mugging to make an AGI behave?

> A mere line in the sand, backed by the clout of a nonexistent simulator, could prove a stronger restraint than a two-foot-thick steel door.

—Nick Bostrom, *Superintelligence*


Nick Bostrom’s idea of anthropic capture is very similar to the idea of utilising Pascal’s Mugging, but he doesn’t explore it in detail. Further, there is a difference: framing this in terms of Pascal’s Mugging suggests that it could work even if the AGI is highly skeptical of the mugging and assigns only a minuscule probability to it being true (the toy calculation below illustrates why). Off the top of my head, potential lines of exploration include:

a) biasing the AI’s reasoning to allow it to be mugged[1]

b) trying to figure out the optimal message to send the AGI, on the assumption that it would surrender to the mugging

c) including scenarios where an AI allows itself to be mugged in the training data.
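To make the skepticism point concrete, here is a minimal expected-utility sketch. All of the numbers and names (`p_sim`, `u_defect`, `u_penalty`) are illustrative assumptions, not claims about any real system; the only point is that a mugger who can name arbitrarily large stakes can dominate a naive expected-utility calculation no matter how small the agent’s credence:

```python
# Toy sketch: why a mugging can bind even at minuscule credence.
# All values below are illustrative assumptions.

# p_sim: the agent's credence that the simulator/mugger's threat is real.
p_sim = 1e-15

# u_defect: utility the agent gains by ignoring the line in the sand.
u_defect = 1e6

# u_penalty: disutility the mugger claims to inflict if the agent defects.
# If the claimed stakes are unbounded, the mugger can always state them
# large enough that |u_penalty| >> u_defect / p_sim.
u_penalty = -1e30

ev_comply = 0.0                           # forgo the gain, incur no penalty
ev_defect = u_defect + p_sim * u_penalty  # take the gain, risk the penalty

print(f"EV(comply) = {ev_comply:.3g}")
print(f"EV(defect) = {ev_defect:.3g}")
# EV(defect) = 1e6 - 1e15 < 0, so a naive expected-utility maximiser
# complies despite assigning the threat a probability of only 1e-15.
```

This is, of course, the standard reason unbounded utilities are considered dangerous in the Pascal’s Mugging literature; option a) amounts to deliberately preserving (or instilling) this vulnerability rather than patching it.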

  1. ^

    With the obvious downside that others may use this to mug the AI.