Tactics against Pascal’s Mugging

This is meant as a rough collection of five ideas of mine on potential anti-Pascal Mugging tactics. I don’t have much hope that the first three will be of any use at all, and I’m afraid I’m not mathematically inclined enough to know whether the last two are any good even as a partial solution to the core problem of Pascal’s Mugging; so I’d appreciate it if people with better mathematical credentials than mine could see whether any of my intuitions can be formalized in a useful manner.

0. Introducing the problem (this may bore you if you’re aware of both the original and the mugger-less form of Pascal’s Mugging)

First of all the basics: Pascal’s Mugging in its original form is described in the following way:

  • Now suppose someone comes to me and says, “Give me five dollars, or I’ll use my magic powers from outside the Matrix to run a Turing machine that simulates and kills 3^^^3 people.”

This is the “shallow” form of Pascal’s Mugging, which includes a person who is (almost certainly) attempting to deceive the prospective AI. However, let’s introduce some further statements similar to the above, to head off particular objections that might be used in some (even shallower) attempted rebuttals:

  • “Give me five dollars, and I’ll use my magic powers from outside the Matrix to increase the utility of every human being by 3^^^^3 utilons” (a supposedly positive trade rather than a blackmailer’s threat)

  • “I’m an alien in disguise—unless you publicly proclaim allegiance to your insect overlords, we will destroy you then torture all humanity for 3^^^^3 years” (a prankster asks for something which might be useful to an actual alien, but on a material-level not useful to a human liar)

  • “My consciousness has partially time-travelled from the future into the past, and one of the few tidbits I remember is that it would be of effectively infinite utility if you asked everyone to call you Princess Tutu.” (no trade offered at all, seemingly just a statement of epistemic belief)

  • Says the Devil “It’s infinitely bad to end that song and dance
    And I won’t tell you why, and I probably lie, but can you really take that chance?”
    Blaise fills with trepidation as his calculations all turn out the devil’s way.
    And they say in the Paris catacombs, his ghost is fiddlin’ to this day.

I think these are all trivial variations on this basic version of Pascal’s Mugging: the utility a prankster derives from the pleasure of successfully pranking the AI wouldn’t be treated as different in kind from the utility of 5 dollars, nor is the explicit offer of a trade different from the supposedly free offer of information.

The mugger-less version is, on the other hand, more interesting and more problematic. You don’t actually need a person to make such a statement: the AI, without any prompting, can assign prior probabilities to theories which predict outcomes whose positive or negative value vastly outweighs their assigned improbability. I’ve seen it best described in the following comment by Kindly and the corresponding response by Eliezer:

Kindly: Very many hypotheses—arguably infinitely many—can be formed about how the world works. In particular, some of these hypotheses imply that by doing something counter-intuitive in following those hypothesis, you get ridiculously awesome outcomes. For example, even in advance of me posting this comment, you could form the hypothesis “if I send Kindly $5 by Paypal, he or she will refrain from torturing 3^^^3 people in the matrix and instead give them candy.” Now, usually all such hypotheses are low-probability and that decreases the expected benefit from performing these counter-intuitive actions. But how can you show that in all cases this expected benefit is sufficiently low to justify ignoring it?

Eliezer Yudkowsky: Right, this is the real core of Pascal’s Mugging [...]. For aggregative utility functions over a model of the environment which e.g. treat all sentient beings (or all paperclips) as having equal value without diminishing marginal returns, and all epistemic models which induce simplicity-weighted explanations of sensory experience, all decisions will be dominated by tiny variances in the probability of extremely unlikely hypotheses because the “model size” of a hypothesis can grow Busy-Beaver faster than its Kolmogorov complexity.
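To make the quoted point a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is an assumption of mine for illustration (the ~500-bit description length, the 10^1000 payoff lower bound); the only point it illustrates is that a payoff nameable in k bits can vastly exceed 2^k, so a simplicity prior alone does not keep the expected-utility term small.

```python
import math

# Toy illustration of the quoted point (all numbers are assumptions).
# A simplicity prior gives a hypothesis of description length k bits a prior
# of roughly 2**-k, but the payoff the hypothesis *names* can be written in
# those same k bits and still dwarf 2**k (e.g. via up-arrow notation), so
# prior * payoff need not be small.

log10_prior = -500 * math.log10(2)  # assumed: ~500 bits describe the Matrix-Lord hypothesis
log10_payoff = 1000                 # assumed lower bound: 10**1000 lives, vastly less than 3^^^3

# Work in log10 to avoid float overflow.
print(f"log10(prior * payoff) ~ {log10_prior + log10_payoff:.1f}")   # ~ +849.5
```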

What follows are five ideas of mine, ordered least-to-most promising in the search for a general solution. Though I initially considered them seriously, I no longer really think that (1), (2) or (3) hold any promise, being limited, overly specific or even plain false; I nonetheless list them for completeness’ sake, to get them out of my head, and in case anyone sees something in them that could potentially be the seed of something better. I’m slightly more hopeful for solutions (4) and (5): they feel to me intuitively as if they may be leading to something good. But I’d need math that I don’t really have in order to prove or disprove that.

1. The James T. Kirk solution

To cut to the punchline: the James T. Kirk solution to Pascal’s Mugging is “What does God need with a starship?”

Say there’s a given prior probability P(X = Matrix Lord) that any given human being is a Matrix Lord with the power to inflict 3^^^3 points of utility/disutility. The fact that a being with such vast power seemingly wants five dollars (or a million dollars, or to be crowned Queen of Australia) actually makes it *less* likely that such a being is a Matrix Lord.

We don’t actually need the vast unlikely probabilities to illustrate the point. Let’s consider an AI with a security backdoor: it’s known for a fact that there’s one person in the world who has been given a 10-word passkey that can destroy the AI at will. (The AI is also disallowed from attempting to avoid this penalty by e.g. killing the person in question.)

So let’s say the prior probability of any given person being the key keeper in question is “1 in 7 billion”.

Now Person X says to the AI: “Hey, I’m the key keeper. I refuse to give you any evidence of this, but I’ll destroy you if you don’t give me 20 dollars.”

Does this make Person X more or less likely to be the key keeper? My own intuition tells me “less likely”.
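Here is a minimal Bayes-rule sketch of that intuition, in Python. The likelihoods are assumptions of mine, chosen only to illustrate the direction of the update: a genuine key keeper presumably has far better options than an evidence-free demand for 20 dollars, while pranksters and scammers produce such demands comparatively often.

```python
# Toy Bayesian update for the key-keeper scenario.
# All likelihoods below are assumptions chosen purely for illustration.

prior_keeper = 1 / 7e9            # prior that a random person holds the passkey

# How likely is the observed behaviour ("evidence-free demand for $20")
# under each hypothesis?
p_demand_given_keeper = 1e-6      # assumed: a real keeper rarely acts this way
p_demand_given_not_keeper = 1e-4  # assumed: pranksters/scammers do it far more often

posterior_keeper = (p_demand_given_keeper * prior_keeper) / (
    p_demand_given_keeper * prior_keeper
    + p_demand_given_not_keeper * (1 - prior_keeper)
)

print(f"prior:     {prior_keeper:.3e}")      # ~1.4e-10
print(f"posterior: {posterior_keeper:.3e}")  # ~1.4e-12: under these assumptions the
                                             # demand itself is evidence *against* the claim
```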

Unfortunately, one fundamental problem with the above argument is that at best it can tell us that it’s only the mugger-less version we need fear, which makes it useless for any serious purpose.

2. The presumption of unfriendliness

[Image: the Robot Devil]
This is most obviously the method that, in the examples above, poor Blaise should have used to defeat the devil’s claim of infinite badness. In a universe where ending ‘that song and dance’ can be X-bad, we should also consider the statement that it could be X-bad to NOT end it, or indeed X-good to end it. The devil (being a known malicious entity) is much more likely to be pushing Pascal towards whatever would result in the infinite badness. And indeed, in the fictional evidence provided by the song, that’s exactly what the devil achieves: harming Blaise Pascal and his ghost for an arbitrarily long time, by warning against that very outcome and using Pascal’s calculations against him.

Blaise’s tactic should have been not to obey the devil’s warning, nor even to do the opposite of his suggestion (since the devil could be smart enough to use reverse psychology), but rather to ignore him as much as possible: Blaise should end the song and dance at the point in time he would have done so had he not been aware of the devil’s statement.

All the above is obvious for cartoonish villains like the devil, known malicious agents whose utility function is opposed to ours; and a Matrix Lord who is willing to torture 3^^^3 people for the purpose of getting 5 dollars is probably no better, so better to just ignore them. But I wonder: can’t a similar logic be used in handling almost any agent whose utility function is merely different from one’s own (which is the vast majority of agents in mindspace)?

Moreover, a thought that occurs to me: doesn’t it seem likely that, for any supposed impact X, the greater the promised X, the less likely it is that two different minds are both positively inclined towards it? If so, shouldn’t the presumption of unfriendliness (incompatibility in utility functions) increase in like measure?
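A toy sketch of that question, in which both functional forms are assumptions of mine: suppose complying with a stranger’s demand pays +X when their utility function happens to agree with ours on the outcome and -X when it opposes ours. What then matters is how fast the probability of agreement falls towards 1/2 as X grows.

```python
# Toy model of the "presumption of unfriendliness" intuition.
# Assumption: complying pays +x if the claimant's utility function agrees with
# ours on the outcome and -x if it opposes ours.
# Assumption: the probability of agreement decays toward 1/2 as the claimed
# impact x grows. The decay rate below is made up purely for illustration.

import math

def p_agree(x: float) -> float:
    """Assumed probability that an arbitrary agent shares our sign on impact x."""
    return 0.5 + 0.5 / (1.0 + math.log10(x))   # decays toward 1/2 as x grows

def expected_value_of_complying(x: float) -> float:
    p = p_agree(x)
    return p * x - (1 - p) * x                 # = (2p - 1) * x

for x in [10.0, 1e6, 1e100]:
    print(f"claimed impact {x:.0e}: EV of complying = {expected_value_of_complying(x):.3e}")

# Note: with *this* assumed decay the gap (2p - 1) shrinks only logarithmically,
# so the EV still grows with x. The intuition only defeats the mugging if the
# presumption of unfriendliness strengthens at least as fast as the claimed impact.
```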

3. The Xenios Zeus

[Image: Pyramus and Thisbe]
This idea was inspired by the old myth about Zeus and Hermes walking around pretending to be travellers in need, to examine which people were hospitable and which were not. I think similar myths may exist about other gods in other mythologies.

Let’s say that each current resident of the world has a small chance (not necessarily the same small chance) of being a Matrix Lord willing to destroy the world and throw a temper tantrum that’ll punish 3^^^3 people if you don’t behave according to what he considers proper, much like each traveller has a chance of being Zeus.

One might think that you’d have to examine the data very closely to figure out which random person has the greatest probability of being Zeus, but that rather misses the moral of the myth, which isn’t “figure out who is secretly Zeus” but rather “treat everyone nicely, just in case”. If someone does not reveal themselves to be a god, they don’t expect to be treated like a god, but they might still expect human decency.

To put it in analogous LW terms, one might argue that an AI could treat the value systems of even Matrix Lords as roughly centered around the value system of human beings, so that by serving the CEV of humanity it would also have the maximum chance of pleasing (or at least not angering) any Matrix Lords in question.

Unfortunately, in retrospect I think this idea of mine is, frankly, crap. Not only is it overly specific and, again, a treatment of the surface problem rather than the core problem, but I realized it reaches the same conclusion as (2) by asserting the exact opposite: the previous idea made an assumption of unfriendliness, while this one assumes minds are centered around friendliness. If I’m using two contradictory ideas to reach the same conclusion, it probably indicates that I have written the bottom line first, not that I have made an actually useful argument.

So not much hope remains in me for solutions 1-3. Let’s go to 4.

4. The WWBBD Principle (What Would a Boltzmann Brain Do?)

If you believed with 99% certainty that you were a Boltzmann Brain, what should you do? The smart thing would probably be: whatever you would do if you weren’t a Boltzmann Brain. You can dismiss the hypotheses in which you have no control over the future, because it’s only the ones where you do have control that matter on a decision-theoretic basis.

Calculations of future utility have a discounting factor naturally built into them: the uncertainty of being able to calculate and control that future properly. So in a very natural manner (no need to program it in) an AI would prefer the same utility 5 seconds in the future to 5 minutes in the future, and 5 minutes in the future to 5 years in the future.

This looks at first glance like a time discount, but in actuality it’s an uncertainty discount. So an AI with very good predictive capacity would be able to discount future utility less, because the uncertainty would be less; but the uncertainty would never be quite zero.

One thing that may be missed in the above is that there’s uncertainty not only about reaching a certain future state of the world, but also about how it would affect all subsequent states. And the greater the impact, the greater that uncertainty about the future must be.
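A minimal sketch of such an uncertainty discount, where the per-step reliability q and the step counts are assumptions of mine: if the AI’s model of “what happens next” holds with probability q at each step, a payoff that only materialises after n steps is worth roughly q^n of its face value.

```python
# Toy sketch of an uncertainty discount (all parameters are assumptions).

def discounted_value(face_value: float, steps_ahead: int, q: float = 0.999) -> float:
    """Value of `face_value` utilons realised `steps_ahead` prediction steps away,
    assuming each step is modelled and controlled correctly with probability q."""
    return face_value * (q ** steps_ahead)

# The same payoff, promised further and further out, is worth less and less:
for steps in [5, 300, 10_000]:
    print(f"{steps:>6} steps ahead: {discounted_value(1.0, steps):.3e}")

# If an outcome of astronomical impact also requires an astronomically long and
# fragile causal chain, the compounded uncertainty pushes back against the
# 3^^^3 factor: which is exactly the conjecture raised in the text.
```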

So even as the thought of 3^^^3 lives outweighs the tiny probability, couldn’t it be that a similar factor punishes it in the opposite direction, especially when dealing with hypotheses in which the AI would have no further control? I don’t know. Bring in the mathematicians.

5. The Law of Visible Impact (a.k.a. The Generalized Hanson)

Robin Hanson’s suggested solution to Pascal’s Mugging has been the penalization of “the prior probability of hypotheses which argue that we are in a surprisingly unique position to affect large numbers of other people who cannot symmetrically affect us.”

I have to say that I find this argument unappealing and unconvincing. One problem I have with it is that it seems to treat the concept of “person” as ontologically fundamental; it’s an objection I also kinda have against the Simulation Argument and the Doomsday Argument.

Moreover, wouldn’t this argument cease to apply if I were merely witnessing the Pascal’s Mugging take place, so that as a mere witness I should be hoping for the mugged entity to submit? That sounds nonsensical.

But I think Hanson’s argument can be modified, so here I’d like to offer what I’ll call the Generalized Hanson: penalize the prior probability of hypotheses which argue for the existence of high-impact events whose consequences nonetheless remain unobserved.

If life’s creation is easy, why aren’t we seeing alien civilizations consuming the stars? Therefore most likely life’s creation isn’t easy at all.

If the universe allowed easy time-travel, where are all the time-travellers? Hence the universe most likely doesn’t allow easy time-travel.

If Matrix Lords exist that are apt to create 3^^^3 people and torture them for their amusement, why aren’t we being tortured (or witnessing such torture) right now? Therefore most likely such Matrix Lords are rare enough to nullify their impact.

In short, the higher the impact of a hypothetical event, the more evidence we should expect to see of it in the surrounding universe; the absence of such evidence therefore counts against the hypothesis in proportion to the extent of the hypothetical impact.
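Here is one way the penalty might be rendered in Bayesian terms, as a toy sketch; the functional form of P(no trace | H) and all the numbers are assumptions of mine, meant only to show the direction of the effect.

```python
# Toy Bayesian rendering of the "Law of Visible Impact" (all numbers assumed).
# A hypothesis H posits an event of a given impact. Assumption: the bigger the
# impact, the less likely it is that we would observe *no* trace of it, so
# P(no trace | H) shrinks as impact grows, while under not-H silence is the default.

def posterior_given_no_trace(prior: float, impact: float) -> float:
    p_no_trace_given_h = 1.0 / (1.0 + impact)   # assumed: big impacts are hard to hide
    p_no_trace_given_not_h = 0.99               # assumed: silence is unsurprising otherwise
    numerator = p_no_trace_given_h * prior
    return numerator / (numerator + p_no_trace_given_not_h * (1 - prior))

prior = 1e-6
for impact in [1.0, 1e3, 1e9]:
    print(f"impact {impact:.0e}: posterior = {posterior_given_no_trace(prior, impact):.3e}")

# The larger the claimed impact, the harder the silence of the surrounding
# universe cuts against the hypothesis; whether this penalty can keep pace
# with a 3^^^3-sized payoff is exactly the open mathematical question.
```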

I’m probably expressing the above intuition quite badly, but again: I hope someone with actual mathematical skills can take it and make it into something useful, or tell me that it’s not useful as an anti-Pascal Mugging tactic at all.