Adversarial Priors: Not Paying People to Lie to You

Reply to Desiderata for an Adversarial Prior

Assertion: An Ideal Agent never pays people to lie to them.

This seems sensible: only a very foolish person would knowingly incentivise dishonesty in others. But what does it actually mean in practice?

  1. You can’t use unverifiable information obtained from a single person or from a faction of possibly-conspiring people in any way that benefits that person or faction in the hypothetical where the information is false. Otherwise, they’re incentivised to give you the unverifiable and false information to motivate you to do that, and so you’d be paying them to lie to you.

  2. You can’t use any information, even verifiable information, obtained from a single person or from a faction of actually-conspiring people in any way that harms that person or faction in the hypothetical where it is true. Otherwise they’d just not tell you, and you’d be paying them to dishonestly shut up.

If everyone followed (2), you could freely go around saying the truth no matter what and expect no personal negative consequences. This would maximise public knowledge (your information can still be used to the benefit or detriment of other people), and people would be better off:

  • Criminals would confess their crimes in detail, in exchange for a reduced sentence if they do get caught; the payment takes this form so as not to actually incentivise crime. This information can then be used to help protect against, catch, or convict other criminals; it just can’t be used against the specific person who confessed it in the first place.

  • To the extent that criminals want to form groups that can work together, they can formally notify the police of their criminal conspiracies and then any information from a member can’t be used against the rest (unless a member formally betrays the conspiracy, in which case their testimony is fair game again).

If everyone followed (1), lying would be pointless because even though everyone believes you, they’ll never believe you in a way that corresponds to doing something that benefits you. This condition has a lot of much stranger consequences:

  • Claims made by salespeople must be ignored unless you’ve got a way to actually test the claim, and if you can only test it in the future (after you’ve already bought the product), you’re honor-bound to return it and demand a refund if anything claimed about the product (that your decision to buy it depended on) turns out to be false.

  • School fire-alarms pulled by an unknown person who possibly wants school to be interrupted must be ignored, unless you’ve got a punishment system set up with sufficient probability of figuring out who did it and hurting them enough that they aren’t better off pulling the alarm. If some people really want school to be interrupted, it’s sufficiently hard to figure out who did it, or you’re unwilling to torture children, this means that school fire-alarms are basically always ignored. The odds that someone is lying versus there actually being a fire are immaterial.

  • Pascal’s mugger is ignored regardless of your prior probability he’s telling the truth.

  • Unless you can personally verify enough of the theory about AI alignment being hard, and what might be done about that, you must refuse to donate, to protect against the possible conspiracy of all AI safety researchers working together to pretend a problem exists to get donations. If that example is from the wrong tribe, consider living in medieval Europe and wondering whether there really is an afterlife or every single theologian is conspiring to trick you into donating to the church.

  • Unless you can personally verify that the entire population of the world is not conspiring against you to trick you about the fair prices of services, and therefore extract valuable labour from you, then you should refuse to ever do anything that might benefit anyone else.

This seems particularly terrible, and the whole “refusing to be exploitable regardless of prior probability” approach seems like a step way too far. It’s the kind of logic that leads to saying “since the murderer won’t confess, and I want the murderer to be executed, we’ll just have to execute everyone to make sure he didn’t benefit by not confessing”. That’s a lot of utility you’re destroying just to avoid ever paying people to lie to you.

If we consider the two piles of utility:

  • Value I expect to lose by refusing to ever trust anyone, even though the worst most adversarial situations I can think of are almost never actually the case.

  • Value I expect to lose by failing to be completely inexploitable at all times, and therefore at least sometimes being exploited.

It would seem like there’s some ideal resistance-to-exploitation threshold that minimises total expected utility lost. If you are sufficiently unable to verify things yourself, the price of ever believing anyone is setting a threshold that lets you believe them, along with the expected exploitation you’ll be exposed to as a result.
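As a rough illustration of that trade-off, here is a toy sketch in Python. Everything in it is an assumption made for the example rather than part of the argument above: the `Offer` type, the numbers, and especially the idea that each claim comes with a fixed, known probability of being honest. In a genuinely adversarial setting, liars would adjust to whatever threshold you pick, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """A hypothetical unverifiable claim someone wants you to act on."""
    p_honest: float  # assumed probability that the claim is true
    gain: float      # value you gain by acting on it if it is true
    loss: float      # value you lose by acting on it if it is false


def expected_loss(offers, threshold):
    """Return the two 'piles' of expected lost value for a trust threshold.

    Offers with p_honest below the threshold are refused; the rest are
    acted on.
    """
    # Pile 1: value forgone by refusing claims that were probably true.
    lost_by_distrust = sum(
        o.p_honest * o.gain for o in offers if o.p_honest < threshold
    )
    # Pile 2: value lost by acting on claims that turn out to be false.
    lost_by_exploitation = sum(
        (1 - o.p_honest) * o.loss for o in offers if o.p_honest >= threshold
    )
    return lost_by_distrust, lost_by_exploitation


def best_threshold(offers, candidates):
    """Pick the candidate threshold with the smallest total expected loss."""
    return min(candidates, key=lambda t: sum(expected_loss(offers, t)))


# Made-up offers, purely for illustration.
offers = [
    Offer(p_honest=0.99, gain=10, loss=10),
    Offer(p_honest=0.90, gain=10, loss=50),
    Offer(p_honest=0.50, gain=10, loss=20),
    Offer(p_honest=0.10, gain=10, loss=100),
]
# 0.0 trusts everyone; 1.01 trusts no one.
candidates = [0.0, 0.25, 0.75, 0.95, 1.01]

for t in candidates:
    distrust, exploited = expected_loss(offers, t)
    print(f"threshold={t:.2f}  distrust={distrust:.1f}  exploited={exploited:.1f}")
print("best:", best_threshold(offers, candidates))
```

With these particular made-up numbers, both extremes (trusting everyone and trusting no one) lose more in expectation than an intermediate threshold does, which is all the sketch is meant to show.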

Naturally, other people can’t help you pick this threshold except with arguments you can personally verify, because they’re obviously incentivised to convince you to be more trusting so that you’ll believe (and be exploitable by) them.