Building Trust in Strategic Settings

In the previous post, we saw some different types of claims that we might want software systems to automatically evaluate. Ideally, we’d like our software systems to incorporate accurate and useful information, and disregard inaccurate or irrelevant information.

In this post, I’ll describe some factors that can help our systems build accurate models of the world, using information from their own sensors and from other systems, even when those other systems were designed for different purposes, model the world in different ways, and were built by designers with different preferences about how the world should be.

Translating Between Ontologies

For software systems, all factual claims start out in an ontology that looks like “this is the data that came in from my sensor channels.” For a thermostat, that looks like “this is the signal I received on the channel that my thermometer should be hooked up to.” An adversary might have chopped off the thermometer, and be feeding my thermostat false signals, but they are feeding those false signals into channels on my thermostat where the thermometer signals are supposed to go. It is a further and separate factual claim that “all my sensors are calibrated and configured correctly, and are feeding signals to their designated channels as expected.”

For many systems, like household thermostats, it’s fine to take sensor data at face value. But other times we’ll want our systems to have ways of handling the possibility that they’re receiving inaccurate information, or even being actively misled by an adversary. When it’s clear from context, I’ll drop the qualifier that a software system only ever receives alleged sensor data on the corresponding channels. But a system can honestly report “this came in on the channel designated for my sensor” without further claiming that the information came from a sensor that is working properly and not being misled.
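A minimal sketch of that distinction in Python; the class and field names are illustrative, not taken from any particular system:

```python
from dataclasses import dataclass

@dataclass
class ChannelReading:
    """What the system can honestly report: a value arrived on a channel."""
    channel: str      # e.g. "thermometer_input"
    raw_value: float  # the signal as received, whatever its true origin

@dataclass
class SensorClaim:
    """A further, separate claim: a correctly working sensor produced the value."""
    reading: ChannelReading
    sensor_calibrated: bool  # asserted, not directly observed
    sensor_untampered: bool  # asserted, not directly observed

# A thermostat can always report the first; the second requires extra evidence.
reading = ChannelReading(channel="thermometer_input", raw_value=21.5)
```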

Once we have sensor data, we can apply an epistemology to translate from that sensor data ontology to one that better suits our needs. Like taking camera footage and using that to reconstruct a 3D model of a scene and the objects in it. We could also take that same camera footage and attempt to draw rectangles around the faces we recognize.

Suppose that we are designing a software system, where the most natural way to represent the world is in terms of 3D objects in a 3D scene. And we want it to exchange information with a system that represents the world in terms of rectangles and identity references. How can these systems communicate?

In the easiest case, our systems can simply exchange the footage snippets that caused them to form the beliefs they did. Our system can then apply its own epistemology and come to its own conclusions, in its own native ontology. But can we directly translate claims from one ontology to another?

Sure. In some cases, like converting metric to imperial, this is easy. In general, writing a bridge between ontologies could require busting out Bayes’ Theorem. This also works for ontologies that are based on different types of sensor data. The same way-the-world-is gives rise to both the camera and microphone data in the same room. Knowing what a microphone heard can be used to make inferences about what a camera would have seen.
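As a toy illustration of such a bridge (all numbers invented for the example), suppose our 3D-scene system receives the claim “a face rectangle was detected” from the other system, and wants to update its belief that a person is present in the scene:

```python
# Toy Bayesian bridge between two ontologies (all probabilities are made up).
# Their ontology: "a face rectangle was detected in the frame."
# Our ontology:   "a person is physically present in the 3D scene."

p_person = 0.10                 # our prior that a person is present
p_rect_given_person = 0.80      # their detector usually fires on a real person
p_rect_given_no_person = 0.05   # false positives are rare

# P(rectangle claim), by the law of total probability
p_rect = (p_rect_given_person * p_person
          + p_rect_given_no_person * (1 - p_person))

# Bayes' Theorem: P(person | rectangle claim)
p_person_given_rect = p_rect_given_person * p_person / p_rect
print(f"{p_person_given_rect:.2f}")  # 0.64
```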

If tackling this problem in full generality sounds like a big pain, that’s because it is. The number of counterfactuals that Bayesian reasoning must consider generally grows exponentially with the number of features in your model, unless you can find some kind of structure in your problem domain that keeps things under control. “In the worst case we can apply Bayes’ Theorem” is fine for modeling ontologies mathematically. But in practice we would much rather just run our own epistemology directly on their sensor data, rather than compute backwards from the output of their epistemology. There might be good reasons why our system can’t simply look at their sensor data, but we can use techniques from cryptography to get many of the same benefits without releasing sensitive information.
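To make that blow-up concrete: with n binary features and no exploitable structure, a fully general Bayesian model has to entertain 2^n joint ways-the-world-could-be.

```python
# Joint hypotheses over n binary features, absent exploitable structure.
for n in (10, 20, 30):
    print(n, 2 ** n)  # 1024, then 1048576, then 1073741824
```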

Another reason to try to coordinate on using standardized ontologies is that there are economies of scale to using standardized data formats. If you’re starting a new software project, using a popular format means being immediately compatible with every system that uses that format. Using a custom ontology, and a corresponding custom epistemology to form beliefs in that ontology, creates a new little island of beliefs inside your system’s head. It takes a bridge to get evidential impact onto the island, and a different bridge to get evidential impact off the island. And it takes work for other systems and designers to understand how your system’s epistemology gives rise to its beliefs.

Note that if you know a system’s epistemology, that epistemology doesn’t need to be rational in order for its honest claims to be Bayesian evidence about how the world is. Rationality just lets us skip modelling what sensor data it must have seen to form the beliefs that it did. An irrational belief isn’t contagious to a rational mind, but it is still evidence: a different belief would have been there instead if the sensor data received had been different.

If we know how a potentially-dishonest system maps from sensor data to claims, even dishonest claims are no problem if we can apply Bayesian reasoning. Just ignore the semantic content of the claim itself and focus on which ways-the-world-could-be make that claim more or less likely. And of course being uncertain of exactly how a system maps from sensor data to claims is no problem either. That’s just one more way-the-world-could-be, and different claims make different mappings more or less likely.
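Here is a minimal sketch of that update, again with invented numbers: treat the claim as an observation whose likelihood depends both on the world and on which sensor-data-to-claims mapping the claimant is using, then marginalize over the mappings.

```python
# Update on a claim "the room is occupied" from a possibly-dishonest system.
# We marginalize over two hypotheses about its sensor-data-to-claim mapping.
p_occupied = 0.5

mappings = {
    # name: (prior on this mapping, P(claim | occupied), P(claim | empty))
    "honest reporter": (0.7, 0.95, 0.05),
    "always says occupied": (0.3, 1.00, 1.00),  # claim carries no information
}

def likelihood(occupied: bool) -> float:
    """P(claim | world), marginalizing over how the claimant produces claims."""
    return sum(p_map * (p_if_occ if occupied else p_if_empty)
               for p_map, p_if_occ, p_if_empty in mappings.values())

posterior = (likelihood(True) * p_occupied) / (
    likelihood(True) * p_occupied + likelihood(False) * (1 - p_occupied))
print(f"{posterior:.2f}")  # 0.74: the claim is still evidence, just weaker
```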

Credible Signals

But again, applying Bayesian reasoning to mappings like “sensor data to claims” is often impractical, or would take more resources than the information represented by a potentially-dishonest claim is worth. In the absence of incentives favoring honesty, many claims are just cheap talk and the best thing our system can do is simply ignore them.

However! Incentives favoring honesty might be naturally present, or can be deliberately created! This is why an auctioneer, even one with no interest in the revenue from an auction, who only wants to award an item to the party that would benefit most from it, still has a reason to ask bidders to pay a cost in order to be declared the winner. We say that a mechanism is dominant-strategy incentive-compatible when honesty is an optimal strategy no matter what other participants are doing, and many mechanisms, auctions among them, can be designed to have this property.
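One textbook example of a dominant-strategy incentive-compatible mechanism is the second-price (Vickrey) auction: the highest bidder wins but pays the second-highest bid, which makes bidding your true value optimal no matter what anyone else does. A minimal sketch:

```python
def second_price_auction(bids: dict[str, float]) -> tuple[str, float]:
    """Second-price (Vickrey) auction: highest bidder wins, pays second-highest bid.

    Bidding your true value is a dominant strategy: shading your bid can only
    lose you a profitable win, and overbidding can only win you an unprofitable one.
    """
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

# The winner is the party claiming (via their bid) to value the item most,
# and the required payment is what makes that claim credible.
print(second_price_auction({"alice": 120.0, "bob": 90.0, "carol": 75.0}))
# ('alice', 90.0)
```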

Another favorite honesty-eliciting mechanism in rationalist circles is the prediction market! A trusted party is designated to measure an outcome and pay out contracts depending on the measured outcome. The price of these contracts is an aggregate signal about the beliefs of the participants. Individual participants have an incentive to buy or sell these contracts depending on whether they think the price is too low or too high, and doing so moves the market price in the direction of that participant’s belief. In expectation, a participant makes money exactly when the beliefs they act-as-if-they-have are more accurate than the market’s, and loses money when they are less accurate. And so according to each participant’s own expectations, at least as far as the endogenous incentives of the prediction market itself are concerned, their best strategy is to act-as-if-they-have the beliefs they actually do have. In other words, to bet honestly, after updating on the market price and the trading activity of other participants, which are themselves evidence about what the future measured value will be.
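To sketch the core incentive with hypothetical numbers: a contract pays $1 if the event happens, so a trader who believes the probability is q expects a profit of q - p per contract bought at price p, and p - q per contract sold, and is therefore best off trading in the direction of their honest belief.

```python
def expected_profit(action: str, price: float, my_belief: float) -> float:
    """Expected profit per contract that pays $1 if the event occurs.

    Buying at price p when you believe probability q: expect q - p.
    Selling at price p when you believe probability q: expect p - q.
    """
    if action == "buy":
        return my_belief - price
    if action == "sell":
        return price - my_belief
    return 0.0  # abstain

price, my_belief = 0.60, 0.75
best = max(["buy", "sell", "abstain"],
           key=lambda a: expected_profit(a, price, my_belief))
print(best)  # "buy": the price looks too low relative to my honest belief
```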

A bet is a tax on bullshit. It’s one thing for me to say that an idea has promise, and quite another for me to invest a large fraction of my own wealth in a venture which only generates a return if that idea actually pans out. If two honest parties disagree within the same ontology, they can bet about how future sensor data will turn out. A willingness to put one’s money where one’s mouth is credibly signals that a party genuinely believes what they claim.

The ability to say to a system “show me the data” is a great fallback when two systems’ world models are too hard to bridge. This ability to spot-check any claim can also help to create incentives favoring honesty. During conditions of high trust, our systems can take attestations at face value most of the time, recording which attestations they receive. If trust weakens, or just as a background due diligence task, our systems can audit attestations they’ve received to see if they contain errors or likely deceptions.
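One way this trust-record-audit loop might look in code; the function names and attestation fields here are hypothetical, not from any particular system:

```python
import random

# Hypothetical sketch: accept attestations at face value, keep a log,
# and occasionally re-derive a sampled claim from the underlying data.
attestation_log = []

def accept(attestation: dict):
    attestation_log.append(attestation)
    return attestation["claim"]  # taken at face value for now

def audit(sample_size: int, recompute) -> list:
    """Spot-check a random sample; return attestations that fail re-derivation."""
    sample = random.sample(attestation_log, min(sample_size, len(attestation_log)))
    return [a for a in sample if recompute(a["raw_data"]) != a["claim"]]
```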

Making claims publicly can also help to align incentives towards honesty. Others can publish criticisms, and many criticisms, like “a computation underlying this claim was performed incorrectly and here’s the step where it went wrong,” are easy to verify automatically. We can use tools like smart contracts to post bounties on solutions to computational problems in general, and on proofs of misbehavior by the systems we trust in particular. A giant bounty pool on proofs of misbehavior, with a long track record of not having been claimed, is a credible signal that a system is trustworthy.
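For instance, if a system publishes the intermediate states of its computation alongside a claim, a critic only needs to name a single step, and anyone can re-execute just that step to check the accusation. A hypothetical sketch, not tied to any particular bounty protocol:

```python
def verify_misbehavior_proof(trace: list, step_fn, bad_step: int) -> bool:
    """Check a claimed proof of misbehavior by re-running one step of a published trace.

    `trace` is the list of intermediate states the system published,
    `step_fn` is the agreed-upon transition function, and `bad_step` is the
    index the critic says was computed incorrectly.
    """
    return step_fn(trace[bad_step]) != trace[bad_step + 1]

# Example: a system claims to have repeatedly doubled a counter, but slipped at step 2.
published_trace = [1, 2, 4, 9, 18]
print(verify_misbehavior_proof(published_trace, lambda x: 2 * x, 2))  # True: misbehavior
```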

Privacy and Deception

There are plenty of circumstances where I endorse software systems withholding information. Such as when that information is private, or hazardous to disclose. There are also times when I endorse deception, such as performing a feint during a fight with a hostile stranger, or misleading an enemy about the disposition of one’s military forces. My current approach to honesty is exactly the one put forward by Yudkowsky, and I think that standard can be applied to software systems as well.

In particular, I think that deception should only be performed by software in very specific circumstances, deliberately chosen and documented by their designers. And that designers should never engage in deception about under what circumstances their software systems will engage in deception.

Cryptography

How do web browsers know that they’re interacting with the web server that their user is expecting? How do they know whether the information being exchanged is being kept secret from eavesdroppers? Cryptography plays a major role in the epistemologies built into modern software systems, and I expect it to only play a larger role as technology develops.

Some modern cameras will now emit digital signatures alongside the images they take, attesting that this image really came from a camera with enough tamper-resistance that it should be difficult to extract the private keys necessary to forge such a signature. Trusted Platform Modules can perform a similar attestation function when it comes to questions like “did my computer follow the boot sequence I expected, or could I be using an operating system that has been maliciously subverted?”

I’ll start referring to claims accompanied by cryptographic assurances of their legitimacy as attestations. The exact forms attestations take, and the cryptographic assurances provided, depend on the intended application. Do we want to be able to publish a proof that a party engaged in misbehavior? Or do we want private communications that enable the intended recipient to know that a message came from a specific party, without being able to prove that to any third party?
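As a concrete sketch of the simplest kind of attestation, a claim plus a digital signature over it, using the Python cryptography library’s Ed25519 support; in a real camera or TPM the private key would live inside tamper-resistant hardware, and the claim format here is invented:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# In a real device, this key lives inside tamper-resistant hardware.
device_key = Ed25519PrivateKey.generate()
device_public_key = device_key.public_key()

claim = b'{"channel": "image_sensor", "sha256_of_image": "..."}'
signature = device_key.sign(claim)  # the attestation: claim + signature

# A verifier that trusts the device's public key can check the attestation.
try:
    device_public_key.verify(signature, claim)
    print("attestation verified")
except InvalidSignature:
    print("attestation rejected")
```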

There are exciting techniques from the field of cryptography, such as zero-knowledge proofs and secure multi-party computation, which enable system designers to very precisely engineer the flow of information between parties. This is helpful if, for example, you’re designing a software system that needs to share some, but not all, of the information it knows with other systems.

The Trust Graph

I think that in this context, we can operationalize epistemic trust as “the degree to which one system’s claims influence another system’s beliefs.” Cryptographic assurances of data integrity and authenticity help enable trust. Incentives favoring honesty help enable trust. Understanding exactly how another system works, and what leads it to make the claims it does (epistemic legibility), helps enable trust.

Towards one end of the trust spectrum, a system might trust another completely, letting its claims overwrite whatever previous belief the system might have held. This type of epistemic corrigibility is helpful when we want users of a software system to specify some or all of the beliefs that the system should hold, such as adding or removing a certificate authority from the system’s collection of entities that can serve as a valid anchor for a chain of trust.

Towards the other end of the trust spectrum, a system’s claims can simply be ignored. This seems like a very reasonable default, and I recommend an architecture which only adds other systems as trusted sources of information through a well-reasoned and endorsed decision. This can be done manually by system designers and users, and might also be possible to do automatically using tools like reputation systems or formal verification.

These trust relationships define a directed graph, which shows the paths along which information can flow between systems. Trust isn’t always transitive (I might trust a system without trusting all the systems it trusts), but as long as systems cite their sources, I can pick and choose among their attestations to find the ones I trust. I expect this trust graph to have continents and islands, where beliefs are contagious among systems that trust each other.
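A minimal sketch of such a graph, with invented system names and trust weights: directed edges determine whose attestations a system lets influence its beliefs, and cited sources let it accept only the parts it trusts.

```python
# Directed trust graph: trust["A"]["B"] is how much A lets B's claims move its beliefs.
trust = {
    "my_system":  {"camera_hub": 0.9, "aggregator": 0.5},
    "aggregator": {"camera_hub": 0.8, "random_feed": 0.7},
}

def accepted_weight(receiver: str, attestation: dict) -> float:
    """Weight an attestation by direct trust in its cited original source.

    Trust isn't assumed transitive: we look up the cited source directly,
    not the chain of systems the attestation passed through.
    """
    source = attestation["cited_source"]
    return trust.get(receiver, {}).get(source, 0.0)  # untrusted sources default to 0

relayed = {"claim": "room 12 is occupied", "cited_source": "camera_hub",
           "relayed_by": "aggregator"}
print(accepted_weight("my_system", relayed))  # 0.9: accepted because of its source
```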

Up next: we see how all these pieces fit together into a distributed strategic epistemology.