An Ontology for Strategic Epistemology
To apply game theory, we need a model of a strategic interaction, and that model needs to include data about what agents are present, what they care about, and what policies they can implement. Following the same aesthetic as the Embedded Agency sequence, I think we can build world models which, at a low level of abstraction, do not feature agents at all. We can then apply abstractions to that model to draw boundaries around a collection of parts, and draw a system diagram showing how those parts come together to form an agent.
The upshot of this sequence is that we can design systems that work together to model the world and the strategic landscape it implies, even when those systems were designed for different purposes, using different ways of modeling the world, by different designers with different visions of how they would like the world to be.
Automating Epistemology
A software system doesn’t need to model the world in order to be useful. Compilers and calculators don’t model the world. But some tasks are most naturally decomposed as “build up a model of the world, and take different actions depending on the features of that model.” A thermostat is one of the simplest systems I can think of that meaningfully implements this design pattern: it has an internal representation of the outside world (“the temperature” as reported by a thermometer), and takes actions that depend on the features of that representation (activate a heater if “the temperature” is too low, activate a cooler if “the temperature” is too high).
A thermostat’s control chip can’t directly experience temperature, any more than our brains can directly experience light bouncing off our shoelaces and hitting our retinas. But there are lawful ways of processing information that predictably lead a system’s model of the world to be correlated with the way the world actually is, and that lead a thermostat to register “the temperature is above 78 °F” usually only in worlds where the temperature really is above 78 °F.
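As a toy illustration of this design pattern, here is a minimal sketch in Python. Everything in it, the function names, the 68–78 °F comfort band, and the stubbed sensor reading, is an illustrative assumption rather than anything from a real thermostat’s firmware.

```python
# A minimal sketch of the thermostat pattern: build an internal representation
# of the world from a sensor, then act on features of that representation.

def read_thermometer() -> float:
    # Stand-in for a real sensor. The control chip never experiences the
    # "true" temperature, only what the thermometer reports.
    return 80.5

def thermostat_step(too_cold: float = 68.0, too_hot: float = 78.0) -> str:
    # 1. Build the (tiny) world model: a single believed temperature.
    believed_temperature = read_thermometer()

    # 2. Act on features of that model, not on the world directly.
    if believed_temperature > too_hot:
        return "activate cooler"
    if believed_temperature < too_cold:
        return "activate heater"
    return "do nothing"

print(thermostat_step())  # "activate cooler", given the stubbed reading of 80.5
```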
Given the right conditions, well-founded beliefs are also contagious.
If your tongue speaks truly, your rational beliefs, which are themselves evidence, can act as evidence for someone else. Entanglement can be transmitted through chains of cause and effect—and if you speak, and another hears, that too is cause and effect. When you say “My shoelaces are untied” over a cellphone, you’re sharing your entanglement with your shoelaces with a friend.
Therefore rational beliefs are contagious, among honest folk who believe each other to be honest.
A software system’s epistemology is the process it uses to take in information, from sensors and other systems, and build up a model of the world. We would like to use the contagious property of well-founded, honestly reported beliefs to help us distribute the task of modeling a world that is too big for any one sensor to perceive, without having that model distorted too much by things like measurement error or deliberate deception.
There are at least three types of claim that I think are relevant to the design of distributed autonomous software systems: logical/computational claims, counterfactual claims, and factual claims.
Logical/Computational Claims
A logical claim looks like “Proof P is a sequence of valid deductions from the set of axioms A to the theorem T.”
A computational claim looks like “Program P, when run on input I, produces the output O.”
It turns out that these are isomorphic! That is, given a logical claim, there is an algorithm that will produce the corresponding computational claim, and vice versa. We can work in whichever language is more natural for any given problem we’re trying to solve, and convert back and forth as needed.
These claims can be studied in their own right, and they also form the connective tissue of an epistemology. When one performs a deduction or inference, or any computation at all, there is an implied claim that the computation was done correctly. Logical and computational claims also require only a properly working computer to evaluate; no sensor data is needed. If one’s computer is powerful enough, one can directly check every step in a proof oneself, or directly run a program on a given input and check its output.
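When the computer is powerful enough, “directly run the program and check its output” really is that literal. A minimal sketch, where `check_computational_claim` is an illustrative name rather than a library function:

```python
# Checking a computational claim of the form
# "program P, when run on input I, produces the output O" by re-execution.

def check_computational_claim(program, inputs, claimed_output) -> bool:
    """Re-run `program` on `inputs` and compare against the claimed output."""
    return program(*inputs) == claimed_output

# Example: someone claims that sorting [3, 1, 2] yields [1, 2, 3].
assert check_computational_claim(sorted, ([3, 1, 2],), [1, 2, 3])
```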
Many computational problems we care about seem to have the property that solutions are hard to find but easy to verify. It’s easier to check that a Sudoku puzzle has been solved correctly than it is to find such a solution. We can leverage this asymmetry and use much faster software for checking logical and computational claims than is needed to produce a valid claim in the first place.
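As a concrete sketch of that asymmetry, here is a checker for a claimed Sudoku solution. Verification is a few set comparisons; producing the solution from a sparse puzzle generally requires search.

```python
# Checking a claimed 9x9 Sudoku solution is cheap: every row, column, and
# 3x3 box must contain exactly the digits 1 through 9.

def is_valid_sudoku_solution(grid) -> bool:
    digits = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[3 * br + r][3 * bc + c] for r in range(3) for c in range(3)]
        for br in range(3) for bc in range(3)
    ]
    return all(set(unit) == digits for unit in rows + cols + boxes)

solved = [
    [5, 3, 4, 6, 7, 8, 9, 1, 2],
    [6, 7, 2, 1, 9, 5, 3, 4, 8],
    [1, 9, 8, 3, 4, 2, 5, 6, 7],
    [8, 5, 9, 7, 6, 1, 4, 2, 3],
    [4, 2, 6, 8, 5, 3, 7, 9, 1],
    [7, 1, 3, 9, 2, 4, 8, 5, 6],
    [9, 6, 1, 5, 3, 7, 2, 8, 4],
    [2, 8, 7, 4, 1, 9, 6, 3, 5],
    [3, 4, 5, 2, 8, 6, 1, 7, 9],
]
assert is_valid_sudoku_solution(solved)
```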
What if our computer still isn’t powerful enough to directly check a purported proof or computation? It turns out that there are protocols analogous to Paul Christiano’s AI Safety via Debate proposal. A proof or computation can be broken up into a sequence of steps with a local validity property that even a very weak computer can check. One system publishes a claim, and then every system in a wide pool is free to say either “I agree” or “I disagree with this specific step.” Anyone with a properly working computer can check a single step, and thereby know who is being honest. This protocol has the very nice security property that only one honest participant needs to point out the flaw in an invalid claim for the system as a whole to be convinced only by valid claims.
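Here is a deliberately toy sketch of the shape of such a protocol; it is not Christiano’s actual proposal, and the transition rule and data structures are invented for illustration. The property it demonstrates is that a single honest checker who finds one locally invalid step can dispute the whole claim.

```python
# A claim is published as a chain of small steps. Each step is cheap to check
# in isolation, so even a weak verifier can audit any one of them and dispute
# a specific step if it is invalid.

from dataclasses import dataclass

@dataclass
class Step:
    before: int  # claimed state before this step
    after: int   # claimed state after this step

def locally_valid(step: Step) -> bool:
    # Toy transition rule for illustration: each step must add exactly 1.
    return step.after == step.before + 1

def first_disputed_step(steps):
    """Return the index of the first invalid step, or None if all check out.
    Any participant with a properly working computer can run this check."""
    for i, step in enumerate(steps):
        if not locally_valid(step) or (i > 0 and step.before != steps[i - 1].after):
            return i
    return None

# A claim that "0 becomes 4 in only three steps," with a flaw hidden in the middle:
claim = [Step(0, 1), Step(1, 3), Step(3, 4)]  # the second step is invalid
assert first_disputed_step(claim) == 1        # one honest checker catches it
```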
Counterfactual Claims
A counterfactual claim looks something like “If this premise were true, then this conclusion would also be true.”
One sub-problem of Embedded Agency is giving a satisfying account of counterfactuals, at least when it comes to agency. An agent is a part of reality that chooses among policies that it “can” implement. That policy defines the agent’s method of signal processing: how it transforms a stream of inputs into a stream of outputs, possibly changing internal state along the way, possibly not.
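One way to make “policy” concrete, under these assumptions, is as a function from (internal state, observation) to (new state, action); a stateless policy simply passes its state through unchanged. The names below are illustrative, not a standard interface.

```python
# A policy transforms a stream of inputs into a stream of outputs,
# possibly carrying internal state along the way.

from typing import Callable, Iterable, Iterator, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")
State = TypeVar("State")

# A (possibly stateful) policy maps (state, observation) -> (new state, action).
Policy = Callable[[State, Obs], Tuple[State, Act]]

def run_policy(policy: Policy, initial_state: State,
               observations: Iterable[Obs]) -> Iterator[Act]:
    """Drive a policy with a stream of observations, yielding its actions."""
    state = initial_state
    for obs in observations:
        state, action = policy(state, obs)
        yield action

# Example: a stateful policy that only alarms after two consecutive high
# readings, so a single noisy sensor reading is not enough to act on.
def alarm_policy(consecutive_high, reading):
    consecutive_high = consecutive_high + 1 if reading > 100 else 0
    return consecutive_high, ("alarm" if consecutive_high >= 2 else "ok")

print(list(run_policy(alarm_policy, 0, [95, 120, 130, 90])))
# ['ok', 'ok', 'alarm', 'ok']
```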
Counterfactuals are also a central data structure in Bayesian reasoning. Given a counterfactual description of how different ways-the-world-could-be lead to different observations being more or less likely, one can use Bayes’ Theorem to reason backwards from an observation to understand how this should shift the probabilities we assign to different ways-the-world-could-be.
One important type of counterfactual claim is the kind that can be independently verified by running an experiment. Scientific knowledge generally takes this form: “if you perform this experiment, this is the probability distribution over outcomes I expect.” It’s actually possible to define a world model entirely in terms of the expected distributions over outcomes of experiments one could perform. Bayes’ Theorem defines the lawful relationship between counterfactual claims like “if you look in that box you’ll see a diamond” and factual claims like “there is a diamond in that box.”
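As a toy worked example of that relationship, with made-up numbers: suppose looking in the box almost always reveals a diamond when one is there, and only rarely seems to reveal one when it isn’t.

```python
# Bayes' Theorem connecting the counterfactual claim "if you look in the box
# you'll see a diamond" to the factual claim "there is a diamond in the box."
# All of the probabilities here are illustrative assumptions.

prior_diamond = 0.5            # P(diamond in box) before looking
p_see_given_diamond = 0.99     # looking almost always reveals a present diamond
p_see_given_no_diamond = 0.02  # occasional false positive (trick of the light)

# You look and see a diamond.  P(diamond | see) = P(see | diamond) P(diamond) / P(see)
p_see = (p_see_given_diamond * prior_diamond
         + p_see_given_no_diamond * (1 - prior_diamond))
posterior_diamond = p_see_given_diamond * prior_diamond / p_see

print(round(posterior_diamond, 3))  # ~0.98
```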
Factual Claims
A factual claim looks like “this is a way the world is.” Snow is white, my shoelaces are untied, my computer is working properly. A world model will have a specific structure, and that structure is defined by the model’s ontology. One world model might be a single best guess as to what the world is like; this is fine for thermostats, for example. Another might explicitly represent uncertainty, for example as a probability distribution over concrete world models. AIXI, for instance, uses computer programs as its concrete world models, and explicitly tracks how its observations and actions affect the probability that each program is an accurate model of reality.
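Here is a toy sketch of that second kind of world model, far simpler than AIXI but with the same shape: keep a probability distribution over a handful of candidate world models, and reweight them by how well they predict each observation. The candidate models and the uniform prior are illustrative assumptions.

```python
# Explicit uncertainty as a probability distribution over concrete world models.
# Each candidate model maps the observation history to P(next observation = 1).

world_models = {
    "always_one":  lambda history: 0.99,
    "always_zero": lambda history: 0.01,
    "alternating": lambda history: 0.99 if len(history) % 2 == 0 else 0.01,
}
beliefs = {name: 1.0 / len(world_models) for name in world_models}  # uniform prior

def update(beliefs, history, observation):
    """Bayes update: reweight each model by how well it predicted `observation`."""
    new = {}
    for name, model in world_models.items():
        p_one = model(history)
        likelihood = p_one if observation == 1 else 1.0 - p_one
        new[name] = beliefs[name] * likelihood
    total = sum(new.values())
    return {name: weight / total for name, weight in new.items()}

history = []
for obs in [1, 0, 1, 0]:  # observations consistent with the "alternating" model
    beliefs = update(beliefs, history, obs)
    history.append(obs)

print(max(beliefs, key=beliefs.get))  # "alternating"
```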
Similarly, some world models only try to track what’s going on right now. This again is fine for thermostats. But we can also explicitly track beliefs about the past and the future. Or other branches of the universal wave function. Or other parts of reality that aren’t causally connected to our part of reality at all. A world model needs to keep track of aspects of reality that matter to the system using that model. And we should expect to deploy our software systems into a world populated with other software systems, designed by other people with other goals and interests. In some cases, we’ll want our software systems to communicate with other systems that use a completely different ontology.
Up next: some factors that can help enable trust between systems, despite differences in ontology, epistemology, and incentives.
And after that: how all these pieces fit together into a distributed strategic epistemology.