I wrote this post during the first part of the SERI MATS winter school, in Vanessa Kosoy’s research stream. This is just an explanation of my intended research direction and the motivation behind it, and write down proofs for the statements in Vanessa’s comment describing the research direction.
The motivation behind infra-Bayesianism is to develop a better theory of intelligence.
Why do we need a theory of intelligence? When we build superhuman AIs, we can have two kinds of reasons to believe in their safety:
- empirical evidence coming from experiments with less capable AIs;
- mathematically proven guarantees of its performance.
Empirical evidence on its own may be unreliable, because, as the system capabilities improve, results may fail to generalize as we expected. We need a theory to give us performance guarantees and tell us how to correctly extrapolate observed results to new conditions.
Why do we need a better theory? The ideal Bayesian agent, known as AIXI, has several shortcomings, and one of them is realizability assumption. In order for a Bayesian agent to have performance guarantees: the true hypothesis must have a positive probability in the agent’s prior. However, this is unachievable in the real world, because the agent is contained in the environment, and so trying to have a true hypothesis about the environment runs into self-reference problems and computational constraints.
A related problem in game theory is the grain-of-truth problem. The term comes from the paper by Ehud Kalai and Ehud Lehrer , in which they prove that in an infinitely repeated game, subjectively rational players—i.e. players maximizing their expected reward according to their subjective beliefs about what strategies the opponent is likely to play—will learn to predict their opponents’ moves and their strategies will converge to Nash equilibrium strategies, provided that the “grain of truth” requirement is met: each player must assign a positive prior probability to any set of histories that has a positive chance of occurringk in the game.
Unfortunately, the “grain of truth” requirement is very strict, and when it’s not met, learning fails. Foster and Young  prove that in a repeated 2-player game with uncertain payoff, rational agents will fail to learn each other’s strategies and fail to converge to a Nash equilibrium, except for a set of games of measure 0. John H Nachbar  demonstrates that, loosely speaking, for any sufficiently rich set of strategies, it is impossible for each player to both predict their opponent’s strategy and have their own strategy be the best response to their beliefs about their opponent at the same time.
Infra-Bayesianism is a possible solution to the “grain of truth problem”, as it allows agents to have partial models of their environment. In this way, infra-Bayesian beliefs are similar to physical laws. For example, Newton’s laws of motion can predict the path of a falling object, but not the weather tomorrow or the outcome of an election. Instead of using a single probability distribution, Infra-Bayesianism defines hypotheses as convex sets of probability distributions. This allows Infra-Bayesian hypotheses to be agnostic about certain parts of the world.
Our hypothesis is that infra-Bayesian agents will have an advantage in iterated games over Bayesian agents. I want to try and compare their performance in iterated game setting. It will be interesting to see if infra-Bayesian agents have better convergence to Nash equilibria.
The folk theorem states that any payoff vector above the minimax of each player can be achieved by a subgame-perfect equilibrium of a repeated game. For 2-player games, minimax and maximin payoffs in mixed strategies are the same. Therefore the best guaranteed payoff in a 2-player game is the maximin payoff. However, achieving it would not be very interesting, because to receive a maximin payoff, the agent only needs to learn its own rewards, and doesn’t need to learn about other agents’ policies.
This is why it is interesting to consider another simple kind of repeated game: a population game. Instead of a fixed set of agents playing with each other, we consider populations of agents. At each turn, a random agent is selected from each population to participate in a one-shot game with players, which we refer to as the stage game. It is assumed that all populations are sufficiently large that the effect of any individual agent violating from their policy on the distribution of histories of other agents would be negligible.
In an immortal population game, the expected rewards of future games are discounted by a time factor of . In a mortal population game, the rewards are not discounted but instead each agent has a chance of dying after each turn. Each turn all dead agents are replaced with new agents with empty histories.
Let us introduce some notations:
is the set of pure strategies (actions) of the th agent in the game. We assume are finite sets.
[I’ll deal with deterministic rewards first, then see what happens in more complicated situations. For a deterministic reward, we can assume that a player knows his reward function.]
is the set of pure outcomes of a stage game. We assume that each player knows his own deterministic reward function .
The set of -round outcome histories is and is the set of all finite-length histories.
is the length of a history. is the length of a history .
Definition (distance between histories)
The distance between partial histories is defined as
where is some constant, is the first time the histories diverge, i.e. the smallest such that exactly one of the histories has fewer than steps, or they both have at least steps and their step is different. If the histories are identical, the distance is defined to be 0. It is easy to see that this is indeed a metric.
represents the set of probability distributions on .
Definition (distance between measures)
The Kantorovich-Rubinstein (KR) metric on the space is defined as follows:
where supremum is taken over all 1-Lipschitz
is the policy of player number i. It determines which mixed strategy player number chooses, given the full history it has seen so far. Upon seeing history , the probability that player will choose action is , which I may also write down as for brevity.
is the distribution of histories in the i-th population at the th turn of the repeated game.
Note how it is entirely possible for distributions of histories for different players to be different. For example, suppose that player 1 has two possible actions, and his policy is to take action in the first game randomly with probability 50% each, and then always repeat this action. Then histories of players in distribution 1 will have repeated actions for player 1, while histories of players in distribution 2 will have random actions for player 1, because player 1 is sampled randomly from the distribution 1, and each move is equally likely.
A set of policies is said to be in equilibrium if playing a different strategy for any agent, while all other agents (including other agents in population ) stick to their policies, does not result in greater expected reward.
What does an equilibrium look like? Populations are assumed to be sufficiently large that an effect of any individual agent’s action on the distribution of histories in the population is negligible. At turn , opponent number is going to play a mixed strategy with probability . So his moves follow probability distribution . It means that the best policy for an individual agent in population is to choose a best response policy to mixed policies at turn . (“Bar” notation means the element in list is skipped). Since each agent follows a best response policy, their probabilistic mixture is also a best response policy. So the weighted averages of the players’ policies at every turn are in a Nash equilibrium strategy for the single-turn game.
It is interesting to find out if choosing to be learning strategies will result in convergence to a Nash equilibrium in the limit of .
The evolution rule that produces from works as follows.
For the first step, for each we sample a history for player from Then we sample actions according to policies . We append the outcome of the game to the history of player number .
For the second step, each player dies with probability and is replaced with a new player with an empty history.
The first step is described by the formula
where stands for history without the last step, for the actions taken at the last step of h, and the sums on the right-hand side are taken over histories of all players except for player .
The second step is obtained by
where stands for the Dirac delta measure of an empty history.
is obtained from by a linear transformation, so it is enough to prove continuity of .
I am going to write down a proof for , it generalizes straightforwardly to any .
For any 1-Lipschitz function ,
Let us look at the second summand.
because and .
Choose such that .
Then the sum
Let us upper bound the first summand in (2):
where if and 0 otherwise.
Then is a 1-Lipschitz function from to [-1, 1].
Indeed, , so for any two histories and ,
Therefore, this expression can be bounded from above by
Now for the last summand in (2):
Bringing them together:
If follows that for any such that , this sum is less than .
Since every history has possible successors after 1 step,
We proved that for any such that the second summand in (1) is bounded from above by
The same argument proves that for any such that , the first summand is bounded from above by .
Now let us upper bound the last summand in (1).
It can be written down as
is the probability that player with a history will end up with history after one step. We can rewrite this expression as
because if is not equal to the history without the last step, the probability is equal to zero. Now rewrite this as
Let us prove that if is a 1-Lipschitz function , the same is true of .
The inequalities are obvious. For any two different histories and , any their two continuation histories and are at the same distance , and therefore . Since the integrands are close, the integrals must be close too:
The same way,
and so is 1-Lipschitz.
Therefore, the third summand is bounded from above by , so if we choose so that , so that and so that , we’ll get . Qed.
Notice how we also proved that if and are kept fixed, does not increase the distance, and therefore is a contraction.
Combining all components gives us the map .
We can extend a finite history to an infinite history by appending an infinite number of blank or “placeholder” turns represented by the symbol at the end of the original history. With this notation, is a subset of . The set (with discrete topology) is compact, and therefore their product is a compact space by the Tychonoff’s theorem. can be seen as a subset of : we take a probability distribution on and for any subset define.
The distance on is defined just like on : the distance between two histories is where is the first time the two histories diverge.
The topology induced on by this metric is the product topology on .
The product topology on can be generated by basis sets . The metric topology is generated by the unit balls . These two families coincide, since , where is the smallest integer such that . Therefore the topologies also coincide.
Since is compact, it must also be a complete space. The set of sequences with finitely many non- elements is countable and dense in . So is a separable space. This means that is a separable, completely metrizable topological space, also known as a Polish space.
Because is compact, the space of probability distributions over it, is compact in the weak* topology. (See theorem 10.2 in )
However, we are looking for fixed points among probability distributions on and, unfortunately, is not compact as is not closed in (. (A sequence of finite histories that are continuations of each other has a limit in , but it’s not a finite history.)
On the other hand, we can observe that a fixed distribution in has the following property:
If the proportion of players in population with a history length of is in the th generation, then the proportion of players in population with a history length of in the next generation will be x, since is the proportion of players with a -long history that survive the turn. This implies that in a fixed distribution the age-distribution must satisfy . This means that .
We call the subset of in which every population’s age distribution is .
is closed in the weak* topology.
Suppose is a converging sequence of measures, . It means that for any continuous function ,
For elements of , we can define length as the number of symbols before the first ocurrence of symbol (if there is no symbol , the length is infinite).
For a fixed nonnegative integer , consider the function which is equal to 1 if has length , and 0 otherwise. It is continuous, because if length of a history is , . The integral measures the proportion of histories in the distribution which have length ; the limit distribution then must preserve the age distribution and also belong in the set
As a closed subset of a compact set, is compact. It is also easy to see that is convex.
has a fixed point.
The Schauder fixed-point theorem implies that a continuous map from a compact convex subset of a Hausdorff topological vector space to itself has a fixed point. We showed that P is compact and convex, and T is continuous.
The set is a subset of the topological vector space of all finite signed measures on with weak* topology. Let us show that this vector space is Hausdorff. Take any two signed measures and , and suppose that no continuous function on separates them: . Then . The functional is a positive linear functional on . The space is metric and therefore Hausdorff; it is compact and therefore locally compact. Therefore, by the Riesz representation theorem, the measure corresponding to this functional is unique: and therefore .
Another possibly interesting observation is that the age distribution of histories is going to converge to the limit distribution.
The age distribution of histories converges to the distribution in the norm .
Since the map adds 1 step to every history and then replaces each agent with a newborn with probability , it induces a map on age distributions:
on age distributions. For this map is a fixed point and . It follows that any initial age distribution converges to .
What happens in the fixed points? In a fixed point, the distribution of opponents stays the same each term. It means that a player observes a sequence of independent identically distributed (IID) outcomes of their policy. Suppose that the policy satisfies the following property: upon observing an IID sequence of games, it converges to the optimal response in the limit of . Then in this limit, fixed points will also be Nash equilibria.
The property above can be satisfied even by very simple algorithms, for example, one that assumes that opponent’s move is randomly sampled from the set of his previous moves. Any Bayesian agent that includes all IID processes in its prior will satisfy this property.
Some interesting questions for research are:
Are any/all of the fixed points attractors?
Does convergence to a fixed point occur for all or at least almost all initial conditions?
Do all Nash equilibria correspond to fixed points?
Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?
 Kalai, Ehud, and Ehud Lehrer. “Rational learning leads to Nash equilibrium.” Econometrica: Journal of the Econometric Society (1993): 1019-1045.
 Dean P Foster and H Peyton Young. “On the
impossibility of predicting the behavior of rational agents.” Proceedings of the National
Academy of Sciences, 98(22):12848–12853,
 Nachbar, John H. “Beliefs in repeated games.” Econometrica 73.2 (2005): 459-480.
 Hadfield-Menell D. et al. Cooperative inverse reinforcement learning //Advances in neural information processing systems. – 2016. – Т. 29.
Hadfield-Menell D. et al. The off-switch game //Workshops at the Thirty-First AAAI Conference on Artificial Intelligence. – 2017.
Ergodic Theory by Charles Walkden, lecture 10