TLDR: This post introduces a novel and interesting game-theoretic solution concept and provides informal arguments for why robust (infra-Bayesian) reinforcement learning algorithms might be expected to produce this solution in the multi-agent setting. As such, it is potentially an important step towards understanding multi-agency.
Disclosure: This review is hardly impartial, since the post was written with my guidance and based on my own work.
Understanding multi-agency is, IMO, one of the most confusing and difficult challenges in the construction of a general theory of intelligent agents. I have a lot of uncertainty about what shape the solution should take even in the broadest brushstrokes, as I outlined in my recent five worlds taxonomy[1]. This is in contrast to uni-agency, where Formal Computational Realism (FCR) is, IMO, pretty close to at least nailing down the correct type signature and qualitative nature of the desiderata.
At the same time, understanding multi-agency seems quite important in the context of AI alignment. There are many sorts of multi-agent interactions that are potentially relevant:
AI-user is at the very core of the problem.
user-[arbitrary agent] is important since the AI is supposed to faithfully “represent” the user in those interactions, and since examining those interactions might be necessary for correctly interpreting the user’s preferences.
[counterfactual user]-[counterfactual user] is relevant to dealing with uncertainty during value learning.
user-user is important for multi-user alignment.
AI-[counterfactual agent] is important when considering inner alignment, since mesaoptimizers can sometimes be regarded as “acausal attacks” by counterfactual agents.
AI-[successor agent] seems important for thinking about self-improving / reproducing agents.
AI-AI is important if we expect a multipole scenario.
This post tells a particular story of what multi-agent theory might look like. In this story, agents converge to a new type of solution concept described in the “stable cycles for multiplayer games” section. (I call this solution “haggling equilibrium”.) As opposed to Nash equilibria, the “typical” (but not any) haggling equilibrium in a two-player game is Pareto-efficient. This stands in contrast even to Nash equilibria in repeated games, where Pareto-efficiency is possible but, due to the folk theorem, highly underdetermined.
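To make the contrast concrete, here is a minimal, self-contained illustration (my own example, not taken from the post): in the one-shot Prisoner’s Dilemma, the unique Nash equilibrium is Pareto-dominated, which is exactly the kind of inefficiency that a Pareto-efficient solution concept avoids. The payoff numbers are the standard textbook ones.

```python
# Illustrative example (not from the post): enumerate pure-strategy profiles
# of the one-shot Prisoner's Dilemma and check Nash / Pareto properties.
from itertools import product

C, D = "cooperate", "defect"
actions = [C, D]
# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
payoffs = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}

def is_nash(profile):
    """A profile is a (pure) Nash equilibrium if no player gains by unilaterally deviating."""
    a1, a2 = profile
    u1, u2 = payoffs[profile]
    row_ok = all(payoffs[(alt, a2)][0] <= u1 for alt in actions)
    col_ok = all(payoffs[(a1, alt)][1] <= u2 for alt in actions)
    return row_ok and col_ok

def is_pareto_efficient(profile):
    """No other profile makes someone better off without making anyone worse off."""
    u = payoffs[profile]
    for other in product(actions, repeat=2):
        v = payoffs[other]
        if v[0] >= u[0] and v[1] >= u[1] and v != u:
            return False
    return True

for profile in product(actions, repeat=2):
    print(profile, "Nash:", is_nash(profile), "Pareto-efficient:", is_pareto_efficient(profile))
# Only (defect, defect) is Nash, and it is NOT Pareto-efficient;
# (cooperate, cooperate) is Pareto-efficient but not Nash.
```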
Moreover, there is an argument that a particular type of robust RL algorithm (robust UCB) would converge to such equilibria under some assumptions; for orientation, a sketch of the classical (non-robust) UCB template appears after the list below. However, the argument is pretty informal and there is not even a rigorous conjecture at present. There are, broadly speaking, two possibilities for how the story might be completed:
We promote convergence to haggling equilibrium to a desideratum, and demonstrate algorithms that accomplish it with good statistical and computational efficiency. (This corresponds to the “Economica” world in my five worlds taxonomy.)
We show that there are reasonable uni-agent desiderata (robust regret bounds and maybe more?) that imply convergence to haggling equilibrium. (This corresponds to the “Harmonia” world in my five worlds taxonomy.)
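For readers unfamiliar with UCB-style algorithms, here is a minimal sketch of classical UCB1 on a stochastic multi-armed bandit. To be clear, this is only the ordinary, non-robust template; the robust (infra-Bayesian) variant the post argues about replaces the empirical-mean estimate with a worst-case value over a set of hypotheses, and is not reproduced here.

```python
# A minimal sketch of classical UCB1 on Bernoulli bandit arms.
# NOTE: this is ordinary UCB1 only; the "robust UCB" discussed in the post
# is an infra-Bayesian variant and is not implemented here.
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Play `horizon` rounds against Bernoulli arms with the given means."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # pulls per arm
    sums = [0.0] * n_arms   # total reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once to initialize the estimates
        else:
            # Optimism in the face of uncertainty: pick the arm with the
            # highest upper confidence bound on its mean reward.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, counts

reward, counts = ucb1([0.2, 0.5, 0.8], horizon=10_000)
print(reward, counts)  # most pulls should concentrate on the 0.8 arm
```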
With either possibility, the hope is that combining such a result with FCR would extend it to apply in more “exotic” contexts as well, such as one-shot games with transparent source code (along the lines of Demski’s “logical time”).
It is also interesting to study the notion of haggling equilibrium in itself, for example: is there always a Pareto-efficient haggling equilibrium? (True for two players, but I don’t know the answer in general.)
To summarize, the ideas in this post are, AFAIK, novel (although somewhat similar ideas appeared in the literature in the guise of “aspiration-based” algorithms in multi-agent RL, see e.g. Crandall and Goodrich 2013) and might be key to understanding multi-agency. However, the jury is still very much out.
In the terminology of those five worlds, I consider Nihiland and Discord to be quite unlikely, but Linguistica, Economica and Harmonia all seem plausible.