Vanessa Kosoy comments on Vanessa Kosoy’s Shortform

Vanessa Kosoy 13 Nov 2019 17:36 UTC
Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge to at least the maximin value for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more “philosophical” defense of maximin is possible, analogous to VNM / complete class theorems, but I don’t know (I actually saw some papers in that vein but haven’t read them in detail).

If the agent has random bits that Omega doesn’t see, and Omega is predicting the probabilities of the agent’s actions, then I think we can still solve it with quasi-Bayesian agents, but it requires considering more complicated models and I haven’t worked out the details. Specifically, I think that we can define some function $X$ that depends on the agent’s actions and Omega’s predictions so far (a measure of Omega’s apparent inaccuracy), such that if Omega is an accurate predictor, then the supremum of $X$ over time is finite with probability 1. Then, we consider a family of models, where model number $n$ says that $X < n$ for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately.
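
As a sketch of why at least one model in the family is true, the family can be written out explicitly. The labels $X_t$ (the value of $X$ at time $t$) and $H_n$ (model number $n$) are introduced here only for illustration and are assumptions about notation:

$$H_n = \{\, X_t < n \ \text{for all } t \,\}, \qquad H_1 \subseteq H_2 \subseteq \cdots, \qquad \Big\{ \sup_t X_t < \infty \Big\} = \bigcup_{n \ge 1} H_n.$$

So if the supremum of $X$ is finite with probability 1, then with probability 1 some $H_n$ holds.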

EDIT 1: I think $X$ should be something like the amount of money a gambler following a particular strategy would win by betting against Omega.

EDIT 2: Here is the solution. In the case of original Newcomb, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses $1$ dollar. Every time the agent one-boxes, the gambler wins $\frac{1}{p} - 1$ dollars, where $p$ is the probability Omega assigned to one-boxing. Now it’s possible to see that one-boxing guarantees the “CC” payoff under the corresponding model (in the $\gamma \to 1$ limit): if the agent one-boxes, the gambler keeps winning unless Omega’s prediction converges to one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace “one-boxes” by “follows the FDT strategy”.
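
The gambler's bookkeeping can be illustrated with a short simulation. This is a minimal sketch, not a worked-out construction: the function names (`gambler_gain`, `running_winnings`, `expected_gain`) and the two example prediction sequences are hypothetical choices made for illustration. It checks that the bet is fair when the agent really one-boxes with probability $p$, and contrasts an Omega that stays at $p = 0.5$ with one whose $p$ approaches 1 geometrically, both against an agent that always one-boxes.

```python
from itertools import accumulate


def gambler_gain(one_boxed: bool, p: float) -> float:
    """One-round gain of the gambler, given Omega's probability p of one-boxing."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # Win 1/p - 1 dollars on one-boxing, lose 1 dollar on two-boxing.
    return (1.0 / p - 1.0) if one_boxed else -1.0


def running_winnings(actions, predictions):
    """Cumulative winnings X after each round (illustrative stand-in for X)."""
    gains = (gambler_gain(a, p) for a, p in zip(actions, predictions))
    return list(accumulate(gains))


def expected_gain(p: float) -> float:
    """Expected one-round gain if the agent truly one-boxes with probability p."""
    return p * gambler_gain(True, p) + (1.0 - p) * gambler_gain(False, p)


rounds = 50
always_one_box = [True] * rounds

# Hypothetical Omega that never updates: the gambler wins 1 dollar per round,
# so the winnings grow without bound and every model "X < n" is eventually violated.
stubborn = running_winnings(always_one_box, [0.5] * rounds)

# Hypothetical Omega whose p approaches 1 geometrically: the per-round gains
# shrink fast enough to be summable, so the winnings stay bounded.
fast = running_winnings(
    always_one_box, [1.0 - 2.0 ** -(t + 1) for t in range(rounds)]
)

print(stubborn[-1], fast[-1], expected_gain(0.3))
```

The contrast between the two runs mirrors the claim above: against a consistent one-boxer, the gambler's winnings stay bounded only if Omega's predicted probability of one-boxing converges to 1 quickly enough.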