Vanessa Kosoy comments on Introduction To The Infra-Bayesianism Sequence

Vanessa Kosoy 24 Mar 2021 16:41 UTC
LW: 5 AF: 2

The central problem of <@embedded agency@>(@Embedded Agents@) is that there is no clean separation between an agent and its environment...

That’s certainly one way to motivate IB, however I’d like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity (in particular this must be the case if the environment contains other agents of similar or greater complexity).

The contribution of infra-Bayesianism is to show how to formally specify a decision procedure that uses Knightian uncertainty, while still satisfying many properties we would like a decision procedure to satisfy.

Well, the use of Knightian uncertainty (imprecise probability) in decision theory has certainly appeared in the literature, so it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory (i.e. treating sequential decision making and considering learnability and regret bounds in this setting) and applying that to various other questions (in particular, Newcombian paradoxes).

In particular, one thing that feels a bit odd to me is the choice of worst-case reasoning for the top level—I don’t really see anything that forces that to be the case. As far as I can tell we could get all the same results by using best-case reasoning instead (assuming we modified the other aspects appropriately).

The reason we use worst-case reasoning is because we want the agent to satisfy certain guarantees. Given a learnable class of infra-hypotheses, in the $γ \to 1$ limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don’t get anything analogous with best-case reasoning.
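
As a rough illustration of why the guarantee goes with the worst case, here is a minimal static sketch. It ignores learning and the $γ \to 1$ limit entirely, and the policy names, environment names and utility numbers are all hypothetical. It only shows that the maximin choice comes with a lower bound that holds in every environment consistent with the hypothesis, while the best-case choice comes with no such bound.

```python
from typing import Dict, List

# UTILITY[policy][environment] = expected utility of the policy in that environment.
# All names and numbers are made up for illustration.
UTILITY: Dict[str, Dict[str, float]] = {
    "cautious": {"env_a": 0.6, "env_b": 0.5},
    "gambler": {"env_a": 0.9, "env_b": 0.1},
}

# A (crisp) hypothesis: the set of environments it considers possible.
HYPOTHESIS: List[str] = ["env_a", "env_b"]


def worst_case_value(policy: str) -> float:
    """Value of a policy under worst-case (inf) reasoning."""
    return min(UTILITY[policy][env] for env in HYPOTHESIS)


def best_case_value(policy: str) -> float:
    """Value of a policy under best-case (sup) reasoning."""
    return max(UTILITY[policy][env] for env in HYPOTHESIS)


maximin_policy = max(UTILITY, key=worst_case_value)
maximax_policy = max(UTILITY, key=best_case_value)

if __name__ == "__main__":
    bound = worst_case_value(maximin_policy)
    for env in HYPOTHESIS:
        # The maximin value is a lower bound in every environment of the hypothesis.
        print(maximin_policy, env, UTILITY[maximin_policy][env] >= bound)
    for env in HYPOTHESIS:
        # The best-case value is only an upper bound, so it promises nothing.
        print(maximax_policy, env, UTILITY[maximax_policy][env], best_case_value(maximax_policy))
```

In this toy, the value that best-case reasoning assigns to its chosen policy is attained in only one of the two environments, so it cannot serve as a guarantee about the true environment.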

Moreover, there is an (unpublished) theorem showing that virtually any guarantee you might want to impose can be written in IB form. That is, let $E$ be the space of environments, and let $g_{n} : E \to [0, 1]$ be an increasing sequence of functions. We can interpret every $g_{n}$ as a requirement about the policy: $\forall μ : \mathbb{E}_{μ π} [U] \geq g_{n} (μ)$. These requirements become stronger with increasing $n$. We might then want $π$ to be s.t. it satisfies the requirement with the highest $n$ possible. The theorem then says that (under some mild assumptions about the functions $g$) there exists an infra-environment s.t. optimizing for it is equivalent to maximizing $n$. (We can replace $n$ by a continuous parameter, I made it discrete just for ease of exposition.)
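
Written out, the setup above looks roughly as follows. This is only a restatement of the paragraph, not the theorem's actual formulation: $\mathbb{E}_{\mu\pi}[U]$ denotes the expected utility of policy $\pi$ in environment $\mu$, and the symbols $R_n$, $n^*$, $\Theta$ and $\mathbb{E}_{\Theta\pi}[U]$ are labels introduced here for illustration.

```latex
% Policies meeting the n-th requirement; the sets shrink because g_n is increasing in n.
R_n := \bigl\{ \pi \;:\; \forall \mu \in E,\ \mathbb{E}_{\mu\pi}[U] \ge g_n(\mu) \bigr\},
\qquad R_{n+1} \subseteq R_n .

% Strongest requirement that a given policy satisfies.
n^*(\pi) := \sup \{\, n \;:\; \pi \in R_n \,\} .

% Claimed conclusion (under the mild assumptions on g): there is an
% infra-environment \Theta whose optimal policies are those maximizing n^*.
\exists\, \Theta : \quad
\arg\max_{\pi} \mathbb{E}_{\Theta\pi}[U] \;=\; \arg\max_{\pi} n^*(\pi) .
```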

The obvious justification for worst-case reasoning is that it is a form of risk aversion, but it doesn’t feel like that is really sufficient—risk aversion in humans is pretty different from literal worst-case reasoning, and also none of the results in the post seem to depend on risk aversion.

Actually, it might not be that different. The Legendre-Fenchel duality shows you can think of infradistributions as just concave expectation functionals, which seems like a fairly general way to add risk-aversion to decision theory. It is also used in mathematical economics, see Peng.
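
To spell out the duality a little, here is a sketch in the notation of the sequence, where an a-measure is a pair $(m, b)$ of a finite measure and a constant $b \ge 0$. The symbols $\Psi$, $h_\Psi$ and $\Psi_h$ are labels chosen for this note, and the regularity conditions are left implicit.

```latex
% Expectation functional induced by a set \Psi of a-measures (m, b):
h_\Psi(f) := \inf_{(m,b) \in \Psi} \bigl( m(f) + b \bigr),
\qquad m(f) := \int f \, dm .

% Each map f \mapsto m(f) + b is affine, and an infimum of affine maps is concave:
h_\Psi\bigl(\lambda f + (1-\lambda) g\bigr)
  \;\ge\; \lambda\, h_\Psi(f) + (1-\lambda)\, h_\Psi(g),
\qquad \lambda \in [0,1] .

% Conversely, a suitable monotone concave functional h is recovered from the
% set of affine functionals lying above it:
\Psi_h := \bigl\{ (m,b) \;:\; m(f) + b \ge h(f) \ \text{for all } f \bigr\},
\qquad h = h_{\Psi_h} .
```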

it seems interesting to characterize what makes some rules work while others don’t.

Another rule which is tempting to use (and is known in the literature) is minimax-regret. However, it’s possible to show that if you allow your hypotheses to depend on the utility function then you can reduce it to ordinary maximin.
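
One way such a reduction can go is sketched below. It assumes utilities take values in $[0, 1]$ and that hypotheses are a-environments whose constant term may depend on $U$. The symbols $H$, $V^*_U$, $\mathrm{Reg}$ and $b_U$ are introduced here for illustration, and this is not necessarily the construction intended above.

```latex
% Best achievable value in environment \mu, and the regret of policy \pi:
V^*_U(\mu) := \sup_{\pi'} \mathbb{E}_{\mu\pi'}[U],
\qquad
\mathrm{Reg}(\pi, \mu) := V^*_U(\mu) - \mathbb{E}_{\mu\pi}[U] .

% Negating the regret and adding the constant 1 changes neither the optimal policy
% nor the sign constraint on the offset:
\arg\min_{\pi} \sup_{\mu \in H} \mathrm{Reg}(\pi, \mu)
  \;=\; \arg\max_{\pi} \inf_{\mu \in H}
        \bigl( \mathbb{E}_{\mu\pi}[U] + b_U(\mu) \bigr),
\qquad
b_U(\mu) := 1 - V^*_U(\mu) \;\ge\; 0 .
```

The right-hand side is an ordinary maximin over the pairs $(\mu, b_U(\mu))$, with the dependence on the utility function pushed into the offset $b_U$.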
- Rohin Shah 24 Mar 2021 17:55 UTC
  LW: 4 AF: 3
  I’d like to note that even if there was a clean separation between an agent and its environment, it could still be the case that the environment cannot be precisely modeled by the agent due to its computational complexity
  Yeah, agreed. I’m intentionally going for a simplified summary that sacrifices details like this for the sake of cleaner narrative.
  it would be more fair to say that the contribution of IB is combining that with reinforcement learning theory
  Ah, whoops. Live and learn.
  The reason we use worst-case reasoning is because we want the agent to satisfy certain guarantees. Given a learnable class of infra-hypotheses, in the γ→1 limit, we can guarantee that whenever the true environment satisfies one of those hypotheses, the agent attains at least the corresponding amount of expected utility. You don’t get anything analogous with best-case reasoning.
  Okay, that part makes sense. Am I right though that in the case of e.g. Newcomb’s problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)? (I think I was a bit too focused on the specific UDT / Nirvana trick ideas.)
  Actually, it might not be that different. The Legendre-Fenchel duality shows you can think of infradistributions as just concave expectation functionals, which seems like a fairly general way to add risk-aversion to decision theory.
  Yeah… I’m a bit confused about this. If you imagine choosing any concave expectation functional, then I agree that can model basically any type of risk aversion. But it feels like your infra-distribution should “reflect reality” or something along those lines, which is an extra constraint. If there’s a “reflect reality” constraint and a “risk aversion” constraint and these are completely orthogonal, then it seems like you can’t necessarily satisfy both constraints at the same time.
  On the other hand, maybe if I thought about it for longer, I’d realize that the things we think of as “risk aversion” are actually identical to the “reflect reality” constraint when we are allowed to have Knightian uncertainty over some properties of the environment. In that case I would no longer have my objection.
  To be a bit more concrete: imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can’t model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you’d see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but “reflecting reality” might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.
  I am curious what happens in this scenario if you set the concave expectation functional based on the “risk aversion” setting above, and then use duality to get the “convex set of distributions” formulation—would the resulting object be meaningful to us?
  - Vanessa Kosoy 25 Mar 2021 20:42 UTC
    LW: 4 AF: 3
    
    Am I right though that in the case of e.g. Newcomb’s problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?
    
    Yes
    
    imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can’t model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you’d see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but “reflecting reality” might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.
    
    I think that if you are offered a single bet, your utility is linear in money and your belief is a crisp infradistribution (i.e. a closed convex set of probability distributions) then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider $X := \{0, 1\}$ and take the set of a-measures generated by $3 δ_{0}$ and $δ_{1}$. Suppose you start with $\frac{1}{2}$ dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting $\frac{1}{4}$ dollars on the outcome $1$, with a value of $\frac{3}{4}$ dollars.
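    
    The arithmetic behind this example can be checked as follows. This is a sketch that assumes the infimum over the generated set is attained at the two generators for nonnegative wealth. The stake $s$, the wealth function $w_s$, the value $V$, the crisp set $C$ and $\underline{p}$ are labels introduced here.

    ```latex
    % Betting s \in [0, 1/2] on outcome 1 at even odds gives final wealth
    w_s(1) = \tfrac{1}{2} + s, \qquad w_s(0) = \tfrac{1}{2} - s .

    % Worst case over the generators 3\delta_0 and \delta_1:
    V(s) = \min\bigl\{\, 3\bigl(\tfrac{1}{2} - s\bigr),\ \tfrac{1}{2} + s \,\bigr\} .

    % The first term decreases in s and the second increases, so the maximum is at
    3\bigl(\tfrac{1}{2} - s\bigr) = \tfrac{1}{2} + s
    \;\Longrightarrow\; s = \tfrac{1}{4}, \qquad V\bigl(\tfrac{1}{4}\bigr) = \tfrac{3}{4} .

    % Betting s on outcome 0 instead gives
    \min\bigl\{\, 3\bigl(\tfrac{1}{2} + s\bigr),\ \tfrac{1}{2} - s \,\bigr\}
      = \tfrac{1}{2} - s \;\le\; \tfrac{1}{2} .

    % Contrast with a crisp set C of probability distributions: the value is linear in s,
    % so the optimum is at an endpoint (bet everything or nothing).
    \inf_{p \in C} \mathbb{E}_p[w_s] = \tfrac{1}{2} + s\,\bigl(2\,\underline{p} - 1\bigr),
    \qquad \underline{p} := \inf_{p \in C} p(1) .
    ```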
    - Rohin Shah 25 Mar 2021 21:25 UTC
      LW: 2 AF: 2
      But for more general infradistributions this need not be the case. For example, consider $X := \{0, 1\}$ and take the set of a-measures generated by $3 δ_{0}$ and $δ_{1}$. Suppose you start with $\frac{1}{2}$ dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting $\frac{1}{4}$ dollars on the outcome $1$, with a value of $\frac{3}{4}$ dollars.
      I guess my question is more like: shouldn’t there be some aspect of reality that determines what my set of a-measures is? It feels like here we’re finding a set of a-measures that rationalizes my behavior, as opposed to choosing a set of a-measures based on the “facts” of the situation and then seeing what behavior that implies.
      I feel like we agree on what the technical math says, and I’m confused about the philosophical implications. Maybe we should just leave the philosophy alone for a while.
      - Vanessa Kosoy 29 Mar 2021 16:26 UTC
        LW: 4 AF: 3
        IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it’s not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if the distribution is inside the set then we have some lower bound on expected utility (and if it’s not then we don’t promise anything). On the other hand, non-crisp gives a lower bound that varies with the true distribution. We can think of non-crisp infradistributions as being fuzzy properties of the distribution (hence the name “crisp”). In fact, if we restrict ourselves to any of homogeneous, cohomogeneous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistributions, i.e. literally regard them as fuzzy sets of distributions (which ofc have to satisfy some property analogous to convexity).
  - Diffractor 24 Mar 2021 19:17 UTC
    LW: 3 AF: 3
    If you use the Anti-Nirvana trick, your agent just goes “nothing matters at all, the foe will mispredict and I’ll get -infinity reward” and rolls over and cries since all policies are optimal. Don’t do that one, it’s a bad idea.
    
    For the concave expectation functionals: Well, there’s another constraint or two, like monotonicity, but yeah, LF duality basically says that you can turn any (monotone) concave expectation functional into an inframeasure. I.e., all risk aversion can be interpreted as having radical uncertainty over some aspects of how the environment works and assuming you get worst-case outcomes from the parts you can’t predict.
    
    For your concrete example, that’s why you have multiple hypotheses that are learnable. Sure, one of your hypotheses might have complete Knightian uncertainty over the odd bits, but another hypothesis might not. Betting on the odd bits is advised by a more-informative hypothesis, for sufficiently good bets. And the policy selected by the agent would probably be something like “bet on the odd bits occasionally, and if I keep losing those bets, stop betting”, as this wins in the hypothesis where some of the odd bits are predictable, and doesn’t lose too much in the hypothesis where the odd bits are completely unpredictable and out to make you lose.
    - Rohin Shah 24 Mar 2021 20:51 UTC
      LW: 2 AF: 2
      If you use the Anti-Nirvana trick, your agent just goes “nothing matters at all, the foe will mispredict and I’ll get -infinity reward” and rolls over and cries since all policies are optimal. Don’t do that one, it’s a bad idea.
      Sorry, I meant the combination of best-case reasoning (sup instead of inf) and the anti-Nirvana trick. In that case the agent goes “Murphy won’t mispredict, since then I’d get -infinity reward which can’t be the best that I do”.
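      
      The variants in this exchange can be compared on a toy Newcomb problem. This is a sketch: the payoffs $10^6$ for one-boxing and $10^3$ for two-boxing under a correct prediction are the usual illustrative numbers, not taken from the thread. Here $a$ is the action, $p$ the prediction, and $\tilde U_{+}$, $\tilde U_{-}$ are labels for the modified utilities.

      ```latex
      % Nirvana trick with worst-case reasoning: a misprediction yields +\infty.
      \tilde U_{+}(a, p) = \begin{cases} U(a, p) & p = a \\ +\infty & p \ne a \end{cases}
      \qquad\Longrightarrow\qquad
      \inf_{p} \tilde U_{+}(a, p) = U(a, a) .

      % Anti-Nirvana trick with best-case reasoning: a misprediction yields -\infty.
      \tilde U_{-}(a, p) = \begin{cases} U(a, p) & p = a \\ -\infty & p \ne a \end{cases}
      \qquad\Longrightarrow\qquad
      \sup_{p} \tilde U_{-}(a, p) = U(a, a) .

      % Either way the agent compares correctly predicted outcomes and one-boxes:
      U(\text{one-box}, \text{one-box}) = 10^{6} \;>\; U(\text{two-box}, \text{two-box}) = 10^{3} .

      % Anti-Nirvana trick with worst-case reasoning: every action gets the same value,
      % so all policies tie.
      \inf_{p} \tilde U_{-}(a, p) = -\infty \quad \text{for every } a .
      ```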
      For your concrete example, that’s why you have multiple hypotheses that are learnable.
      Hmm, that makes sense, I think? Perhaps I just haven’t really internalized the learning aspect of all of this.