Said actions (or lack thereof) cause a fairly low utility differential compared to the actions in other, non-doomy hypotheses. I also want to draw a critical distinction between “full Knightian uncertainty over meteor presence or absence”, where your analysis is correct, and “ordinary probabilistic uncertainty between a high-Knightian-uncertainty hypothesis and a low-Knightian-uncertainty one that says the meteor almost certainly won’t happen”. In the latter case, the meteor hypothesis will be ignored unless there’s a meteor-inspired modification to what you do that’s also very cheap in the “ordinary uncertainty” world, like calling your parents. That’s because the meteor hypothesis is suppressed in decision-making by its low expected-utility differentials, and we’re maximin-ing expected utility.
Something analogous to what you are suggesting occurs. Specifically, let’s say you assign 95% probability to the bandit game behaving as normal, and 5% to “oh no, anything could happen, including the meteor”. As it turns out, this behaves similarly to the ordinary bandit game being guaranteed, as the “maybe meteor” hypothesis assigns all your possible actions a score of “you’re dead” so it drops out of consideration.
The important aspect which a hypothesis needs, in order for you to ignore it, is that no matter what you do you get the same outcome, whether it be good or bad. A “meteor of bliss hits the earth and everything is awesome forever” hypothesis would also drop out of consideration because it doesn’t really matter what you do in that scenario.
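The drop-out behavior can be checked with a small toy (my own sketch, not from the post; here I flatten each hypothesis to a single value per action, standing in for its worst-case expected utility, and the names `best_action`, `doom`, `bliss` are all mine):

```python
# Toy sketch: score each action by the probability-weighted mix of each
# hypothesis's value for it, then pick the best. A hypothesis that gives
# every action the same value -- doom meteor or bliss meteor -- shifts
# all scores equally and never changes the chosen action.

def best_action(actions, hypotheses, weights):
    def score(a):
        return sum(w * h(a) for h, w in zip(hypotheses, weights))
    return max(actions, key=score)

actions = ["lever_1", "lever_2"]
ordinary = {"lever_1": 5.0, "lever_2": 3.0}.get  # normal bandit payoffs
doom = lambda a: -1000.0    # "meteor: you're dead no matter what"
bliss = lambda a: 1000.0    # "meteor of bliss: awesome no matter what"

assert best_action(actions, [ordinary], [1.0]) == "lever_1"
assert best_action(actions, [ordinary, doom], [0.95, 0.05]) == "lever_1"
assert best_action(actions, [ordinary, bliss], [0.95, 0.05]) == "lever_1"
```

The constant hypothesis contributes the same number to every action's score, which is exactly why it drops out of the argmax.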
To be a wee bit more mathy, a probabilistic mix of inframeasures works like this. We’ve got a probability distribution ζ∈ΔN, and a bunch of hypotheses ψi∈□X, things that take functions as input and return expectation values. So your prior, your probabilistic mixture of hypotheses according to your probability distribution, would be the function

ψζ(f):=Ei∼ζ[ψi(f)]=∑iζi⋅ψi(f)
It gets very slightly more complicated when you’re dealing with environments, instead of static probability distributions, but it’s basically the same thing. And so, if you vary your actions/vary your choice of function f, and one of the hypotheses ψi is assigning all these functions/choices of actions the same expectation value, then it can be ignored completely when you’re trying to figure out the best function/choice of actions to plug in.
So, hypotheses that are like “you’re doomed no matter what you do” drop out of consideration; an infra-Bayes agent will always focus on the remaining hypotheses that say that what it does matters.
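A finite toy version of this mixture-and-drop-out point (my own sketch; I assume a finite outcome space, each hypothesis given as a finite set of probability distributions, and ψi(f) computed as the worst-case expectation of f within hypothesis i — the names `psi`, `prior`, `hyp_A`, `hyp_flat` are all mine):

```python
# A hypothesis is a finite set of probability distributions over a
# finite outcome space; psi takes its worst-case expectation, and prior
# mixes the hypotheses' values with weights zeta.

def expectation(p, f):
    return sum(p[x] * f[x] for x in p)

def psi(hypothesis, f):
    """Worst-case expectation of f over a set of distributions."""
    return min(expectation(p, f) for p in hypothesis)

def prior(hypotheses, zeta, f):
    """The probabilistic mixture: the zeta-weighted sum of psi_i(f)."""
    return sum(z * psi(h, f) for h, z in zip(hypotheses, zeta))

def argmax_f(candidates, hypotheses, zeta):
    """Pick the candidate function with the best mixture value."""
    return max(candidates, key=lambda f: prior(hypotheses, zeta, f))

# Outcome space {0, 1}; candidate functions as dicts outcome -> value.
f1 = {0: 1.0, 1: 0.0}
f2 = {0: 0.0, 1: 1.0}

hyp_A = [{0: 0.8, 1: 0.2}, {0: 0.6, 1: 0.4}]  # set of two distributions
hyp_flat = [{0: 0.5, 1: 0.5}]  # values f1 and f2 identically (both 0.5)

# hyp_flat assigns both candidates the same expectation, so adding it
# (with any weight) never changes which candidate wins:
assert argmax_f([f1, f2], [hyp_A], [1.0]) == f1
assert argmax_f([f1, f2], [hyp_A, hyp_flat], [0.5, 0.5]) == f1
```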
Well, taking worst-case uncertainty is what infradistributions do. Did you have anything in mind that can be done with Knightian uncertainty besides taking the worst-case (or best-case)?
And if you were dealing with best-case uncertainty instead, then the corresponding analogue would be assuming that you go to hell if you’re mispredicted (and then, since best-case things happen to you, the predictor must accurately predict you).
This post is still endorsed; it still feels like a continually fruitful line of research. A notable aspect of it is that, as time goes on, I keep finding more connections and crisper ways of viewing things, which means that for many of the further linked posts about inframeasure theory, I think I could explain them from scratch better than the existing work does. One striking example is that the “Nirvana trick” stated in this intro (to encode nonstandard decision-theory problems) has transitioned from “weird hack that happens to work” to “pops straight out when you make all the math as elegant as possible”. Accordingly, I’m working on a “living textbook” (like a textbook, but continually being updated with whatever cool new things we find) where I try to explain everything from scratch in the crispest way possible, so readers can quickly catch up to the frontier of what we’re working on. That’s my current project.
I still do think that this is a large and tractable vein of research to work on, and the conclusion hasn’t changed much.
Availability: Almost all times between 10 AM and PM, California time, regardless of day. Highly flexible hours. Text over voice is preferred, I’m easiest to reach on Discord. The LW Walled Garden can also be nice.
A note to clarify for confused readers of the proof. We started out by assuming □(cross→U=−10), and cross. We conclude □(cross→U=10)∨□(cross→U=0) by how the agent works. But the step from there to □⊥ (ie, inconsistency of PA) isn’t entirely spelled out in this post.

Pretty much, that follows from a proof by contradiction. Assume con(PA), ie ¬□⊥. It happens to be a con(PA) theorem that the agent can’t prove in advance what it will do, ie, ¬□(¬cross). (I can spell this out in more detail if anyone wants.) However, combining □(cross→U=−10) and □(cross→U=10) (or the other option) gets you □(¬cross), which, along with ¬□(¬cross), gets you ⊥. So PA isn’t consistent, ie, □⊥.
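For readers who want the contradiction laid out line by line, here is one way to arrange the steps above (my own arrangement of the argument, nothing new added):

```latex
% Reconstruction of the omitted step; assume con(PA) for contradiction.
\begin{align*}
&(1)\quad \neg\Box\bot
  && \text{assumption: con(PA)}\\
&(2)\quad \Box(\mathrm{cross}\to U{=}{-}10)
  && \text{given}\\
&(3)\quad \Box(\mathrm{cross}\to U{=}10)\ \lor\ \Box(\mathrm{cross}\to U{=}0)
  && \text{how the agent works}\\
&(4)\quad \Box(\neg\mathrm{cross})
  && \text{(2),(3): } U \text{ can't take two values, so } \Box(\mathrm{cross}\to\bot)\\
&(5)\quad \neg\Box(\neg\mathrm{cross})
  && \text{con(PA) theorem: agent can't prove its action in advance}\\
&(6)\quad \bot
  && \text{(4),(5)}
\end{align*}
% Discharging the assumption: \neg\mathrm{con(PA)}, ie \Box\bot.
```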
In the proof of Lemma 3, it should be “Finally, since χFC(z,z)=z, we have that polyFC(z)⋅polyFB∖C(z)=QFz. Thus, QFz⋅QFx∩y∩z and QFx∩z⋅QFy∩z are both equal to polyFC(x∩z)⋅polyFB∖C(y∩z)⋅polyFC(z)⋅polyFB∖C(z).” instead.
Any idea of how well this would generalize to stuff like Chicken, or to games with more than 2 players or 2 moves?
I was subclinically depressed, acquired some bupropion from Canada, and it’s been extremely worthwhile.
I don’t know; we’re hunting for it. Relaxations of dynamic consistency would be extremely interesting if found, and I’ll let you know if we turn up anything nifty.
Looks good. Re: the dispute over normal bayesianism: For me, “environment” denotes “thingy that can freely interact with any policy in order to produce a probability distribution over histories”. This is a different type signature than a probability distribution over histories, which doesn’t have a degree of freedom corresponding to which policy you pick.

But for infra-bayes, we can associate a classical environment with the set of probability distributions over histories (for various possible choices of policy), and then the two distinct notions become the same sort of thing (a set of probability distributions over histories, some of which can be made inconsistent by how you act), so you can compare them.
I’d say this is mostly accurate, but I’d amend number 3. There’s still a sort of non-causal influence going on in pseudocausal problems; you can easily formalize counterfactual mugging and XOR blackmail as pseudocausal problems (you need acausal specifically for transparent Newcomb, not vanilla Newcomb). But it’s specifically a sort of influence that’s like “reality will adjust itself so contradictions don’t happen, and there may be correlations between what happened in the past, or other branches, and what your action is now, so you can exploit this by acting to make bad outcomes inconsistent”. It’s purely action-based, in a way that manages to capture some but not all weird decision-theoretic scenarios.

In normal bayesianism, you do not have a pseudocausal-causal equivalence. Every ordinary environment is straight-up causal.
Re points 1, 2: Check this out. For the specific case of 0 to even bits, ??? to odd bits, I think Solomonoff can probably get that, but not more general relations.

Re: point 3, Solomonoff is about stochastic environments that just take your action as an input, and aren’t reading your policy. For infra-Bayes, you can deal with policy-dependent environments without issue, as you can consider hard-coding in every possible policy to get a family of stochastic environments, and UDT behavior naturally falls out as a result of this encoding. There’s still some open work to be done on which sorts of policy-dependent environments like this are learnable (inferrable from observations), but it’s pretty straightforward to cram all sorts of weird decision-theory scenarios in as infra-Bayes hypotheses, and do the right thing in them.
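The hard-coding construction can be sketched in miniature (my own toy construction, not from the comment; the Newcomb-flavored payoffs, and the names `env_given_policy` and `value`, are assumptions purely for illustration):

```python
# Turn a policy-dependent environment into a family of ordinary
# environments by hard-coding each possible policy, then score each
# policy in the environment built from *it*.

ACTS = ["one_box", "two_box"]

def env_given_policy(policy):
    """The environment with `policy` hard-coded in: the 'predictor'
    fills the box based on the policy, not on any later deviation."""
    predicted = policy["obs"]
    return {"one_box": 100.0 if predicted == "one_box" else 0.0,
            "two_box": 110.0 if predicted == "one_box" else 10.0}

policies = [{"obs": a} for a in ACTS]

def value(policy):
    # Each policy is evaluated in the stochastic environment obtained
    # by hard-coding that very policy in.
    return env_given_policy(policy)[policy["obs"]]

best = max(policies, key=value)
assert best == {"obs": "one_box"}  # one-boxing wins: 100 > 10
```

Evaluating each policy against the environment indexed by itself is how the UDT-style behavior falls out: the agent never gets to "deviate" away from the policy the environment was built around.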
Ah. So, low expected utility alone isn’t too much of a problem. The amount of weight a hypothesis has in a prior after updating depends on the gap between the best-case values and worst-case values. Ie, “how much does it matter what happens here”. So, the stuff that withers in the prior as you update are the hypotheses that are like “what happens now has negligible impact on improving the worst-case”. Hypotheses that are like “you are screwed no matter what” just drop out completely: if it doesn’t matter what you do, you might as well pick actions that optimize the other hypotheses that aren’t quite so despondent about the world.

In particular, if all the probability distributions in a set are like “this thing that just happened was improbable”, the hypothesis takes a big hit in the posterior, as all the a-measures are like “ok, we’re in a low-measure situation now, what happens after this point has negligible impact on utility”. I still need to better understand how updating affects hypotheses which are a big set of probability distributions, so there’s always one probability distribution that’s like “I correctly called it!”.

The motivations for different g are: If g is your actual utility function, then updating with g as your off-event utility function grants you dynamic consistency. Past-you never regrets turning over the reins to future-you, and you act just as UDT would. If g is the constant-1 function, then that corresponds to updates where you don’t care at all what happens off-history (the closest thing to normal updates), and both the “diagonalize against knowing your own action” behavior in decision theory and the Nirvana trick pop out for free from using this update.
“Mixture of infradistributions” is just an infradistribution, much like how a mixture of probability distributions is a probability distribution. Let’s say we’ve got a prior ζ∈ΔN, a probability distribution over indexed hypotheses.

If you’re working in a vector space, you can take any countable collection of sets in said vector space and mix them together according to a prior ζ∈ΔN giving a weight to each set. Just make the set of all points which can be made by the process “pick a point from each set, and mix the points together according to the probability distribution ζ”.

For infradistributions as sets of probability distributions or a-measures or whatever, that’s a subset of a vector space. So you have a bunch of sets Ψi, and you just mix the sets together according to ζ; that gives you your set Ψζ.

If you want to think about the mixture in the concave functional view, it’s even nicer. You have a bunch of ψi:(X→R)→R which are “hypothesis i can take a function and output what its worst-case expectation value is”. The mixture of these, ψζ, is simply defined as ψζ(f):=Ei∼ζ[ψi(f)]. This is just mixing the functions together!

Both of these ways of thinking of mixtures of infradistributions are equivalent, and recover mixture of probability distributions as a special case.
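The equivalence of the two descriptions can be checked numerically in a finite toy (my own sketch: hypotheses are finite sets of distributions over two outcomes, represented as tuples, and `worst` is the worst-case expectation):

```python
import itertools

# Check that mixing the *sets* pointwise with weights zeta, then taking
# worst-case expectation, equals the zeta-weighted sum of each set's
# worst-case expectation.

def expect(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def worst(dist_set, f):
    return min(expect(p, f) for p in dist_set)

psi_1 = [(0.9, 0.1), (0.5, 0.5)]   # hypothesis 1: a set of distributions
psi_2 = [(0.2, 0.8)]               # hypothesis 2
zeta = (0.7, 0.3)

# Set-mixture: every point  zeta_1*p + zeta_2*q  with p in psi_1, q in psi_2.
mixed_set = [tuple(zeta[0] * a + zeta[1] * b for a, b in zip(p, q))
             for p, q in itertools.product(psi_1, psi_2)]

for f in [(1.0, 0.0), (0.0, 1.0), (2.0, -1.0)]:
    functional_mix = zeta[0] * worst(psi_1, f) + zeta[1] * worst(psi_2, f)
    assert abs(worst(mixed_set, f) - functional_mix) < 1e-12
```

The two agree because the minimization over the mixed set separates: each component of the convex combination can be minimized independently.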
The concave functional view is: “the thing you do with a probability distribution is take expectations of functions with it. In fact, it’s actually possible to identify a probability distribution with the function (X→R)→R mapping a function to its expectation. Similarly, the thing we do with an infradistribution is taking expectations of functions with it. Let’s just look at the behavior of the function (X→R)→R we get, and neglect the view of everything as a set of a-measures.”

As it turns out, this view makes proofs a whole lot cleaner and tidier, and you only need a few conditions on a function like that for it to have a corresponding set of a-measures.
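One of the conditions such a functional needs is concavity, and for the worst-case-expectation functional this can be spot-checked numerically (my own toy; finite outcome space, a set of distributions as tuples):

```python
import random

# The map f -> min over a set of distributions of E_p[f] is a minimum
# of linear functionals, hence concave. Spot-check that numerically.

random.seed(0)

def worst(dist_set, f):
    return min(sum(pi * fi for pi, fi in zip(p, f)) for p in dist_set)

dist_set = [(0.9, 0.1), (0.5, 0.5), (0.3, 0.7)]

for _ in range(100):
    f = (random.uniform(-1, 1), random.uniform(-1, 1))
    g = (random.uniform(-1, 1), random.uniform(-1, 1))
    lam = random.random()
    blend = tuple(lam * a + (1 - lam) * b for a, b in zip(f, g))
    # Concavity: value at the blend >= blend of the values.
    assert worst(dist_set, blend) >= (lam * worst(dist_set, f)
                                      + (1 - lam) * worst(dist_set, g) - 1e-12)

# A singleton set recovers an ordinary distribution: the functional is
# then just the (linear) expectation, a special case of concavity.
```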