# abramdemski (Abram Demski)

Karma: 17,578
• The fact that I could gamble more wisely if I had access to more computation doesn’t seem to undercut the reasons for using probabilities when I don’t.

I am not trying to undercut the use of probability in the broad sense of using numbers to represent degrees of belief.

However, if “probability” means “the Kolmogorov axioms”, we can easily undercut these by the argument you mention: we can consider a (quite realistic!) case where we don’t have enough computational power to enforce the Kolmogorov axioms precisely. We conclude that we should avoid easily-computed Dutch books, but may be vulnerable to some hard-to-compute Dutch books.

Now in the extreme adversarial case, a bookie could come along who knows my computational limits and only offers me bets where I lose in expectation. But this is also a problem for empirical uncertainty; in both cases, if you literally face a bookie who is consistently winning money from you, you could eventually infer that they know more than you and stop accepting their bets. I still see no fundamental difference between empirical and logical uncertainties.

Yes, exactly. In the perspective I am offering, the only difference between bookies who we stop betting with due to a history of losing money, vs bookies we stop betting with due to a priori knowing better, is that the second kind corresponds to something we already knew (already had high prior weight on).

In the classical story, however, there are bookies we avoid a priori as a matter of logic alone (we could say that the classical perspective insists that the Kolmogorov axioms are known a priori—which is completely fine and good if you’ve got the computational power to do it).

• Here’s an explanation that may help.

You can think of classical Bayesian reasoning as justified by Dutch Book arguments. However, for a Dutch Book argument to be convincing, there’s an important condition that we need: the bookie needs to be just as ignorant as the agent. If the bookie makes money off the agent because the bookie knows an insider secret about the horse race, we don’t think of this as “irrational” on the part of the agent.

This assumption is typically packaged into the part of a Dutch Book argument where we say the Dutch Book “guarantees a net loss”—if the bookie is using insider knowledge, then it’s not a “guarantee” of a net loss. This “guarantee” needs to be made with respect to all the ways things could empirically turn out.
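To make the “guarantees a net loss” condition concrete, here is a toy sketch (my own illustrative numbers, not from the original discussion): an agent whose credences for A and not-A sum to more than 1 can be sold two tickets whose combined price exceeds any possible payout, so the loss holds in every way things could empirically turn out—no insider knowledge required.

```python
# Toy Dutch book: if an agent's credences for A and not-A sum to more
# than 1, buying both tickets at those prices loses money in every world.
# (Illustrative numbers, not from the original comment.)

def dutch_book_loss(p_a: float, p_not_a: float) -> float:
    """Agent buys a $1-if-A ticket at price p_a and a $1-if-not-A ticket
    at price p_not_a. Exactly one ticket pays out $1, so the agent's net
    outcome is 1 - (p_a + p_not_a) regardless of which world obtains."""
    cost = p_a + p_not_a
    payout = 1.0  # exactly one of A, not-A occurs
    return payout - cost

print(dutch_book_loss(0.6, 0.6))  # incoherent credences: sure loss (about -0.2)
print(dutch_book_loss(0.6, 0.4))  # coherent credences: no guaranteed loss
```

The point of the classical argument is that this net outcome is world-independent; the embedded-agency worry above is about cases where *computing* that the prices are incoherent is itself costly.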

However, this distinction becomes fuzzier when we consider embedded agency, and in particular, computational uncertainty. If the agent has observed the length of two sides of a right triangle, then it is possible to compute the length of the remaining side. Should we say, on the one hand, that there is a Dutch Book against agents who do not correctly compute this third length? Or should we complain that a bookie who has completed the computation has special insider knowledge, which our agent may lack due to not having completed the computation?

If we bite the “no principled distinction” bullet, we can develop a theory where we learn to avoid making logical mistakes (such as classical Dutch Books, or the triangle example) in exactly the same manner that we learn to avoid empirical mistakes (such as learning that the sun rises every morning). Instead of getting a guarantee that we never give in to a Dutch Book, we get a bounded-violations guarantee; we can only lose so much money that way before we wise up.

• 6 Sep 2024 14:43 UTC
LW: 2 AF: 2
0
AF
in reply to: Tao Lin’s comment

Yeah, in hindsight I realize that my iterated mugging scenario only communicates the intuition to people who already have it. The Lizard World example seems more motivating.

• 6 Sep 2024 14:40 UTC
LW: 2 AF: 2
0
AF
in reply to: Bunthut’s comment on: FixDT

You can do exploration, but the problem is that (unless you explore into non-fixed-point regions, violating epistemic constraints) your exploration can never confirm the existence of a fixed point which you didn’t previously believe in. However, I agree that the situation is analogous to the handstand example, assuming it’s true that you’d never try the handstand. My sense is that the difficulties I describe here are “just the way it is” and only count against FixDT in the sense that we’d be happier with FixDT if somehow these difficulties weren’t present.

I think your idea for how to find repulsive fixed-points could work if there’s a trader who can guess the location of the repulsive point exactly rather than approximately, and has the wealth to precisely enforce that belief on the market. However, the wealth of that trader will act like a martingale; there’s no reliable profit to be made (even on average) by enforcing this fixed point. Therefore, such a trader will go broke eventually. On the other hand, attractive fixed points allow profit to be made (on average) by approximately guessing their locations.

Repulsive points effectively “drain willpower”.
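The attractive/repulsive distinction can be illustrated with a generic numerical iteration (this is only an analogy sketch, not the trader dynamics themselves): an approximate guess near an attractive fixed point gets pulled in, while any inexact guess near a repulsive one drifts away, so only an exact guess survives.

```python
# Iterating f(x) = x**2: x = 0 is an attractive fixed point and x = 1 is
# repulsive. A nearby guess converges to 0 but is pushed away from 1,
# so only an *exact* guess stays at the repulsive point -- an analogue
# of why a trader must pin a repulsive point exactly to enforce it.

def iterate(f, x, n):
    for _ in range(n):
        x = f(x)
    return x

f = lambda x: x * x

print(iterate(f, 0.1, 20))    # ~0.0: pulled into the attractive point at 0
print(iterate(f, 0.999, 20))  # drifts away from the repulsive point at 1
print(iterate(f, 1.0, 20))    # stays at 1.0 only because the guess is exact
```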

• (Some might believe that most people are mediocre, and their music would end up bland and similar to each other’s — but if no one else is in the feedback loop, how could that even happen? And if it did, what would this neutral, universally human music look like? A folk song from Borneo? Classical music? Hip-hop?)

This doesn’t seem like the default to me. The default is AI companies that do centralized work trying to make a good product. All the users are in the feedback loop. Some customization to individual users is valuable, but the prior that’s been developed through interaction with lots of people is going to do a ton of the work. Your intuition that music becomes super-individualized seems based on an intuition that the AI customization “grows with you”, going deep down a rabbit hole over years. This doesn’t seem like the sort of thing the companies are incentivized to create. The experience for new users is much more important for adoption.

• I think so, yes, but I want to note that my position is consistent with nosy-neighbor hypotheses not making sense. A big part of my point is that there’s a lot of nonsense in a broad prior. I think it’s hard to rule out the nonsense without learning. If someone thought nosy neighbors always ‘make sense’, it could be an argument against my whole position. (Because that person might be just fine with UDT, thinking that my nosy-neighbor ‘problems’ are just counterfactual muggings.)

Here’s an argument that nosy neighbors can make sense.

For values, as I mentioned, a nosy-neighbor hypothesis is a value system which cares about what happens in many different universes, not just the ‘actual’ universe. For example, a utility function which assigns some value to statements of mathematics.

For probability, a nosy-neighbor is like the Lizard World hypothesis mentioned in the post: it’s a world where what happens there depends a lot on what happens in other worlds.

I think what you wrote about staples vs paperclips nosy-neighbors is basically right, but maybe if we rephrase it it can ‘make more sense’?: “I (actual me) value paperclips being produced in the counterfactual(-from-my-perspective) world where I (counterfactual me) don’t value paperclips.”

Anyway, whether or not it makes intuitive sense, it’s mathematically fine. The idea is that a world will contain facts that are a good lens into alternative worlds (such as facts of Peano Arithmetic), which utility hypotheses /​ probabilistic hypotheses can care about. So although a hypothesis is only mathematically defined as a function of worlds where it holds, it “sneakily” depends on stuff that goes on in other worlds as well.

• I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can’t do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it’s misleading to imply we can surmount it. It’s great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I’ve felt this was obscured in many relevant conversations.

I don’t get your disagreement. If your view is that you can’t eat one cake and keep it too, and my view is that you can eat some cakes and keep other cakes, isn’t the obvious conclusion that these two views are compatible?

I would also argue that you can slice up a cake and keep some slices but eat others (this corresponds to mixed strategies), but this feels like splitting hairs rather than getting at some big important thing. My view is mainly about iterated situations (more than one cake).

Maybe your disagreement would be better stated in a way that didn’t lean on the cake analogy?

My point is that the theoretical work you are shooting for is so general that it’s closer to “what sorts of AI designs (priors and decision theories) should always be implemented”, rather than “what sorts of AI designs should humans in particular, in this particular environment, implement”.
And I think we won’t gain insights on the former, because there are no general solutions, due to fundamental trade-offs (“no-free-lunches”).
I think we could gain many insights on the latter, but that the methods better fit for that are less formal/theoretical and way messier/“eye-balling”/iterating.

Well, one way to continue this debate would be to discuss the concrete promising-ness of the pseudo-formalisms discussed in the post. I think there are some promising-seeming directions.

Another way to continue the debate would be to discuss theoretically whether theoretical work can be useful.

It sort of seems like your point is that theoretical work always needs to be predicated on simplifying assumptions. I agree with this, but I don’t think it makes theoretical work useless. My belief is that we should continue working to make the assumptions more and more realistic, but the ‘essential picture’ is often preserved under this operation. (EG, Newtonian gravity and general relativity make most of the same predictions in practice. Kolmogorov axioms vindicated a lot of earlier work on probability theory.)

• This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can’t differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on “our own beliefs” or “which beliefs I endorse”? After all, that’s just one more part of reality (without a clear boundary separating it).

I’m comfortable explicitly assuming this isn’t the case for nice clean decision-theoretic results, so long as it looks like the resulting decision theory also handles this possibility ‘somewhat sanely’.

It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can’t know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: “I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I’m in the Infinite Counterlogical Mugging… then I will just eventually change my prior because I noticed I’m in the bad world!”. But then again, why would we think this update is safe? That’s just not being updateless, and losing out on the strategic gains from not updating.

My thinking is more that we should accept the offer finitely many times or some fraction of the times, so that we reap some of the gains from updatelessness while also ‘not sacrificing too much’ in particular branches.

That is: in this case at least it seems like there’s concrete reason to believe we can have some cake and eat some too.

Since a solution doesn’t exist in full generality, I think we should pivot to more concrete work related to the “content” (our particular human priors and our particular environment) instead of the “formalism”.

This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I’m using in my arguments. I’m more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I’m interested in doing work to help navigate is the tiling problem.

• You’re right, I was overstating there. I don’t think it’s probable that everything cancels out, but a more realistic statement might be something like “if UDT starts with a broad prior which wasn’t designed to address this concern, there will probably be many situations where its actions are more influenced by alternative possibilities (delusional, from our perspective) than by what it knows about the branch that it is in”.

• Let’s frame it in terms of value learning.

Naive position: UDT can’t be combined with value learning, since UDT doesn’t learn. If we’re not sure whether puppies or rainbows are what we intrinsically value, but rainbows are easier to manufacture, then the superintelligent UDT will tile the universe with rainbows instead of puppies because that has higher expectation according to the prior, regardless of evidence it encounters suggesting that puppies are what’s more valuable.

Cousin_it’s reply: There’s puppy-world and rainbow-world. In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility.

The UDT agent gets to observe which universe it is in, although it has a 50-50 prior on the two. There are four policies:

• Always make puppies: this has a 50% chance of a utility of 90, and otherwise yields zero.

• EV: 45

• Always make rainbows: 50% chance of utility 100, otherwise zero.

• EV: 50

• Make puppies in rainbow world; make rainbows in puppy world.

• EV: 0

• Make puppies in puppy world, make rainbows in rainbow world.

• EV: 95

The highest EV is to do the obvious value-learning thing; so, there’s no problem.
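The four expected values can be checked directly (a transcription of the numbers above into a few lines):

```python
# EVs of the four policies under a 50-50 prior over puppy-world (puppies
# worth 90 there) and rainbow-world (rainbows worth 100 there); making
# the wrong product in a world is worth 0 in this non-nosy case.
p = 0.5
U = {('puppy', 'puppies'): 90, ('rainbow', 'rainbows'): 100,
     ('puppy', 'rainbows'): 0, ('rainbow', 'puppies'): 0}

def ev(policy):  # policy maps the observed world -> product made there
    return sum(p * U[(w, policy[w])] for w in ('puppy', 'rainbow'))

print(ev({'puppy': 'puppies', 'rainbow': 'puppies'}))   # 45.0
print(ev({'puppy': 'rainbows', 'rainbow': 'rainbows'})) # 50.0
print(ev({'puppy': 'rainbows', 'rainbow': 'puppies'}))  # 0.0
print(ev({'puppy': 'puppies', 'rainbow': 'rainbows'}))  # 95.0
```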

Fixing the naive position: Some hypotheses will “play nice” like the example above, and updateless value learning will work fine.

However, there are some versions of “valuing puppies” and “valuing rainbows” which value puppies/​rainbows regardless of which universe the puppies/​rainbows are in. This only requires that there’s some sort of embedding of counterfactual information into the sigma-algebra which the utility functions are predicated on. For example, if the agent has beliefs about PA, these utility functions could check for the number of puppies/​rainbows in arbitrary computations. This mostly won’t matter, because the agent doesn’t have any control over arbitrary computations; but some of the computations contemplated in Rainbow Universe will be good models of Puppy Universe. Such a rainbow-value-hypothesis will value policies which create rainbows over puppies regardless of which branch they do it in.

These utility functions are called “nosy neighbors” because they care about what happens in other realities, not just their own.

Suppose the puppy hypothesis and the rainbow hypothesis are both nosy neighbors. I’ll assume they’re nosy enough that they value puppies/​rainbows in other universes exactly as much as in their own. There are four policies:

• Always make puppies: 50% chance of being worthless, if the rainbow hypothesis is true. 50% chance of getting 90 for making puppies in puppy-universe, plus 90 more for making puppies in rainbow-universe.

• EV: 90

• Always make rainbows: 50% worthless, 50% worth 100 + 100.

• EV: 100

• Make puppies in rainbow universe, rainbows in puppy universe: 50% a value of 90, 50% a value of 100.

• EV: 95

• Puppies in puppy universe, rainbows in rainbow universe: 50% a value of 90, 50% a value of 100.

• EV: 95

In the presence of nosy neighbors, the naive position is vindicated: UDT doesn’t do “value learning”.
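The nosy-neighbor arithmetic can be reproduced the same way (again a direct transcription: whichever hypothesis is true scores the policy’s output in *both* universes):

```python
# Nosy-neighbor version: the true hypothesis values its preferred product
# wherever it is produced (90 per universe for puppies, 100 for rainbows).
p = 0.5
value = {'puppy': ('puppies', 90), 'rainbow': ('rainbows', 100)}

def ev(policy):  # policy maps world -> product made there
    total = 0.0
    for hyp, (liked, worth) in value.items():    # hypothesis that's true
        score = sum(worth for w in ('puppy', 'rainbow')
                    if policy[w] == liked)       # counts BOTH universes
        total += p * score
    return total

print(ev({'puppy': 'puppies', 'rainbow': 'puppies'}))   # 90.0
print(ev({'puppy': 'rainbows', 'rainbow': 'rainbows'})) # 100.0
print(ev({'puppy': 'rainbows', 'rainbow': 'puppies'}))  # 95.0
print(ev({'puppy': 'puppies', 'rainbow': 'rainbows'}))  # 95.0
```

Note that “always make rainbows” now has the highest EV (100), which is exactly why the updateless agent ignores its observation here.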

The argument is similar for the case of ‘learning the correct prior’. The problem is that if we start with a broad prior over possible priors, then there can be nosy-neighbor hypotheses which spoil the learning. These are hard to rule out, because it is hard to rule out simulations of other possible worlds.

• Here’s a different way of framing it: if we don’t make this assumption, is there some useful generalization of UDT which emerges? Or, having not made this assumption, are we stuck in a quagmire where we can’t really say anything useful?

I think about these sorts of ‘technical assumptions’ needed for nice DT results as “sanity checks”:

• I think we need to make several significant assumptions like this in order to get nice theoretical DT results.

• These nice DT results won’t precisely apply to the real world; however, they do show that the DT being analyzed at least behaves sanely when it is in these ‘easier’ cases.

• So it seems like the natural thing to do is prove tiling results, learning results, etc under the necessary technical assumptions, with some concern for how restrictive the assumptions are (broader sanity checks being better), and then also, check whether behavior is “at least somewhat reasonable” in other cases.

So if UDT fails to tile when we remove these assumptions, but, at least appears to choose its successor in a reasonable way given the situation, this would count as a success.

Better, of course, if we can find the more general DT which tiles under weaker assumptions. I do think it’s quite plausible that UDT needs to be generalized; I just expect my generalization of UDT will still need to make an assumption which rules out your counterexample to UDT.

• Yeah, I’m kind of connecting a lot of threads here in a messy way. This post definitely could be better-organized.

• I have a sense that open-minded UDT should relate to objective probabilities in a frequentist sense. For example, in decision problems involving Omega, it’s particularly compelling if we stipulate that Omega has a long history of offering similar choices to mortals and a track record of being honest and predicting correctly. This is in some sense the most compelling way we can come to know what decision problem we face; and, it relies on framing our decision as part of a sequence. Counterlogical mugging on a digit of pi is similarly compelling if we imagine a very large digit, but becomes less compelling as we imagine digits closer to the beginning of pi. I want to suggest a learning principle with frequentist-flavored guarantees (similar to LIDT or BRIA but less updateful).

• On the other hand, the bargaining framing does not have anything to do with iteration. The bargaining idea in some sense feels much more promising, since I can already offer a toy analysis supporting my intuition that iterated counterfactual mugging with the same coin is less tempting than iterated muggings with different coins.

• 13 Aug 2024 15:57 UTC
LW: 4 AF: 4
0
AF
in reply to: Wei Dai’s comment

for example a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life.

Yeah, this is an important point, but I think UDT has it significantly worse. For one thing, UDT has the problem I mention on top of the problem you mention. But more importantly, I think the problem I mention is less tractable than the problem you mention.

EDIT: I’ve edited the essay to name my problem as “lizard worlds” (lizards reward updateful policies). So I’ll call the issue you raise the heaven/​hell problem, and the issue I raise the lizard world problem.

For updateful DT, we can at least say: yes, a broad prior will include heaven/​hell hypotheses which dramatically impact policy choice. But updateful priors have tools to address this problem:

• The prior includes a likelihood function for heaven/​hell hypotheses, which specifies how the probability of such hypotheses gets adjusted in light of evidence.

• We mostly[1] trust simplicity priors to either make sensible likelihood functions, which will only lean towards heaven/​hell hypotheses when there’s good reason, or else penalize heaven/​hell hypotheses a priori for having a higher description complexity.

• We can also directly provide feedback about the value estimates to teach an updateful DT to have sensible expectations.[2]

None of these methods help UDT address the lizard world problem:

• The likelihood functions don’t matter; only the prior probability matters.

• Simplicity priors aren’t especially going to rule out these alternative worlds.[3]

• Direct feedback we give about expected values doesn’t reduce the prior weight of these problematic hypotheses.

So I think there are definitely problems in this area, but I’m not sure it has much to do with “learning” as opposed to “philosophy” and the examples /​ thought experiments you give don’t seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)

Yeah, I expect the thought experiment I start with is only going to be compelling to people who sort of already agree with me.

I do agree that “philosophy” problems are very close to this stuff, and it would be good to articulate in those terms.

1. ^

Modulo inner-optimizer concerns like simulation attacks.

2. ^

I’m imagining something like Bayesian RL, or Bayesian approval-directed agents, but perhaps with the twist that feedback is only sometimes given.

3. ^

This is somewhat nuanced/​questionable. The lizards might be providing rewards/​punishments for behavior in lots of worlds, not just Earth, so that this hypothesis doesn’t have to point to Earth specifically. However, if utilities are bounded, then this arguably weakens the rewards/​punishments relevant to Earth, which is similarly reassuring to giving this hypothesis less prior weight.

Perhaps it doesn’t have to weaken the rewards/​punishments relevant to Earth, though, if lizards reward only those who always reject counterfactual muggings in all other worlds (not including the Lizard’s offer, of course, which is arguably a counterfactual mugging itself).

Also, I think there are more complications related to inner optimizers.

• I probably need to clarify the statement of the assumption.

The idea isn’t that, for each n, it eventually takes at least one action that’s in line with the prior updated on the first n observations, but then might do some other stuff. The idea is that for each n, there is a time after which all decisions will be consistent with UDT using the prior updated on the first n observations.

UDT’s recommendations will often coincide with more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.

• I haven’t analyzed your argument yet, but: tiling arguments will always depend on assumptions. Really, it’s a question of when something tiles, not whether. So, if you’ve got a counterexample to tiling, a natural next question is what assumptions we could make to rule it out, and how unfortunate it is to need those assumptions.

I might not have understood adequately, yet, but it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action. This is a big assumption, but at the same time, the sort of assumption I would expect to need in order to justify UDT. As Eliezer put it, tiling results need to assume that the environment only cares about what policy we implement, not our “rituals of cognition” that compute those policies. An earlier act of self-modification vs a later decision is a difference in “ritual of cognition” as opposed to a difference in the policy, to me.

So, I need to understand the argument better, but it seems to me like this kind of counterexample doesn’t significantly wound the spirit of UDT.

# In Defense of Open-Minded UDT

12 Aug 2024 18:27 UTC
72 points