Yes, the relevant probability is conditional on having been offered the deal. I still have no idea how this could possibly help.
Let’s say that your utility is logarithmic in X, the years of fun. What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows. But the number of bits in the program “offer n^^^n years of fun, and grant this iff subject pays up” grows only linearly with n, while log(n^^^n) grows, well, superexponentially. Normalizing by the total weight of all programs that offer the deal makes the probability go further up, not down.
What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows
No.
In order to avoid the mugging you need just that P(more than X years of fun iff I pay up | deal for X years offered) (log(X) - MarginalU($5)) < P(Y years of fun iff I pay up | deal for X years offered) (log(Y) - MarginalU($5)) where Y is the number of years of fun you will get anyway if the mugger gives you nothing (assuming, for simplicity, that the mugger either gives you X or more years of fun or gives you nothing, but the argument can be generalized to scenarios where the mugger may give you Z < X years of fun)
the number of bits in the program “offer n^^^n years of fun, and grant this iff subject pays up” grows only linearly with n
That seems incorrect.
I think that the source of your confusion stems from the fact that the program length of the (shortest) program “write n^^^n on the tape” is K(n^^^n) ⇐ log_2(n) . But the length of programs consistent with “offer n^^^n years of fun, and grant this iff subject pays up” must grow faster than K(n^^^n).
Proof by reductio ad absurdum: Suppose that length of the shortest program A(n^^^n) = “offer n^^^n years of fun, and grant this iff subject pays up” grows with K(n^^^n). Then, up to some additive constant, Len(A(n^^^n)) is the length of the concatenation of two shortest programs: W(n^^^n) = “write n^^^n on the tape” and B = “read X from the tape, offer X years of fun, and grant this iff subject pays up” where Len(W(n^^^n)) = K(n^^^n) and Len(B) is a constant independent of n^^^n.
Consider the posterior: post = p(grant this iff subject pays up | offer n^^^n years of fun) = p(“offer n^^^n years of fun, and grant this iff subject pays up”) / p(“offer n^^^n years of fun”)
Define the shortest program O(n^^^n) = “offer n^^^n years of fun”. Again, if Len(O(n^^^n)) grows with K(n^^^n), then up to some additive constant Len(O(n^^^n)) is the length of the concatenation of W(n^^^n) and program C = “read X from the tape, offer X years of fun” which doesn’t depend on n^^^n.
That is, Len(A(n^^^n)) = K(n^^^n) + Len(B) Len(O(n^^^n)) = K(n^^^n) + Len(C)
As you can see this posterior doesn’t depend on n^^^n or even on n, which is clearly inconsistent with the notion (formalized in a theorem by Solomonoff) that Solomonoff induction learns an accurate model of the environment. Therefore, the assumption that there is a shortest program consistent with “offer n^^^n years of fun, and grant this iff subject pays up” whose length grows with K(n^^^n) must be incorrect.
What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows
No.
Okay, you’re saying that as X goes up, the probability of getting X years of fun even if you don’t pay up also goes up, because any program that offers a deal of X years has to include a specification of the number X? So the expected utility of not paying up doesn’t stay constant as we vary X, but increases with X (at least asymptotically, for very large X)?
Well, you’re right on that, and that’s in fact a point I hadn’t considered, thanks. But I was replying to this:
In particular, if Solomonoff induction assigns to models where starting from now you get X years of fun a total probabilty p(X) that asymptotically decreases fast enough with X so that U(X) * p(X) also decreases, it will not be subject to Pascal’s mugging.
If by this you mean something else than that P(more than X years of fun iff I pay up | deal for X years offered) log(X) → 0, then I don’t understand what you mean by p(X). If you mean e.g. p(X) = P(deal for X years offered), then why would p(X) U(X) → 0 avoid Pascal’s mugging? (Not that it does go to zero.)
As you can see this posterior doesn’t depend on n^^^n or even on n, which is clearly inconsistent with the notion (formalized in a theorem by Solomonoff) that Solomonoff induction learns an accurate model of the environment.
That theorem says, roughly (actually I’m just giving a particular consequence), that given a particular world program, after seeing a certain finite number of bits produced by this program, Solomonoff induction will predict all future bits correctly. The particular number of bits needed initially depends, of course, on the program. (More generally: Given a computable probability distribution over world programs, with probability one there is some number of bits after which Solomonoff induction’s conditional distribution over subsequent bits will equal the “true” conditional distribution.) Of course, Solomonoff’s theorem only allows the agent to observe the environment, not interact with it, but that doesn’t seem to be the issue here (we can consider Hutter’s variants instead).
You are not keeping the world program fixed, and for each world program considered, you only talk about what happens after the agent has a certain number of bits (which is fixed given the world program, i.e. you don’t let it tend to infinity), so the theorem does not apply.
What antecedent do you want to deny in your argument, anyway? If your argument worked, it would still work if we replaced “grant X years of fun” by “write the symbol FUN on the tape X times”. But there is certainly a program B that reads X from the internal tape, writes X to the output tape, reads a symbol from the input tape, and writes FUN X times iff the input symbol is ACCEPT, and similarly for C and W(n^^^n).
Yes, the relevant probability is conditional on having been offered the deal. I still have no idea how this could possibly help.
Let’s say that your utility is logarithmic in X, the years of fun. What you need to avoid the mugging is that P(more than X years of fun iff I pay up | deal for X years offered) goes to zero faster than log(X) grows. But the number of bits in the program “offer n^^^n years of fun, and grant this iff subject pays up” grows only linearly with n, while log(n^^^n) grows, well, superexponentially. Normalizing by the total weight of all programs that offer the deal makes the probability go further up, not down.
No.
In order to avoid the mugging you need just that
P(more than X years of fun iff I pay up | deal for X years offered) (log(X) - MarginalU($5)) < P(Y years of fun iff I pay up | deal for X years offered) (log(Y) - MarginalU($5))
where Y is the number of years of fun you will get anyway if the mugger gives you nothing (assuming, for simplicity, that the mugger either gives you X or more years of fun or gives you nothing, but the argument can be generalized to scenarios where the mugger may give you Z < X years of fun)
That seems incorrect.
I think that the source of your confusion stems from the fact that the program length of the (shortest) program “write n^^^n on the tape” is K(n^^^n) ⇐ log_2(n) .
But the length of programs consistent with “offer n^^^n years of fun, and grant this iff subject pays up” must grow faster than K(n^^^n).
Proof by reductio ad absurdum:
Suppose that length of the shortest program A(n^^^n) = “offer n^^^n years of fun, and grant this iff subject pays up” grows with K(n^^^n).
Then, up to some additive constant, Len(A(n^^^n)) is the length of the concatenation of two shortest programs:
W(n^^^n) = “write n^^^n on the tape” and
B = “read X from the tape, offer X years of fun, and grant this iff subject pays up”
where Len(W(n^^^n)) = K(n^^^n) and Len(B) is a constant independent of n^^^n.
Consider the posterior:
post = p(grant this iff subject pays up | offer n^^^n years of fun) = p(“offer n^^^n years of fun, and grant this iff subject pays up”) / p(“offer n^^^n years of fun”)
Define the shortest program O(n^^^n) = “offer n^^^n years of fun”.
Again, if Len(O(n^^^n)) grows with K(n^^^n), then up to some additive constant Len(O(n^^^n)) is the length of the concatenation of W(n^^^n) and program C = “read X from the tape, offer X years of fun” which doesn’t depend on n^^^n.
That is,
Len(A(n^^^n)) = K(n^^^n) + Len(B)
Len(O(n^^^n)) = K(n^^^n) + Len(C)
Therefore
post =~= 2^-(Len(A(n^^^n)) - Len(O(n^^^n))) =
= 2^-((K(n^^^n) + Len(B)) - (K(n^^^n) + Len(C))) =
= 2^-(Len(B) - Len(C))
As you can see this posterior doesn’t depend on n^^^n or even on n, which is clearly inconsistent with the notion (formalized in a theorem by Solomonoff) that Solomonoff induction learns an accurate model of the environment.
Therefore, the assumption that there is a shortest program consistent with “offer n^^^n years of fun, and grant this iff subject pays up” whose length grows with K(n^^^n) must be incorrect.
Okay, you’re saying that as X goes up, the probability of getting X years of fun even if you don’t pay up also goes up, because any program that offers a deal of X years has to include a specification of the number X? So the expected utility of not paying up doesn’t stay constant as we vary X, but increases with X (at least asymptotically, for very large X)?
Well, you’re right on that, and that’s in fact a point I hadn’t considered, thanks. But I was replying to this:
If by this you mean something else than that P(more than X years of fun iff I pay up | deal for X years offered) log(X) → 0, then I don’t understand what you mean by p(X). If you mean e.g. p(X) = P(deal for X years offered), then why would p(X) U(X) → 0 avoid Pascal’s mugging? (Not that it does go to zero.)
That theorem says, roughly (actually I’m just giving a particular consequence), that given a particular world program, after seeing a certain finite number of bits produced by this program, Solomonoff induction will predict all future bits correctly. The particular number of bits needed initially depends, of course, on the program. (More generally: Given a computable probability distribution over world programs, with probability one there is some number of bits after which Solomonoff induction’s conditional distribution over subsequent bits will equal the “true” conditional distribution.) Of course, Solomonoff’s theorem only allows the agent to observe the environment, not interact with it, but that doesn’t seem to be the issue here (we can consider Hutter’s variants instead).
You are not keeping the world program fixed, and for each world program considered, you only talk about what happens after the agent has a certain number of bits (which is fixed given the world program, i.e. you don’t let it tend to infinity), so the theorem does not apply.
What antecedent do you want to deny in your argument, anyway? If your argument worked, it would still work if we replaced “grant X years of fun” by “write the symbol FUN on the tape X times”. But there is certainly a program B that reads X from the internal tape, writes X to the output tape, reads a symbol from the input tape, and writes FUN X times iff the input symbol is ACCEPT, and similarly for C and W(n^^^n).