# johnswentworth

Karma: 891

# Don’t Get Distracted by the Boilerplate

No matter what decision you make, it seems that you will inevitably regret it.

It’s not exactly a puzzle that game theory doesn’t always give pure solutions. This puzzle should still have a solution in mixed strategies, assuming the genie can’t predict quantum random number generators.

Bernstein-Von Mises Theorem. It is indeed not *always* true; the theorem has some conditions. An intuitive example of where it would fail: suppose we are rolling a (possibly weighted) die, but we model it as drawing numbered balls from a box without replacement. If we roll a bunch of sixes, then the model thinks the box now contains fewer sixes, so the chance of a six is lower. If we modeled the weighted die correctly, then a bunch of sixes is evidence that it’s weighted toward six, so the chance of a six should be higher.
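To make the die-versus-box example concrete, here is a minimal sketch. The box contents (10 balls per number) and the symmetric Dirichlet prior on the die's weights are my own assumptions for illustration; nothing here is part of the theorem itself.

```python
from fractions import Fraction

def urn_predictive(counts_in_box, observed, face):
    """Chance of `face` next if rolls are (wrongly) modeled as draws without replacement."""
    remaining = dict(counts_in_box)
    for f in observed:
        remaining[f] -= 1
    return Fraction(remaining[face], sum(remaining.values()))

def dirichlet_predictive(observed, face, faces=6, prior=1):
    """Chance of `face` next under a weighted-die model with a symmetric Dirichlet prior."""
    count = sum(1 for f in observed if f == face)
    return Fraction(count + prior, len(observed) + faces * prior)

box = {f: 10 for f in range(1, 7)}  # hypothetical box: 10 balls of each number
rolls = [6, 6, 6, 6, 6]

p_urn = urn_predictive(box, rolls, 6)   # 5 sixes left out of 55 balls
p_die = dirichlet_predictive(rolls, 6)  # (5 + 1) / (5 + 6)
print(p_urn, p_die)
```

After five sixes, the box model pushes the chance of a six below 1/6 while the weighted-die model pushes it above 1/6, and more data only drives the two further apart.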

Takeaway: Bernstein-Von Mises typically fails in cases where we’re restricting ourselves to a badly inaccurate model. You can look at the exact conditions yourself; as a general rule, we want those conditions to hold. I don’t think it’s a significant issue for my argument.

We could set up the IRL algorithm so that atom-level simulation is outside the space of models it considers. That would break my argument. But a limitation on the model space like that raises other issues, especially for FAI.

Problem is, if there’s a sufficiently large amount of sufficiently precise data, then the physically-correct model’s high accuracy is going to swamp the complexity penalty. That would be a ridiculously huge amount of data for atom-level physics, but there could be other abstraction levels which require less data but are still not what we want (e.g. gene-level reward functions, though that doesn’t fit the driving example very well).
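A rough sketch of the "accuracy swamps the complexity penalty" point, assuming i.i.d. data and a constant average log-likelihood advantage per data point for the more accurate model. Both the penalty and the per-datum gain below are made-up numbers, measured in bits.

```python
import math

def log_posterior_odds(n, penalty_bits, gain_bits_per_datum):
    """Log2 posterior odds (complex model : simple model) after n data points."""
    return n * gain_bits_per_datum - penalty_bits

def crossover_n(penalty_bits, gain_bits_per_datum):
    """Smallest n at which the complex model is no longer disfavored."""
    return math.ceil(penalty_bits / gain_bits_per_datum)

# Hypothetical numbers: a 1000-bit complexity penalty, 1/8 bit of extra fit per datum.
n_star = crossover_n(1000, 0.125)
print(n_star, log_posterior_odds(2 * n_star, 1000, 0.125))
```

The penalty is a fixed cost while the fit advantage grows linearly in the amount of data, so any fixed penalty eventually loses.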

Also, reliance on limited data seems like the sort of thing which is A Bad Idea for friendly AGI purposes.

Wouldn’t the reward function “maximize action for this configuration of atoms” fit the data really well (given unrealistic computational power), but produce unhelpful prescriptions for behavior outside the training set? I’m not seeing how IRL dodges the problem, other than the human manipulating the algorithm (effectively choosing a prior).

# ISO: Name of Problem

Chapter 6 of Cover & Thomas’ “Elements of Information Theory” gives good info on the Kelly criterion, how to derive it, and the relations between prices/probabilities and entropy/rate of return.

For math finance, the class I took back in college used Shreve’s “Stochastic Calculus for Finance II”. I wouldn’t necessarily recommend that just to learn about this, but it’s a good source for Brownian motion, some basic measure theory, and the core theory of asset pricing.

Typically complete markets come up in discussing the fundamental theorem of asset pricing. The first part of the theorem says that any arbitrage-free set of asset prices has a “risk-neutral measure”, i.e. a market-implied set of probabilities. The second part says those probabilities are unique iff the market is complete—if some bets can’t be placed, then there are multiple possible market-implied probabilities. Any book which covers the fundamental theorem should have at least some coverage of complete markets.

Finally, if you’re looking for something more applied, Hull’s “Options, Futures and Other Derivatives” is the usual starting point.

Tl;dr: The problem is that we have no way to bet on joint outcomes. If we add bets on joint outcomes, then the market is complete, we can combine the two outcomes into a single joint outcome, and the Kelly criterion should work. To properly break Kelly, we need bets which resolve at different times.

This hits on a critical point which is fundamental to mathematical finance, but virtually unknown outside of it: complete markets. A “complete market” is one in which we can place any possible bet on whatever random variables are involved.

For instance, if we have a stock market with nothing but a single stock, and we’re betting on the stock’s price in the next time-step, then that’s an incomplete market: we have no way to place a bet which pays $1 if the price ends up within some window, and $0 otherwise. On the other hand, if we add in the full option chain (call options at every possible price), then the market is complete. We can pick a portfolio of options to make any possible bet on the stock’s price next timestep.
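Here is a small sketch of how the option chain produces the window bet, assuming prices live on an integer grid and calls are available at every integer strike. The window 98 to 102 is just an illustrative choice.

```python
def call_payoff(price, strike):
    """Payoff at expiry of one call option."""
    return max(price - strike, 0)

def window_bet_payoff(price, lo, hi):
    """Long call(lo-1), short call(lo), short call(hi), long call(hi+1)."""
    return (call_payoff(price, lo - 1) - call_payoff(price, lo)
            - call_payoff(price, hi) + call_payoff(price, hi + 1))

# Pays $1 if the price lands in [98, 102] and $0 otherwise, on an integer grid.
payoffs = {s: window_bet_payoff(s, 98, 102) for s in range(90, 111)}
print(payoffs)
```

Four options are enough to build the indicator of a price window, and any other bet on next-step price is a weighted sum of such indicators.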

Mathematically, incomplete markets are a mess. You can’t get the bet you actually want, so you’re stuck trying to approximate it with the available bets, and that approximation gets messy.

On the other hand, if you do have complete markets, then you can combine everything into a single random variable and just use the Kelly criterion.
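As a sketch of what "just use the Kelly criterion" looks like in the complete-market case: assume one claim per outcome, each paying $1 if its outcome occurs, with prices `q` that sum to 1 (zero interest). If your own probabilities are `p`, the log-optimal bet puts wealth fraction `p[i]` on outcome `i`, as in the horse-race setup in Cover & Thomas. The specific numbers below are made up.

```python
import math

def expected_log_growth(p, q, b):
    """Expected log wealth growth when fraction b[i] buys claims on outcome i at price q[i]."""
    return sum(pi * math.log(bi / qi) for pi, qi, bi in zip(p, q, b) if pi > 0)

def kelly_allocation(p):
    """Log-optimal allocation in a complete market: bet your own probabilities."""
    return list(p)

p = [0.5, 0.3, 0.2]  # hypothetical beliefs
q = [0.4, 0.4, 0.2]  # hypothetical market-implied probabilities (prices)

best = expected_log_growth(p, q, kelly_allocation(p))
alt = expected_log_growth(p, q, q)  # betting the market's own probabilities
print(best, alt)
```

Betting the market's probabilities earns zero growth, and the Kelly allocation earns the KL divergence between `p` and `q`, which ties into the prices/probabilities/entropy relations mentioned above.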

# Letting Go III: Unilateral or GTFO

From the wording of this post it sounds like you made up the term “Definition-Theorem-Proof”? That would be quite amusing, because that’s the standard term used for this style of textbook.

There is a great schism in mathematics between mathematical physicists/applied mathematicians/intuitionists, and pure mathematicians/Bourbaki. The DTP style is strongly characteristic of the latter, and much-bemoaned by the former.

Also sorry I didn’t actually answer your main question. It’s actually something I’ve thought about quite a bit, but usually in the context of “not enough data to map out this very-high-dimensional space” rather than “not enough data to detect a small change”. The problem is similar in both cases. I’ll probably write a post or two on it at some point, but here’s a very short summary.

Traditional probability theory relies heavily on large-number approximations; mainstream statistics uses convergence as its main criterion of validity. Small data problems, on the other hand, are much better suited to a Bayesian approach. In particular, if we have a few different models (call them $M_i$) and some data $D$, we can compute the posterior $P(M_i \mid D)$ without having to talk about convergence or large numbers at all.

The trade-off is that the math tends to be spectacularly hairy; it usually involves high-dimensional integrals. Traditional approaches approximate those integrals for large numbers of data points, but the whole point here is that we don’t have enough data for the approximations to be valid.
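For a toy version of the small-data model comparison, here is a sketch with two fixed-parameter coin models, a uniform prior over them, and five flips. All of these are my own illustrative choices. Because neither model has free parameters, no integral shows up; the hairy part appears once each model carries parameters that have to be integrated out.

```python
from fractions import Fraction

def likelihood(p_heads, heads, tails):
    """Probability of a specific flip sequence under a fixed-bias coin model."""
    return p_heads ** heads * (1 - p_heads) ** tails

def posterior(models, prior, heads, tails):
    """Exact posterior over a finite set of models via Bayes' rule."""
    joint = {name: prior[name] * likelihood(p, heads, tails) for name, p in models.items()}
    evidence = sum(joint.values())
    return {name: value / evidence for name, value in joint.items()}

models = {"fair": Fraction(1, 2), "biased": Fraction(7, 10)}  # hypothetical models
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

post = posterior(models, prior, heads=4, tails=1)
print(post)
```

With only five data points the posterior is still perfectly well-defined; it just stays fairly spread out, which is the honest answer.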

Under common law, lawmakers give up control over the details of the law. Details of interpretation and application, all the little edge cases, precise definitions… that’s decided mainly by courts.

The closest analogue among the other examples is declarative vs imperative programming: think of lawmakers as the programmers, and courts as the compiler. Just as programmers give up control over the details of their program’s execution to the compiler, lawmakers give up control over the details of the law to the courts.

# Letting Go II: Understanding is Key

And yet, 10% changes matter. Stack 10 of them and you’ve doubled whatever you were trying to improve.

Counterargument: the 80/20 rule suggests that, most of the time in practice, either the 10% changes won’t actually stack, or one of them will contribute most of the value on its own. It’s really hard to find 10 changes of 10% each which actually stack.
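For reference, the arithmetic behind the stacking claim, as a quick sketch. It assumes the ideal case where every change is a clean 10% and they either compound multiplicatively or simply add.

```python
import math

compounded = 1.10 ** 10       # ten 10% changes that multiply
additive = 1 + 10 * 0.10      # ten 10% changes that merely add

# Number of compounding 10% changes needed to reach a doubling.
changes_to_double = math.ceil(math.log(2) / math.log(1.10))
print(compounded, additive, changes_to_double)
```

Ten changes that only add give exactly the doubling, and ten that compound give somewhat more; the counterargument is about whether you can find that many which stack at all.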

# The Power of Letting Go Part I: Examples

Could you give a few examples of what you mean by working with the duals, both in the maze context and otherwise? It brings at least one good maze strategy to mind for me, but the word is used in multiple ways, so I’m curious whether we’re thinking of similar things.

No idea. Just off the top of my head, the exponential growth in volume would be an issue for mazes in higher dimensions. Four would probably still be workable, though.

Thanks! Glad it made sense, I wasn’t sure it would.

Yeah, on reflection that’s right, I didn’t think it through properly.

I think this is related to a general class of mistakes, so I just wrote up a post on it.

This case is a bit different from what that post discusses, in that you’re not focused on a non-critical assumption, but on a non-critical method. We can use VNM rationality for decision-making just fine without computing full utilities for every decision; we just need to compute enough to be confident that we’re making the higher-utility choice. For that purpose we can use tricks like e.g. changing the unit of valuation on the fly, making approximations (as long as we keep track of the error bars), etc.
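One way to picture "compute enough to be confident": keep lower and upper bounds on each option's utility and stop as soon as the intervals separate. This is only a sketch; the `(low, high)` representation and the function name are my own, not anything from VNM theory.

```python
def choose(option_a, option_b):
    """Each option is a (low, high) utility bound in a shared unit.

    Returns "a" or "b" when the bounds already settle the choice,
    or None when the intervals overlap and the estimates need refining.
    """
    a_lo, a_hi = option_a
    b_lo, b_hi = option_b
    if a_lo > b_hi:
        return "a"
    if b_lo > a_hi:
        return "b"
    return None

print(choose((5, 8), (1, 4)))  # bounds are disjoint, so no refinement is needed
print(choose((1, 6), (5, 8)))  # overlapping bounds: tighten the estimates first
```

The decision only depends on which interval sits higher, so crude approximations are fine whenever the error bars don't overlap.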