# Ege Erdil

Karma: 157

You can find me on Metaculus at https://www.metaculus.com/accounts/profile/116023/.

• Ah, I see. I missed that part of the post for some reason.

In this setup the update you’re doing is fine, but I think measuring the evidence for the hypothesis in terms of “bits” can still mislead people here. You’ve tuned your example so that the likelihood ratio is equal to two and there are only two possible outcomes, while in general there’s no reason for those two values to be equal.

• This is a rather pedantic remark that doesn’t have much relevance to the primary content of the post (EDIT: it’s also based on a misunderstanding of what the post is actually doing—I missed that an explicit prior is specified which invalidates the concern raised here), but

> If such a coin is flipped ten times by someone who doesn’t make literally false statements, who then reports that the 4th, 6th, and 9th flips came up Heads, then the update to our beliefs about the coin depends on what algorithm the not-lying[1] reporter used to decide to report those flips in particular. If they always report the 4th, 6th, and 9th flips independently of the flip outcomes—if there’s no evidential entanglement between the flip outcomes and the choice of which flips get reported—then reported flip-outcomes can be treated the same as flips you observed yourself: three Headses is 3 * 1 = 3 bits of evidence in favor of the hypothesis that the coin is Heads-biased. (So if we were initially 50:50 on the question of which way the coin is biased, our posterior odds after collecting 3 bits of evidence for a Heads-biased coin would be 2³:1 = 8:1, or a probability of 8/(1 + 8) ≈ 0.89 that the coin is Heads-biased.)

is not how Bayesian updating would work in this setting. As I’ve explained in my post about Laplace’s rule of succession, if you start with a uniform prior over [0, 1] for the probability of the coin coming up heads and you observe a sequence of n heads in succession, you would update to a posterior with density proportional to p^n, which has mean (n + 1)/(n + 2). For n = 3 that would be 4/5 = 0.8 rather than 8/9 ≈ 0.89.

I haven’t formalized this, but one problem with the entropy approach here is that the distinct bits of information you get about the coin are actually not independent, so they are worth less than one bit each. They aren’t independent because if you know some of them came up heads, your prior that the other ones also came up heads will be higher, since you’ll infer that the coin is likely to have been biased in the direction of coming up heads.

To not leave this totally up in the air: if you think of the k-th heads as having an information content of

log₂((k + 1)/k)

bits, then the total information you get from n heads is something like

log₂(2/1) + log₂(3/2) + ⋯ + log₂((n + 1)/n) = log₂(n + 1)

bits instead of n bits. Neglecting this effect leads you to make much more extreme inferences than would be justified by Bayes’ rule.
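To make the contrast concrete, here is a small Python sketch (my own, not from the original thread; the function names are mine) comparing the naive "independent bits" update with the update under a uniform prior on the heads probability:

```python
from math import log2

def naive_posterior(n):
    # treat each reported Heads as an independent bit: odds of 2^n : 1
    odds = 2 ** n
    return odds / (1 + odds)

def laplace_posterior_mean(n):
    # uniform prior + n Heads, 0 Tails -> posterior density ∝ p^n,
    # i.e. Beta(n+1, 1), with mean (n+1)/(n+2)
    return (n + 1) / (n + 2)

def total_bits(n):
    # the k-th Heads is worth log2((k+1)/k) bits; the sum telescopes to log2(n+1)
    return sum(log2((k + 1) / k) for k in range(1, n + 1))

n = 3
print(naive_posterior(n))         # 8/9 ≈ 0.889
print(laplace_posterior_mean(n))  # 4/5 = 0.8
print(total_bits(n))              # log2(4) = 2 bits, not 3
```

For n = 3 the naive update gives ≈ 0.89 while the uniform-prior update gives 0.8, because the three reports are not independent pieces of evidence.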

• Yeah, Neyman’s proof of Laplace’s version of the rule of succession is nice. The reason I think this kind of approach can’t give the full strength of the conjugate prior approach is that I think there’s a kind of “irreducible complexity” to computing the normalization integral ∫₀¹ p^a (1 − p)^b dp for non-integer values of a and b. The only easy proof I know goes through the connection to the gamma function. If you stick only to integer values there are easier ways of doing the computation, and the linearity of expectation argument given by Neyman is one way to do it.

One concrete example of the rule being used in practice I can think of right now is this comment by SimonM on Metaculus.

# Laplace’s rule of succession

23 Nov 2021 15:48 UTC
36 points

1. What matters is that it’s something you can invest in. Choosing the S&P 500 is not really that important in particular. There doesn’t have to be a single company whose stock is perfectly correlated with the S&P 500 (though nowadays we have ETFs which more or less serve this purpose) - you can simply create your own value-weighted stock index and rebalance it on a daily or weekly basis to adjust for the changing weights over time, and nothing will change about the main arguments. This is actually what the authors of The Rate of Return on Everything do in the paper, since we don’t really have good value-weighted benchmark indices for stocks going back to 1870.

The general point (which I hint at but don’t make in the post) is that we persistently see high Sharpe ratios in asset markets. The article I cite at the start of the post also has data on real estate returns, for example, which exhibit an even stronger puzzle because they are comparable to stock returns in real terms but have half the volatility.

2. I don’t know the answer to your exact question, but a lot of governments have bonds which are quite risky and so this comparison wouldn’t be appropriate for them. If you think of the real yield of bonds as consisting of a time preference rate plus some risk premium (which is not a perfect model but not too far off), the rate of return on any one country’s bonds puts an upper bound on the risk-free rate of return. Therefore we don’t need to think about investing in countries whose bonds are risky assets in order to put a lower bound on the size of the equity premium relative to a risk-free benchmark.

3. This only has a negligible effect because the returns are inflation-adjusted and over long time horizons any real exchange rate deviation from the purchasing power parity benchmark is going to be small relative to the size of the returns we’re talking about. Phrased another way: inflation-adjusted stock prices are not stationary whereas real exchange rates are stationary, so as long as the time horizon is long enough you can ignore exchange rate effects so long as you perform inflation adjustment.

4. This is an interesting question and I don’t know the answer to it. Partly this is because we don’t really understand where the equity premium is coming from to begin with, so thinking about how some hypothetical change in the human condition would alter its size is not trivial. I think different models of the equity premium actually make different predictions about what would happen in such a situation.

It’s important, though, to keep in mind that the equity premium is not about the rate of time preference: risk-free rates of return are already quite low in our world of mortal people. It’s more about the volatility of marginal utility growth, and there’s no logical connection between that and the time for which people are alive. One of the most striking illustrations of that is Campbell and Cochrane’s habit formation model of the equity premium, which produces a long-run equity premium even at infinite time horizons, something a lot of other models of the equity premium struggle with.

I think in the real world if people became immortal the long-run (or average) equity premium would fall, but the short-run equity premium would still sometimes be high, in particular in times of economic difficulty.

• Over 20 years that’s possible (and I think it’s in fact true), but the paper I cite in the post gives some data which makes it unlikely that the whole past record is outperformance. It’s hard to square 150 years of over 6% mean annual equity premium with 20% annual standard deviation with the idea that the true stock return is actually the same as the return on T-bills. The “true” premium might be lower than 6% but not by too much, and we’re still left with more or less the same puzzle even if we assume that.

# Equity premium puzzles

16 Nov 2021 20:50 UTC
16 points
(www.metaculus.com)

• That’s alright, it’s partly on me for not being clear enough in my original comment.

I think information aggregation from different experts is in general a nontrivial and context-dependent problem. If you’re trying to actually add up different forecasts to obtain some composite result it’s probably better to average probabilities; but aside from my toy model in the original comment, “field data” from Metaculus also backs up the idea that on single binary questions the median forecast or the average of log odds consistently beats the average of probabilities.

I agree with SimonM that the question of which aggregation method is best has to be answered empirically in specific contexts and theoretical arguments or models (including mine) are at best weakly informative about that.

• I don’t know what you’re talking about here. You don’t need any nonlinear functions to recover the probability. The probability implied by M_t is just M_t, and the probability you should forecast having seen M_t is therefore

E[M_T | M_t] = M_t

since M is a martingale.

I think you don’t really understand what my example is doing. M is not a Brownian motion and its increments are not Gaussian; it’s a nonlinear transform of a drift-diffusion process by a sigmoid which takes values in (0, 1). M itself is already a martingale, so you don’t need to apply any nonlinear transformation to M on top of that in order to recover any probabilities.

The explicit definition is that you take an underlying drift-diffusion process Y following

dY_t = s² (σ(Y_t) − 1/2) dt + s dW_t

where σ(x) = 1/(1 + e^(−x)) is the sigmoid, and let M_t = σ(Y_t). You can check that this is a martingale by using Ito’s lemma.

If you’re still not convinced, you can actually use my Python script in the original comment to obtain calibration data for the experts using Monte Carlo simulations. If you do that, you’ll notice that they are well calibrated and not overconfident.
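I don't have that script reproduced here, but a minimal sketch of the martingale check (my own reconstruction; the volatility s = 1 and the step counts are assumptions) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

s, T, steps, paths = 1.0, 1.0, 1000, 100_000
dt = T / steps
Y = np.zeros(paths)  # Y_0 = 0, so M_0 = sigmoid(0) = 1/2
for _ in range(steps):
    # dY = s^2 (sigmoid(Y) - 1/2) dt + s dW; the drift is the Ito
    # correction that makes sigmoid(Y) driftless
    Y += s**2 * (sigmoid(Y) - 0.5) * dt + s * np.sqrt(dt) * rng.standard_normal(paths)

M_T = sigmoid(Y)
print(M_T.mean())  # should be close to M_0 = 0.5 if M is a martingale
```

The Monte Carlo mean of M_T staying at M_0 is exactly the martingale property E[M_T] = M_0.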

• Thanks for the comment—I’m glad people don’t take what I said at face value, since it’s often not correct...

What I actually maximized is (something like, though not quite) the expected value of the logarithm of the return, i.e. what you’d do if you used the Kelly criterion. This is the correct way to maximize long-run expected returns, but it’s not the same thing as maximizing expected returns over any given time horizon.

My computation of the gap E[W_t − V_t] is correct, but the problem comes in elsewhere. Obviously if your goal is to just maximize expected return then we have

E[V_t] ≈ e^(kμt)

and to maximize this we would just want to push k as high as possible as long as μ > 0, regardless of the horizon at which we would be rebalancing. However, it turns out that this is perfectly consistent with a large and growing gap

E[W_t − V_t]

where W is the ideal leveraged portfolio in my comment and V is the actual one, both with k-fold leverage. So the leverage decay term is actually correct; the problem is that we actually have

E[W_t] = e^((kμ + k(k − 1)σ²/2) t)

and the leverage decay term is just the second term in the sum multiplying t. The actual leveraged portfolio we can achieve follows

E[V_t] ≈ e^(kμt)

which is still good enough for the expected return to be increasing in k. On the other hand, if we look at the logarithm of this, we get

E[log V_t] ≈ (kμ − k²σ²/2) t

so now it would be optimal to choose something like k = μ/σ² if we were interested in maximizing the expected value of the logarithm of the return, i.e. in using Kelly.

The fundamental problem is that W_t = (S_t/S_0)^k is not the good definition of the ideally leveraged portfolio, so trying to minimize the gap between V and W is not the same thing as maximizing the expected return of V. I’m leaving the original comment up anyway because I think it’s instructive and the computation is still useful for other purposes.
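To make the Kelly point concrete, here is a quick closed-form check (my own sketch; the drift, volatility, and grid are illustrative numbers): the expected return of a continuously rebalanced k-leveraged position on a geometric Brownian motion grows without bound in k, while the expected log return is maximized at k = μ/σ².

```python
import numpy as np

mu, sigma, t = 0.05, 0.2, 1.0
ks = np.linspace(0, 5, 501)  # candidate leverage factors

# E[V_t] = exp(k*mu*t): increasing in k whenever mu > 0
expected_return = np.exp(ks * mu * t)
# E[log V_t] = (k*mu - k^2*sigma^2/2)*t: concave in k, peak at mu/sigma^2
expected_log_return = (ks * mu - ks**2 * sigma**2 / 2) * t

print(expected_return.argmax() == len(ks) - 1)   # True: pushed to the largest k
print(ks[expected_log_return.argmax()])          # mu/sigma^2 = 1.25
```

This is the sense in which "maximize expected return" and "maximize expected log return" give completely different leverage recommendations.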

• I did a Monte Carlo simulation for this on my own; you can find the Python script on Pastebin.

Consider the following model: there is a bounded martingale M taking values in (0, 1) and with some initial value M_0. The exact process I considered was a Brownian motion-like model for the log odds combined with some bias coming from Ito’s lemma to make the sigmoid-transformed process into a martingale. This process goes on until some time T and then the event is resolved according to the probability implied by M_T. You have n “experts” who all get to observe this martingale at some idiosyncratic random time sampled uniformly from [0, T], but the times themselves are unknown to them (and to you).

In this case if you knew the expert who had the most information, i.e. who had sampled the martingale at the latest time, you’d do best to just copy his forecast exactly. You don’t know this in this setup, but in general you should believe on average that more extreme predictions came at later times, and so you should somehow give them more weight. Because of this, averaging the log odds in this setup does better than averaging the probabilities across a wide range of parameter settings. Because in this setup the information sets of different experts are as far as possible from being independent, there would also be no sense in extremizing the forecasts in any way.

In practice, as confirmed by the simulation, averaging log odds seems to do better than averaging the forecasts directly, and the gap in performance gets wider as the volatility of the process increases. This is the result I expected without doing any Monte Carlo to begin with, but it does hold up empirically, so there’s at least one case in which averaging the log odds is a better thing to do than averaging the means. Obviously you can always come up with toy examples to make any aggregation method look good, but I think modelling different experts as taking the conditional expectations of a martingale under different sigma algebras in the same filtration is the most obvious model.
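A compressed version of this experiment (my own reconstruction, not the original Pastebin script; the number of experts, volatility, and trial counts are all assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def simulate(n_experts=5, s=2.0, steps=200, trials=20_000):
    dt = 1.0 / steps
    Y = np.zeros(trials)
    path = np.empty((steps + 1, trials))
    path[0] = 0.5  # M_0 = sigmoid(0)
    for i in range(steps):
        # drift term is the Ito correction making sigmoid(Y) a martingale
        Y = Y + s**2 * (sigmoid(Y) - 0.5) * dt \
              + s * np.sqrt(dt) * rng.standard_normal(trials)
        path[i + 1] = sigmoid(Y)
    outcome = rng.random(trials) < path[-1]  # resolve with probability M_T
    # each expert observes the martingale at a uniform random time
    idx = rng.integers(0, steps + 1, size=(n_experts, trials))
    forecasts = path[idx, np.arange(trials)]
    p_mean = forecasts.mean(axis=0)                      # average probabilities
    p_logodds = sigmoid(logit(forecasts).mean(axis=0))   # average log odds
    def mean_log_loss(p):
        return -np.mean(np.where(outcome, np.log(p), np.log1p(-p)))
    return mean_log_loss(p_mean), mean_log_loss(p_logodds)

loss_prob, loss_odds = simulate()
print(loss_prob, loss_odds)  # expect the log-odds average to score better
```

Scoring by mean log loss, the log-odds average should come out ahead at these parameter settings, and in my runs the gap widens as s increases.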

• NOTE: Don’t believe everything I said in this comment! I elaborate on some of the problems with it in the responses, but I’m leaving this original comment up because I think it’s instructive even though it’s not correct.

There is a theoretical account for why portfolios leveraged beyond a certain point would have poor returns even if prices follow a random process with (almost surely) continuous sample paths: leverage decay. If you could continuously rebalance a leveraged portfolio this would not be an issue, but if you can’t do that then leverage exhibits discontinuous behavior as the frequency of rebalancing goes to infinity.

A simple way to see this is that if the underlying S follows Brownian motion with drift μ and volatility σ and the risk-free return is zero, a portfolio of the underlying leveraged k-fold and rebalanced with a period of T (which has to be small enough for these approximations to be valid) will get a return

1 + k (S_T/S_0 − 1)

over one period. On the other hand, the ideal leveraged portfolio that’s continuously rebalanced would get

(S_T/S_0)^k

If we assume the period T is small enough that a second order Taylor approximation is valid, the difference between these two is approximately

(k(k − 1)/2) σ² T

in expectation. In particular, the difference in expected return scales linearly with the period in this regime, which means if we look at returns over the same time interval changing T has no effect on the amount of leverage decay. This suggests a rule of thumb: to find the optimal (from the point of view of maximizing long-term expected return alone) leverage in a market we should maximize an expression of the form

kμ − (k(k − 1)/2) σ²

with respect to k, which would have us choose something like k = μ/σ² + 1/2. Picking the leverage factor to be any larger than that is not optimal. You can see this effect in practice if you look at how well leveraged ETFs tracking the S&P 500 perform in times of high volatility.
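A Monte Carlo sanity check of the single-period decay term (my own sketch; the drift, volatility, leverage, and rebalancing period are made-up illustrative values): for a lognormal underlying over a short period T, the gap between (1+R)^k and 1+kR should be about k(k−1)/2 · σ²T in expectation.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, T, k = 0.05, 0.2, 1 / 52, 3  # weekly rebalancing, 3x leverage

# one-period simple return of a GBM underlying
R = np.exp((mu - sigma**2 / 2) * T
           + sigma * np.sqrt(T) * rng.standard_normal(1_000_000)) - 1

# gap between the "ideal" compounded payoff and the k-fold leveraged payoff
gap = np.mean((1 + R)**k - (1 + k * R))
print(gap, k * (k - 1) / 2 * sigma**2 * T)  # both ≈ 2.3e-3
```

The second-order Taylor approximation in the comment above is just E[R²] ≈ σ²T applied to the expansion of (1+R)^k.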

• I think there’s some kind of miscommunication going on here, because I think what you’re saying is trivially wrong while you seem convinced that it’s correct despite knowing about my point of view.

> No it doesn’t. It weighs them by price (i.e. marginal utility = production opportunity cost) at the quantities consumed. That is not a good proxy for how important they actually were to consumers.

Yes it is—on the margin. You can’t hope for it to be globally good because of the argument I gave, but locally of course you can, that’s what marginal utility means! This is modulo the zero lower bound problem you discuss in the subsequent paragraphs, but that problem is not as significant as you might think in practice, since very few revolutions happen in such a short timespan that the zero lower bound would throw things off by much.

> I’m mostly operationalizing “revolution” as a big drop in production cost.

> I think the wine example is conflating two different “prices”: the consumer’s marginal utility, and the opportunity cost to produce the wine. The latter is at least extremely large, and plausibly infinite, but the former is not. If we actually somehow obtained a pallet of 2058 wine today, it would be quite a novelty, but it would sell at auction for a decidedly non-infinite price. (And if people realized how quickly its value would depreciate, it might even sell for a relatively low price, assuming there were enough supply to satisfy a few rich novelty-buyers.) The two prices are not currently equal because production has hit its lower bound (i.e. zero).

I think a pallet of wine that somehow traveled through time would sell at a very high, though not infinite, price. The fact that the price is merely “very high” instead of “infinite” doesn’t affect my argument in the least. Your claim that the two prices aren’t currently equal because of the zero lower bound problem is certainly correct, but it’s a technical objection that can be fixed by modifying the example a little bit without changing anything about its core message. For instance, you can take the good in question to be “sending a spacecraft to the surface of Mars and maintaining it there”, which currently has a nonzero consumption. It’s conceivable, at least to me, that even if the cost of doing this comes down by a factor of a billion, it won’t produce anything like a commensurate amount of consumer surplus.

My problem is, as I said before, that if “revolution” is operationalized as a big fall in production costs then your claim about “real GDP measuring growth in the production of goods that is revolutionized least” is false, because there are examples which avoid the boundary problems you bring up (so relative marginal utility is always equal to relative marginal cost) and in which a good that is revolutionized would dominate the growth in real GDP because the demand for that good is so elastic, i.e. the curvature of the utility function with respect to that good is so low.

> A technological revolution does typically involve a big drop in production cost. Note, however, that this does not necessarily mean a big drop in marginal utility.

How does it not “necessarily” mean a big drop in marginal utility if you get rid of your objection related to the zero lower bound? A model in which this is not true would have to break the property that the ratio of marginal costs is equal to the ratio of marginal utilities, which is only going to happen if the optimization problem of some agent is solved at a boundary point of some choice space rather than an interior point.

Nothing in your post hints at this distinction, so I’m confused why you’re bringing it up now.

> When I say “real GDP growth curves mostly tell us about the slow and steady increase in production of things which haven’t been revolutionized”, I mean something orthogonal to that. I mean that the real GDP growth curve looks almost-the-same in a world without a big electronics revolution as it does in a world with a big electronics revolution.

Can you demonstrate these claims in the context of the Cobb-Douglas toy model, or if you think your argument hinges on the utility function not having a special form, can you write down a model of your own which demonstrates this “approximate invariance under revolutions” property? In my toy model your claim is obviously false (because real GDP growth is a perfect proxy for increases in utility) so I don’t understand where you’re coming from here.

• Strong upvote for the comment. I think the situation is even worse than what you say: the fact is that had Petrov simply reported the inaccurate information in his possession up the chain of command as he was being pressured to do by his own subordinates, nobody would have heard of his name and nobody would have blamed him for doing his job. He could have even informed his superiors of his personal opinion that the information he was passing to them was inaccurate and left them to make the final decision about what to do. Not only would he have not been blamed for doing that, but he would have been just one anonymous official among dozens or hundreds who had some input in the process leading up to nuclear war.

We know who Petrov is because he refused to do that, and that’s also why he faced professional sanctions for his decision. This ritual turns that on its head by sending people personalized “launch codes” and publicly announcing the name of the person who chose to “press the button” and shaming them for doing so. It’s absurd and I don’t understand why so many people in the comments see it only as a “minor problem”.

• The reason I bring up the weighting of GDP growth is that there are some “revolutions” which are irrelevant and some “revolutions” which are relevant from whatever perspective you’re judging “craziness”. In particular, it’s absurd to think that the year 2058 will be crazy because suddenly people will be able to drink wine manufactured in the year 2058 at a low cost.

Consider this claim from your post:

> When we see slow, mostly-steady real GDP growth curves, that mostly tells us about the slow and steady increase in production of things which haven’t been revolutionized. It tells us approximately-nothing about the huge revolutions in e.g. electronics.

The way I interpret it, this claim is incorrect. Real GDP growth does tell you about the huge revolution in electronics, the same way that it tells you about the huge revolution in the production of wine in the year 2058. It can’t do it globally for the reasons I discussed, but it does do it locally at each point in time. The reason it appears not to tell you about it is that it (correctly) weighs each “revolution” by how important it actually was to consumers, rather than by how much the cost of production of said good fell.

I think the source of the ambiguity is that it’s not clear what you mean by a “revolution”. Do we define “revolutions” by decreases in marginal utility (i.e. prices) or by increases in overall utility (i.e. consumer surplus)? If you mean the former, then the wine example shows that it doesn’t really matter if a good is revolutionized in this sense for our judgment of how “crazy” such a change would be. If you mean the latter, then your claim that “GDP measures growth in goods that are revolutionized least” is false, because GDP is exactly designed to capture the marginal increase in consumer surplus.

• In addition, I’m confused about how you can agree with both my comment and your post at the same time. You explicitly say, for example, that

> Also, “GDP (as it’s actually calculated) measures production growth in the least-revolutionized goods” still seems like basically the right intuitive model over long times and large changes, and the “takeaways” in the post still seem correct.

but this is not what GDP does. In the toy model I gave, real GDP growth perfectly captures increases in utility; and in other models where it fails to do so the problem is not that it puts less weight on goods which are revolutionized more. If a particular good being revolutionized is worth a lot in terms of welfare, then the marginal utility of that good will fall slowly even if its production expands by large factors, so real GDP will keep paying attention to it. If it is worth little, then it’s correct for real GDP to ignore it, since we can come up with arbitrarily many goods (for example, wine manufactured in the year 2058) which have an infinite cost of production until one day the cost suddenly falls from infinity to something very small.

Is it “crazy” that after 2058, people will be able to drink wine manufactured in 2058? I don’t think so, and I assume you don’t either. Presumably this is because this is a relatively useless good if we think about it in terms of the consumer surplus or utility people would derive from it, so the fact that it is “revolutionized” is irrelevant. The obvious way to correct for this is to weigh increases in the consumption of goods by the marginal utility people derive from them, which is why real GDP is a measure that works locally.

How do you reconcile this claim you make in your post with my comment?

• I think in this case omitting the discussion about equivalence under monotonic transformations leads people in the direction of macroeconomic alchemy—they try to squeeze information about welfare from relative prices and quantities even though it’s actually impossible to do it.

The correct way to think about this is probably to use von Neumann’s approach to expected utility: pick three times in history, say t₁ < t₂ < t₃; assume that u(t₁) < u(t₂) < u(t₃), where u(t) is the utility of living around time t, and ask people for a probability p such that they would be indifferent between a certainty of living in time t₂ versus a probability p of living in time t₃ and a probability 1 − p of living in time t₁. You can then conclude that

u(t₂) = p · u(t₃) + (1 − p) · u(t₁)

if an expected utility model is applicable to the situation, so you would be getting actual information about the relative differences in how well off people were at various times in history. Obviously we can’t set up a contingent claims market and compare the prices we would get on some assets to infer some value for p, but just imagining having to make this gamble at some odds gives you a better framework to use in thinking about the question “how much have things improved, really?”
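As a toy numerical illustration (the indifference probability here is made up): normalizing u(t₁) = 0 and u(t₃) = 1, the elicited probability pins down u(t₂) and hence the ratio of the two improvements.

```python
p = 0.8  # hypothetical indifference probability elicited from someone
u_t1, u_t3 = 0.0, 1.0          # normalization: utility is only defined up to
u_t2 = p * u_t3 + (1 - p) * u_t1  # affine transformation, so fix two points

# relative improvement: (u(t2) - u(t1)) / (u(t3) - u(t2)) = p / (1 - p)
print(u_t2, p / (1 - p))
```

With p = 0.8 the improvement from t₁ to t₂ would be judged four times as large as the improvement from t₂ to t₃.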

• There is a standard reason why real GDP growth is defined the way it is: it works locally in time, and that’s really the best you can ask for from this kind of measure. If you have an agent with utility function U(x₁, …, x_n) defined over goods with no explicit time dependence, you can express the derivative of utility with respect to time as

dU/dt = ∑ᵢ (∂U/∂xᵢ) (dxᵢ/dt)

If you divide both sides by the marginal utility of some good taken as the numeraire, say the first one, then you get

(1/(∂U/∂x₁)) dU/dt = ∑ᵢ pᵢ (dxᵢ/dt)

where pᵢ = (∂U/∂xᵢ)/(∂U/∂x₁) is the price of good i in terms of good 1. The right hand side is essentially the change in real GDP, while the left hand side measures the rate of change of utility over time in “marginal units of good 1”. If we knew that the marginal utility of the numeraire were somehow constant, then changes in real GDP would be exactly proportional to changes in utility, but in general we can’t know anything like this, because from prices we can only really tell the utility function up to a monotonic transformation. This means real GDP is by construction unable to tell us the answer to a question like “how much has life improved since 1960?” without some further assumptions about U, since the only information about preferences incorporated into it is prices, so by construction it is incapable of distinguishing utility functions in the same equivalence class under composition with a monotonic transformation.

However, real GDP does tell you the correct thing to look at locally in time: if the time interval is relatively short, so that this first order approximation is valid and the marginal utility of the numeraire is roughly constant, it tells you that the changes over that time period have improved welfare as much as some extra amount of the numeraire good would have. If you want to recover global information from that, real GDP satisfies

d(log RGDP)/dt = (1/((∂U/∂x₁) · Y)) dU/dt

where Y = ∑ᵢ pᵢ xᵢ is nominal GDP in units of the numeraire, so what you need for real GDP growth to be a good measure of welfare is for nominal GDP (GDP in units of the numeraire) times the marginal utility of the numeraire to only be a function of U, which I think is equivalent to U being Cobb-Douglas up to monotonic transformation. The special nature of Cobb-Douglas also came up in another comment, but this is how it comes up here.

I think the discussion in the post is somewhat misleading. There’s really no problem with real GDP ignoring goods whose price has been cut by a factor of a trillion; in the toy example I gave with Cobb-Douglas utility, real GDP is actually a perfect measure of welfare no matter which goods have their prices cut by how much. The problem with real GDP is that it can only work as a measure on the margin because it only uses marginal information (prices), so it’s insensitive to overall transformations of the utility function which don’t affect anything marginal.
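As a numerical check of the Cobb-Douglas claim (my own sketch; the exponents and quantities are made-up values), here a chained price-weighted index is accumulated along a path where one good is "revolutionized" by a million-fold increase in consumption, and it still matches the change in log utility:

```python
import numpy as np

a = np.array([0.3, 0.7])   # Cobb-Douglas exponents, summing to 1
x0 = np.array([1.0, 1.0])  # initial consumption bundle
x1 = np.array([1e6, 1.1])  # final bundle: good 0 grows a million-fold

steps = 100_000
log_rgdp = 0.0
x_prev = x0
for step in range(1, steps + 1):
    # log-linear path between the two bundles
    x = x0 * (x1 / x0) ** (step / steps)
    p = a / x_prev  # prices proportional to marginal utilities a_i * U / x_i
    # chain-linked index: nominal growth at current prices
    log_rgdp += np.log(1 + p @ (x - x_prev) / (p @ x_prev))
    x_prev = x

log_u_change = a @ np.log(x1 / x0)  # change in log utility
print(log_rgdp, log_u_change)       # the two agree
```

With Cobb-Douglas utility each step's price-weighted growth is exactly ∑ᵢ aᵢ d(log xᵢ) = d(log U), which is why the chained index tracks utility no matter how lopsided the "revolution" is.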