I ran my own Monte Carlo simulation of this; you can find the Python script on Pastebin.

Consider the following model: there is a bounded martingale M taking values in [0,1] with initial value 1/2. The exact process I considered was a Brownian-motion-like model for the log odds, plus a drift correction from Ito’s lemma chosen so that the sigmoid-transformed process is a martingale. The process runs until some time T, and then the event is resolved according to the probability implied by M(T). You have n “experts”, each of whom observes this martingale at an idiosyncratic random time sampled uniformly from [0,T]; the times themselves are unknown to them (and to you).
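A minimal sketch of this setup, not the Pastebin script itself. It assumes an Euler-Maruyama discretisation on a grid of `n_steps` points; the function name `simulate_question` and all parameter names are hypothetical. The drift used is the one Ito's lemma requires for the sigmoid of the log odds to have zero drift.

```python
import numpy as np

def simulate_question(n_experts, sigma=1.0, T=1.0, n_steps=1000, rng=None):
    """Simulate one question; return (expert forecasts, M(T), outcome)."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    y = np.zeros(n_steps + 1)  # log odds; Y(0) = 0 gives M(0) = 1/2
    for k in range(n_steps):
        # (e^Y - 1) / (e^Y + 1) equals tanh(Y / 2)
        drift = 0.5 * sigma**2 * np.tanh(y[k] / 2.0)
        y[k + 1] = y[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    m = 1.0 / (1.0 + np.exp(-y))  # M = sigmoid(Y) = 1 - 1/(e^Y + 1)
    # Each expert sees M at an independent uniform grid time in [0, T]
    # (assumption: observation times are rounded to the grid).
    idx = rng.integers(0, n_steps + 1, size=n_experts)
    forecasts = m[idx]
    outcome = bool(rng.random() < m[-1])  # resolve with probability M(T)
    return forecasts, float(m[-1]), outcome
```

Calling `simulate_question(5, sigma=2.0)` returns five forecasts in (0, 1), the terminal value M(T), and a boolean resolution.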

In this case if you knew the expert who had the most information, i.e. who had sampled the martingale at the latest time, you’d do best to just copy his forecast exactly. You don’t know this in this setup, but in general you should believe on average that more extreme predictions came at later times, and so you should somehow give them more weight. Because of this, averaging the log odds in this setup does better than averaging the probabilities across a wide range of parameter settings. Because in this setup the information sets of different experts are as far as possible from being independent, there would also be no sense in extremizing the forecasts in any way.
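To make the two aggregation rules concrete, here is a small sketch. The function names are hypothetical, and the forecasts must lie strictly between 0 and 1 for the log odds to be finite.

```python
import numpy as np

def logit(p):
    """Log odds of a probability (elementwise)."""
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)

def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def mean_probability(forecasts):
    """Aggregate by averaging the probabilities directly."""
    return float(np.mean(forecasts))

def mean_log_odds(forecasts):
    """Aggregate by averaging log odds, then mapping back to a probability."""
    return float(sigmoid(np.mean(logit(forecasts))))

forecasts = [0.5, 0.9]
print(mean_probability(forecasts))  # approximately 0.7
print(mean_log_odds(forecasts))     # approximately 0.75
```

In this two-expert example the log odds average lands closer to the more extreme forecast than the probability average does, which is the kind of extra weight described above.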

In practice, as confirmed by the simulation, averaging log odds does better than averaging the forecasts directly, and the gap in performance gets wider as the volatility of the process M increases. This is the result I expected without doing any Monte Carlo to begin with, but it does hold up empirically, so there’s at least one case in which averaging the log odds is a better thing to do than averaging the probabilities. Obviously you can always come up with toy examples to make any aggregation method look good, but I think modelling different experts as taking the conditional expectations of a martingale under different sigma algebras in the same filtration is the most obvious model.
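A rough way to rerun this comparison, as a sketch only. It is not the Pastebin script: the Euler discretisation, the choice of the mean log score as the metric, and every name and default below are assumptions made for illustration.

```python
import numpy as np

def compare_aggregators(n_questions=20000, n_experts=5, sigma=2.0, T=1.0,
                        n_steps=200, seed=0):
    """Mean log score (higher is better) of the two aggregation rules."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    y = np.zeros(n_questions)  # log odds of every question, all start at 0
    # Grid step at which each expert observes each question.
    obs_step = rng.integers(0, n_steps + 1, size=(n_questions, n_experts))
    seen = np.zeros((n_questions, n_experts))  # log odds each expert saw
    for k in range(n_steps + 1):
        mask = obs_step == k
        seen[mask] = np.broadcast_to(y[:, None], mask.shape)[mask]
        if k < n_steps:
            drift = 0.5 * sigma**2 * np.tanh(y / 2.0)
            noise = rng.standard_normal(n_questions)
            y = y + drift * dt + sigma * np.sqrt(dt) * noise
    m_T = 1.0 / (1.0 + np.exp(-y))
    outcome = rng.random(n_questions) < m_T  # resolve with probability M(T)

    probs = 1.0 / (1.0 + np.exp(-seen))
    agg_prob = probs.mean(axis=1)
    agg_logodds = 1.0 / (1.0 + np.exp(-seen.mean(axis=1)))

    def log_score(p):
        return float(np.mean(np.where(outcome, np.log(p), np.log1p(-p))))

    return {"mean_probability": log_score(agg_prob),
            "mean_log_odds": log_score(agg_logodds)}

print(compare_aggregators())
```

Varying `sigma` lets you check how the gap between the two scores changes with the volatility of the process.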

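“In this case if you knew the expert who had the most information, i.e. who had sampled the martingale at the latest time, you’d do best to just copy his forecast exactly.”

Nope! If n=1, then you do know which expert has the most information, and you don’t do best by copying his forecast, because the experts in your model are overconfident. See my reply to ADifferentAnonymous.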

But well done constructing a model in which average log odds outperforms average probabilities for compelling reasons.

The experts in my model are designed to be perfectly calibrated. What do you mean by “they are overconfident”?

The probability of the event is the expected value of the probability implied by M(T). The experts report M(X) for a random variable X sampled uniformly in [0,T]. M(T) differs from M(X) by a Gaussian of mean 0, and hence, knowing M(X), the expected value of M(T) is just M(X). But we want the expected value of the probability implied by M(T), which is different from the probability implied by the expected value of M(T), because expected value does not commute with nonlinear functions. So an expert reporting the probability implied by M(X) is not well-calibrated, even though an expert reporting M(X) is giving an unbiased estimate of M(T).

I don’t know what you’re talking about here. You don’t need any nonlinear functions to recover the probability. The probability implied by M(T) is just M(T), and the probability you should forecast having seen M(X) is therefore

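$$P(E \mid M(X)) = \mathbb{E}[\mathbf{1}_E \mid \mathcal{F}_X] = \mathbb{E}\big[\mathbb{E}[\mathbf{1}_E \mid \mathcal{F}_T] \mid \mathcal{F}_X\big] = \mathbb{E}[M(T) \mid \mathcal{F}_X] = M(X)$$

since M is a martingale.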

I think you don’t really understand what my example is doing. M is not a Brownian motion and its increments are not Gaussian; it’s a nonlinear transform of a drift-diffusion process by a sigmoid which takes values in [0,1]. M itself is already a martingale so you don’t need to apply any nonlinear transformation to M on top of that in order to recover any probabilities.

The explicit definition is that you take an underlying drift-diffusion process Y following

$$dY = \frac{\sigma^2}{2}\left(\frac{e^Y - 1}{e^Y + 1}\right)dt + \sigma\,dz$$

and let $M = 1 - 1/(e^Y + 1)$. You can check that this M is a martingale by using Ito’s lemma.
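For readers who want the check spelled out, here is a short sketch of the Ito computation. It starts from a general drift $\mu(Y)$ (a symbol introduced here, not in the original comment) and solves for the drift that leaves $M$ with no $dt$ term.

```latex
\begin{aligned}
M &= s(Y) = \frac{e^Y}{e^Y+1}, \qquad s'(Y) = M(1-M), \qquad s''(Y) = M(1-M)(1-2M),\\
dY &= \mu(Y)\,dt + \sigma\,dz,\\
dM &= s'(Y)\,dY + \tfrac12 s''(Y)\,\sigma^2\,dt
    = M(1-M)\Big[\mu(Y) + \tfrac{\sigma^2}{2}(1-2M)\Big]dt + \sigma M(1-M)\,dz,\\
\mu(Y) &= \frac{\sigma^2}{2}(2M-1) = \frac{\sigma^2}{2}\cdot\frac{e^Y-1}{e^Y+1}
\quad\Longrightarrow\quad dM = \sigma M(1-M)\,dz .
\end{aligned}
```

With this drift $M$ is a driftless Ito process, and since it is bounded in $[0,1]$ it is a true martingale rather than only a local one.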

If you’re still not convinced, you can actually use my Python script in the original comment to obtain calibration data for the experts using Monte Carlo simulations. If you do that, you’ll notice that they are well calibrated and not overconfident.
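One possible way to run such a calibration check, sketched independently of the Pastebin script. It assumes one expert per question, an Euler discretisation, and equal-width forecast bins; the name `calibration_table` and all defaults are hypothetical.

```python
import numpy as np

def calibration_table(n_questions=20000, sigma=1.0, T=1.0, n_steps=100,
                      n_bins=10, seed=0):
    """Rows of (bin_low, bin_high, count, mean_forecast, observed_frequency)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    y = np.zeros(n_questions)  # log odds, Y(0) = 0
    obs_step = rng.integers(0, n_steps + 1, size=n_questions)
    seen = np.zeros(n_questions)  # log odds at each expert's observation time
    for k in range(n_steps + 1):
        hit = obs_step == k
        seen[hit] = y[hit]
        if k < n_steps:
            drift = 0.5 * sigma**2 * np.tanh(y / 2.0)
            noise = rng.standard_normal(n_questions)
            y = y + drift * dt + sigma * np.sqrt(dt) * noise
    forecast = 1.0 / (1.0 + np.exp(-seen))                  # expert reports M(X)
    outcome = rng.random(n_questions) < 1.0 / (1.0 + np.exp(-y))  # resolve via M(T)
    bins = np.minimum((forecast * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        sel = bins == b
        if sel.any():
            rows.append((b / n_bins, (b + 1) / n_bins, int(sel.sum()),
                         float(forecast[sel].mean()), float(outcome[sel].mean())))
    return rows

for row in calibration_table():
    print(row)
```

If the experts are calibrated, the mean forecast and the observed frequency in each well-populated bin should agree up to sampling noise and a small discretisation error.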

Oh, you’re right, sorry; I’d misinterpreted you as saying that M represented the log odds. What you actually did was far more sensible than that.

That’s alright, it’s partly on me for not being clear enough in my original comment.

I think information aggregation from different experts is in general a nontrivial and context-dependent problem. If you’re trying to actually add up different forecasts to obtain some composite result it’s probably better to average probabilities; but aside from my toy model in the original comment, “field data” from Metaculus also backs up the idea that on single binary questions median forecasts or log odds averages consistently beat probability averages.

I agree with SimonM that the question of which aggregation method is best has to be answered empirically in specific contexts, and that theoretical arguments or models (including mine) are at best weakly informative about that.