# AlexMennen (Alex Mennen)

Karma: 3,968
• Why would questions where uninformed forecasters produce uniform priors make logodds averaging work better?

Because it produces situations where more extreme probability estimates correlate with more expertise (assuming all forecasters are well-calibrated).

I don’t understand your point. Why would forecasters care about what other people would do? They only want to maximize their own score.

They wouldn’t. But if both would have started with priors around 50% before they acquired any of their expertise, and it’s their expertise that updates them away from 50%, then more expertise is required to get more extreme odds. If the probability is a martingale that starts at 50%, and the time axis is taken to be expertise, then more extreme probabilities will on average be sampled from later in the martingale; i.e. with more expertise.
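To make this concrete, here’s a quick simulation sketch (not anything from the original discussion, just one convenient probability martingale: the conditional probability that a Brownian motion ends above zero). Sampling it at later times yields more extreme probabilities on average:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_paths, n_steps = 50_000, 100
dt = 1.0 / n_steps

# P_t = Phi(W_t / sqrt(1 - t)) is the probability that W_1 > 0 given the
# path so far; it is a martingale in t starting at exactly 0.5.
W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)), axis=1)
t = np.arange(1, n_steps + 1) * dt

for k in (9, 49, 89):  # an early, a middle, and a late sampling time
    P = norm.cdf(W[:, k] / np.sqrt(1.0 - t[k]))
    print(f"t = {t[k]:.2f}: mean |P_t - 0.5| = {np.abs(P - 0.5).mean():.3f}")
# The printed spread grows with t: later samples (more "expertise") land
# systematically further from 50%.
```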

This also doesn’t make much sense to me, though it might be because I still don’t understand the point about needing uniform priors for logodd pooling.

If logodd pooling implicitly assumes a uniform prior, then logodd pooling on A vs ¬A assumes A has prior probability 1/2, and logodd pooling on A vs B vs C assumes A has a prior of 1/3, which, if the implicit prior actually was important, could explain the different results.

• (Possibly a bit of a tangent) It occurred to me while reading this that perhaps average log odds could make sense in the context in which there is a uniform prior, and the probabilities provided by experts differ because the experts disagree on how to interpret evidence that brings them away from the uniform prior. This has some intuitive appeal:

1) Perhaps, when picking questions to ask forecasters, people have a tendency to pick questions for which they believe the probability that the answer is yes is approximately 50%, because that offers the most opportunity to update in response to the beliefs of the forecasters. If average log odds is an appropriate pooling method to use if you have a uniform prior, then this would explain its good empirical performance. I think I mentioned in our discussion on your EA forum post that if there is a tendency for more knowledgeable forecasters to give more extreme probabilities, then this would explain good performance by average log odds, which weights extreme predictions heavily. A tendency for the questions asked to have priors of near 50% according to the typical unknowledgeable person would explain why more knowledgeable forecasters would assign more extreme probabilities on average: it takes more expertise to justifiably bring their probabilities further from 50%.

2) It excuses the incoherent behavior of average log odds on my ABC example as well. If A, B, and C are mutually exclusive, then they can’t all have 50% prior probability, so a pooling method that implicitly assumes that they do will not give coherent results.

Ultimately, though, I don’t think this is actually true. Consider the example of forecasting a continuous variable $x$ by soliciting probability density functions $p_1$ and $p_2$ from two experts, and pooling them to get the pdf proportional to $\sqrt{p_1(x)p_2(x)}$ (renormalized so it integrates to 1). You could also consider forecasting the variable $y := f(x)$ for some differentiable, strictly increasing function $f$. Then your experts give you pdfs $q_1$ and $q_2$ satisfying $q_i(f(x))f'(x) = p_i(x)$, and you pool them to get the pdf proportional to $\sqrt{q_1(y)q_2(y)}$. I claim that, if what we’re doing implicitly depends on a uniform prior in a sneaky way, then the first thing should be the appropriate thing to do if $x$ has a uniform prior, and the second thing should be appropriate if $y$ has a uniform prior. If $f$ is nonlinear, then a uniform prior on $x$ induces a non-uniform prior on $y$, and vice-versa, so we should get incompatible results from each way of doing this, as we were implicitly using different priors each time. But let’s try it: $\sqrt{q_1(f(x))q_2(f(x))}\,f'(x) = \sqrt{q_1(f(x))f'(x)\cdot q_2(f(x))f'(x)} = \sqrt{p_1(x)p_2(x)}$. Thus, given that both experts provided pdfs satisfying the formula making their probability distributions on $x$ and $y$ compatible with $y = f(x)$, our pooled pdf also satisfies that formula, and is also compatible with $y = f(x)$. That is, if we pool using beliefs about $x$ and then find the implied beliefs about $y$, we get the same thing as if we directly pool using beliefs about $y$. Different implicit priors don’t appear to be ruining anything.
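Here is a numerical check of that calculation, with two made-up Gaussian expert pdfs and $f(x) = x + e^x$ as the strictly increasing reparameterization (nothing hinges on these particular choices):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20_001)
dx = x[1] - x[0]

def normalize(p):
    return p / (p.sum() * dx)  # renormalize so the pdf integrates to 1

# Two made-up expert pdfs over x.
p1 = normalize(np.exp(-0.5 * ((x - 1.0) / 1.0) ** 2))
p2 = normalize(np.exp(-0.5 * ((x + 0.5) / 2.0) ** 2))

# Pool in x-space: pdf proportional to sqrt(p1 * p2).
pooled_x = normalize(np.sqrt(p1 * p2))

# Reparameterize by the strictly increasing f(x) = x + exp(x).
fprime = 1.0 + np.exp(x)
q1 = p1 / fprime  # q_i(f(x)) = p_i(x) / f'(x)
q2 = p2 / fprime

# Pool in y-space (normalizing with respect to y, where dy = f'(x) dx),
# then map the pooled pdf back to x-space.
pooled_y = np.sqrt(q1 * q2)
pooled_y /= (pooled_y * fprime).sum() * dx
back_in_x = pooled_y * fprime

print(np.max(np.abs(back_in_x - pooled_x)))  # ~1e-16: the two pools agree
```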

I conclude that the incoherent results in my ABC example cannot be blamed on switching between the uniform prior on {A,B,C} and the uniform prior on {A,¬A}, and, instead, should be blamed entirely on the experts having different beliefs conditional on A, which is taken into account in the calculation using A,B,C, but not in the calculation using A,¬A.
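For concreteness, here is a minimal sketch of that incoherence; the expert numbers are made up rather than taken from the original example:

```python
import math

def avg_log_odds(ps):
    """Pool probabilities of a single event by averaging their log odds."""
    mean_logit = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-mean_logit))

# Two experts' distributions over mutually exclusive, exhaustive A, B, C.
expert1 = {"A": 0.8, "B": 0.1, "C": 0.1}
expert2 = {"A": 0.4, "B": 0.3, "C": 0.3}

pooled = {e: avg_log_odds([expert1[e], expert2[e]]) for e in "ABC"}
print(pooled)                # {'A': ~0.62, 'B': ~0.18, 'C': ~0.18}
print(sum(pooled.values()))  # ~0.98, not 1: additivity fails
```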

• The probability of the event is the expected value of the probability implied by M(T). The experts report M(X) for a random variable X sampled uniformly in [0,T]. M(T) differs from M(X) by a Gaussian of mean 0, and hence, knowing M(X), the expected value of M(T) is just M(X). But we want the expected value of the probability implied by M(T), which is different from the probability implied by the expected value of M(T), because expected value does not commute with nonlinear functions. So an expert reporting the probability implied by M(X) is not well-calibrated, even though an expert reporting M(X) is giving an unbiased estimate of M(T).
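A simulation sketch of that miscalibration, under the added assumption (mine, for concreteness) that the probability implied by a martingale value is the sigmoid of that value:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m_x = rng.normal(0.0, 1.0, n)         # what the expert observes, M(X)
m_t = m_x + rng.normal(0.0, 1.5, n)   # M(T) = M(X) plus mean-zero Gaussian
event = rng.random(n) < sigmoid(m_t)  # event occurs with prob sigmoid(M(T))

report = sigmoid(m_x)                 # "the probability implied by M(X)"
mask = (report > 0.78) & (report < 0.82)
print(report[mask].mean())            # ~0.80: what the expert reports
print(event[mask].mean())             # ~0.73: how often the event happens
```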

• In SimonM’s comment, we’re talking about probabilities directly. Forecasting. Usually that means what we care about is calibration or a proper scoring rule, so the natural scale is [0,1] or log-odds. Now the correct heuristic is “arithmetic mean” (of log-odds of probability).

Not sure what you mean by this. A proper scoring rule incentivizes the same results as deciding what odds you’d be indifferent to betting at (against a gambler whose decisions carry no information about reality).

• In this case if you knew the expert who had the most information, i.e. who had sampled the martingale at the latest time, you’d do best to just copy his forecast exactly.

Nope! If n=1, then you do know which expert has the most information, and you don’t do best by copying his forecast, because the experts in your model are overconfident. See my reply to ADifferentAnonymous.

But well done constructing a model in which average log odds outperforms average probabilities for compelling reasons.

• That doesn’t work, even in the case where the number of probability estimates you’re trying to aggregate together is one. The geometric mean of a set of one number is just that number, so the claim that average log odds is the appropriate way to handle this situation implies that if you are given one probability estimate from this procedure, the appropriate thing to do is take it literally, but this is not the case. Instead, you should try to adjust out the expected effect of the gaussian noise. The correct way to do this depends on your prior, but for simplicity and to avoid privileging any particular prior, let’s try using the improper prior such that seeing the probability estimate gives you no information on what the gaussian noise term was. Then your posterior distribution over the “true log odds” is the observed log odds estimate plus a gaussian. The expected value of the true log odds is, of course, the observed log odds estimate, but the expected value of the true probability is not the observed probability estimate; taking the expected value does not commute with applying nonlinear functions like converting between log odds and probabilities.
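Numerically, under that improper prior, the adjustment looks like this (the noise scale here is a made-up parameter):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def expected_true_prob(observed_logit, noise_sd):
    """E[sigmoid(true log odds)], where the posterior over the true log
    odds is Normal(observed_logit, noise_sd**2) as described above."""
    z = np.linspace(-8.0, 8.0, 10_001)  # standard-normal quadrature grid
    w = np.exp(-0.5 * z ** 2)
    w /= w.sum()
    return float(np.sum(w * sigmoid(observed_logit + noise_sd * z)))

obs = np.log(0.9 / 0.1)              # a single reported estimate of 90%
print(sigmoid(obs))                  # 0.90: taking the report literally
print(expected_true_prob(obs, 1.0))  # ~0.87: less extreme after adjustment
```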

• I agree, and in fact I already gave almost the same example in the original post. My claim was not that averaging probabilities is always appropriate, just that it is often reasonable, and average log odds never is.

Your example about additivity of disjoint events is somewhat contrived. Averaging log-odds respects the probability for a given event summing to 1, but if you add some additional structure it might not make sense, I agree.

Contrived how? What additional structure do you imagine I added? In what sense do you claim that averaging log odds preserves additivity of probability for disjoint events in the face of an example showing that the straightforward interpretation of this claim is false?

Averaging log-odds is exactly a Bayesian update

It isn’t; you can tell because additivity of probability for disjoint events continues to hold after Bayesian updates. [Edit: Perhaps a better explanation for why it isn’t a Bayesian update is that it isn’t even the same type signature as a Bayesian update. A Bayesian update takes a probability distribution and some evidence, and returns a probability distribution. Averaging log-odds takes some finite set of probabilities, and returns a probability]. I’m curious what led you to believe this, though.
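To spell out the type-signature point in code (a sketch; the function names are mine):

```python
import math
from typing import Callable, Dict, Hashable, List

Dist = Dict[Hashable, float]  # outcome -> probability, summing to 1

def bayes_update(prior: Dist, likelihood: Callable[[Hashable], float]) -> Dist:
    """Distribution + evidence -> distribution (additivity is preserved)."""
    unnorm = {w: p * likelihood(w) for w, p in prior.items()}
    total = sum(unnorm.values())
    return {w: u / total for w, u in unnorm.items()}

def average_log_odds(ps: List[float]) -> float:
    """Finite set of probabilities of one event -> a single probability."""
    mean_logit = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-mean_logit))
```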

# Average probabilities, not log odds

12 Nov 2021 21:39 UTC
16 points
• The combination of the two proposed explanations for why certain fields have a higher rate of one-boxing than others seems kind of plausible, but also very suspicious. Being more like decision theorists than like normies (and thus possibly getting more exposure to pro-two-boxing arguments that are popular among decision theorists) seems very similar to being more predisposed to good critical thinking on these sorts of topics (and thus possibly more likely to support one-boxing for correct reasons). So, by combining these two effects, we can explain why people in some subfield might be more likely than average to one-box, and also why people in that same subfield might be more likely than average to two-box, and just pick whichever of these explanations correctly predicts whatever people in that field end up answering.

Of course, this complaint makes it seem especially strange that two-boxing ended up being so popular among decision theorists.

• Timelines are short because the path to AGI is (blah blah)

This requires a high degree of precision about your knowledge of the path to AGI, which makes it seem not that plausible, unless timelines are very short no matter what you say because others will stumble their way through the path you’ve identified soon anyway.

• I was discouraged from writing a blog post estimating when AI would be developed, on the basis that a real conversation about this topic among rationalists would cause AI to come sooner, which would be more dangerous

Does anyone actually believe and/or want to defend this? I have a strong intuition that public-facing discussion of AI timelines within the rationalist and AI alignment communities is highly unlikely to have a non-negligible effect on AI timelines, especially in comparison to the potential benefit it could have for the AI alignment community being better able to reason about something very relevant to the problem they are trying to solve. (Ditto for probably most but not all topics regarding AGI that people interested in AI alignment may be tempted to discuss publicly.)

• This sort of thing seems to suggest that EY’s claims in this post about the scale of the relative intelligence differences between chimps, a village idiot, and Einstein are incorrect. The difference in intelligence between a village idiot and Einstein may be comparable to the difference in intelligence between some nonhuman animals and a human village idiot. This is a priori surprising, given that human brains are very structurally similar to each other in comparison to nonhuman animal brains.

• I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don’t think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price if the contract’s expected payoff conditional on taking $a$ is positive, even if $a$ is not the optimal action (let’s say $b$ is the optimal action). When the time comes to take an action, the agent’s best bet is $b'$ (the prime meaning that it also sells the contract back at the bookie’s price). The way I described the set-up, the agent doesn’t choose between $a$ and $a'$, because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn’t a Dutch book, because the case where the agent’s top choice $b$ goes through is the case in which the contract is worthless, and the contract’s value is derived from other cases.

We could modify the epsilon exploration assumption so that the agent also chooses between $a$ and $a'$ even while its top choice is $b$. That is, there’s a lower bound on the probability with which the agent takes an action in $\{a, a'\}$, but even if that bound is achieved, the agent still has some flexibility in distributing probability between $a$ and $a'$. In this case, contrary to your argument, the agent will prefer $a$ rather than $a'$, i.e., it will not get Dutch booked. This is because the agent is still choosing $b$ as the only action with high probability, and $E[U|a]$ refers to the expected consequence of the agent choosing $a$ as its intended action, so the agent cannot use $E[U|a]$ when calculating which of $a$ or $a'$ is better to pick as its next choice if its attempt to implement intended action $b$ fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that $E_t[U|do(a)]$ and $E_t[U|a]$ are the agent’s expectations at time $t$ of the utility that taking action $a$ would cause it to get, and the utility it would get conditional on taking action $a$, respectively. Suppose the bookie offers the deal at time $1$, and the agent must act at time $2$. If the possibility of gaining future knowledge is the only source of the agent’s uncertainty about its own decisions, then at time $2$, it knows what action it is taking, and $E_2[U|a]$ is undefined on actions not taken. $E_1[U|do(a)]$ and $E_2[U|do(a)]$ should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say $E[U|do(a)]$ and $E[U|a]$ in the description of the contract, this means $E_1[U|do(a)]$ and $E_1[U|a]$, respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by $E_2[U|do(a)]$, not $E_1[U|do(a)]$, so the argument for why the agent will inevitably resell the contract fails. If the $E[U|do(a)]$ appearing in the description of the contract instead means $E_2[U|do(a)]$ (since the agent doesn’t know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won’t purchase it in the first place, because it will know that the contract will only have value if $a$ seems to be suboptimal at time $2$ and it takes action $a$ anyway, which it knows won’t happen, and hence the contract is worthless.

• The Nirvana trick seems like a cheap hack, and I’m curious if there’s a way to see it as good reasoning.

One response to this was that predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance, which is a sensible way of predicting that that circumstance is impossible.

• That’s exactly what I was trying to say, not a disagreement with it. The only step where I claimed all reasonable ways of measuring spreadout-ness agree was on the result you get after summing up a large number of iid random variables, not the random variables that were being summed up.