I want to say this more clearly and simply somehow. Something like ‘adding up a series of conditional modes does not give the overall mode’? (And for nonnegative, right-skewed things like timelines, it’ll systematically underestimate except maybe in the presence of very particular correlations?)
Here’s a try at phrasing it with less probability jargon:
The forecast contains a number of steps, all of which are assumed to take our best estimate of their most likely time. But in reality, unless we’re very lucky, some of those steps will be faster than predicted, and some will be slower. The ones that are faster can only be so much faster (because they can’t take no time at all). On the other hand, the ones that are slower can be much slower. So the net effect of this uncertainty probably adds up to a slowdown relative to the prediction.
Does that seem like a fair summary?
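To make the slowdown concrete, here's a minimal Monte Carlo sketch. The five lognormal step durations and their parameters are purely illustrative assumptions of mine, not numbers from the forecast:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five hypothetical sequential steps with right-skewed (lognormal)
# durations, in months. The mu/sigma values are made up for illustration.
mus = np.array([1.0, 1.5, 0.5, 2.0, 1.2])
sigmas = np.array([0.6, 0.8, 0.5, 0.7, 0.9])

# Mode of each individual lognormal step is exp(mu - sigma^2).
step_modes = np.exp(mus - sigmas**2)
sum_of_modes = step_modes.sum()

# Sample whole timelines and look at the distribution of the total time.
totals = np.exp(rng.normal(mus, sigmas, size=(100_000, len(mus)))).sum(axis=1)

print(f"sum of per-step modes:    {sum_of_modes:5.1f} months")
print(f"median total time:        {np.median(totals):5.1f} months")
print(f"P(total <= sum of modes): {np.mean(totals <= sum_of_modes):.1%}")
```

The sum of the per-step modes lands well into the left tail of the distribution of total time, which is the systematic-underestimate point above.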
Generate an image randomly with each pixel black with 51% chance and white with 49% chance, independently. The most likely image? Totally black. But virtually all the probability mass is on images that are ~49% white. Adding correlations between neighbouring pixels (or, in 1D, between successive events in a time series) doesn't remove the problem, despite what you might assume.
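A quick numerical sanity check of the pixel example (just a sketch; the 100×100 image size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 100 * 100  # a 100x100 binary image
p_black = 0.51

# The single most likely image is all-black, but its probability is astronomically small.
print("log10 P(all-black image):", n_pixels * np.log10(p_black))  # roughly -2924

# Typical samples, by contrast, are about 49% white.
samples = rng.random((1_000, n_pixels)) < p_black  # True = black pixel
white_fraction = 1.0 - samples.mean(axis=1)
print(f"white fraction across samples: {white_fraction.mean():.3f} "
      f"+/- {white_fraction.std():.3f}")
print("any all-black samples?", bool(samples.all(axis=1).any()))
```

Every sample comes out roughly 49% white, and of course none of them is the 'most likely' all-black image.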
The core problem is that the mode of a high-dimensional probability distribution is typically unrepresentative: it can sit far from the typical set, where virtually all of the probability mass lives. (As an aside, the same issue causes trouble for parameter estimation in unnormalized energy-based models, an extremely broad class, because normalizing them requires sampling or integrating over the whole space; maximum-probability estimates can be dangerous.)
Statistical mechanics points to the solution: knowing the most likely microstate of a box of particles tells you essentially nothing; physicists care about macrostates, which are observables. You define a statistic (any function of the data that summarizes it) that you actually care about, and then take the mode of that. For example, the number of breakthrough discoveries by time t.
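Here's a sketch of that move, reusing the made-up lognormal steps from the earlier snippet and taking 'number of steps completed by time t' as the statistic (the horizon t = 12 months is again an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same made-up sequential steps as before (lognormal durations in months).
mus = np.array([1.0, 1.5, 0.5, 2.0, 1.2])
sigmas = np.array([0.6, 0.8, 0.5, 0.7, 0.9])
t = 12.0  # horizon in months

durations = np.exp(rng.normal(mus, sigmas, size=(100_000, len(mus))))
cumulative = durations.cumsum(axis=1)

# Statistic: how many steps are finished by time t in each sampled timeline.
steps_done = (cumulative <= t).sum(axis=1)

# The mode is now taken over this one-dimensional summary,
# not over the full joint timeline.
values, counts = np.unique(steps_done, return_counts=True)
print("distribution of 'steps completed by t':",
      dict(zip(values.tolist(), counts.tolist())))
print("mode of the statistic:", values[counts.argmax()])
```

The mode of this low-dimensional summary is a sensible forecast object, in a way that the joint mode of the full timeline is not.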
An intuition you might be able to invoke is that the procedure they describe is like greedy decoding from an LLM: picking the single most likely token at each step doesn't, in general, give you the most probable completion.
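A toy two-token example of that intuition, with invented probabilities that have nothing to do with any real model:

```python
# Toy "language model" over two-token sequences:
# P(first token), then P(second token | first token).
p_first = {"A": 0.6, "B": 0.4}
p_second = {
    "A": {"x": 0.3, "y": 0.3, "z": 0.4},    # mass spread out after A
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},  # mass concentrated after B
}

# Greedy decoding: pick the argmax token at each step.
first = max(p_first, key=p_first.get)                    # "A"
second = max(p_second[first], key=p_second[first].get)   # "z"
print("greedy:", (first, second), p_first[first] * p_second[first][second])  # 0.24

# Exhaustive search: the actual most probable sequence.
best = max(
    ((f, s) for f in p_first for s in p_second[f]),
    key=lambda fs: p_first[fs[0]] * p_second[fs[0]][fs[1]],
)
print("best:  ", best, p_first[best[0]] * p_second[best[0]][best[1]])  # ("B", "x"), 0.36
```

Greedy locks in the locally most likely first token and ends up with a 0.24-probability sequence, while the globally most probable sequence has probability 0.36; stacking per-step best guesses in a forecast is the same kind of move.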