Suppose you know a golfer’s score on day 1 and are asked to predict his score on day 2. You expect the golfer to retain the same level of talent on the second day, so your best guesses will be “above average” for the [better-scoring] player and “below average” for the [worse-scoring] player. Luck, of course, is a different matter. Since you have no way of predicting the golfers’ luck on the second (or any) day, your best guess must be that it will be average, neither good nor bad. This means that in the absence of any other information, your best guess about the players’ score on day 2 should not be a repeat of their performance on day 1. …
The best predicted performance on day 2 is more moderate, closer to the average than the evidence on which it is based (the score on day 1). This is why the pattern is called regression to the mean. The more extreme the original score, the more regression we expect, because an extremely good score suggests a very lucky day. The regressive prediction is reasonable, but its accuracy is not guaranteed. A few of the golfers who scored 66 on day 1 will do even better on the second day, if their luck improves. Most will do worse, because their luck will no longer be above average.
Now let us go against the time arrow. Arrange the players by their performance on day 2 and look at their performance on day 1. You will find precisely the same pattern of regression to the mean. … The fact that you observe regression when you predict an early event from a later event should help convince you that regression does not have a causal explanation.
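The talent-plus-luck story in the quote can be checked with a small simulation. This is only a sketch under assumptions that are not in the book: scores are modelled as a field average of 72 plus a persistent talent term plus fresh luck each day, both normally distributed, and the spreads, cutoff of 68 and function names are made up for illustration.

```python
import random
import statistics

PAR = 72.0  # assumed field average; lower scores are better in golf

def simulate_two_days(n_golfers=20000, talent_sd=2.0, luck_sd=3.0, seed=0):
    """Return (day1, day2) scores under score = PAR + talent + luck."""
    rng = random.Random(seed)
    day1, day2 = [], []
    for _ in range(n_golfers):
        talent = rng.gauss(0.0, talent_sd)  # same on both days
        day1.append(PAR + talent + rng.gauss(0.0, luck_sd))  # independent luck
        day2.append(PAR + talent + rng.gauss(0.0, luck_sd))  # independent luck
    return day1, day2

def means_for_top_scorers(select_on, other, cutoff=68.0):
    """Pick golfers who shot `cutoff` or better on one day; return their
    average on that day and their average on the other day."""
    pairs = [(s, o) for s, o in zip(select_on, other) if s <= cutoff]
    picked_mean = statistics.mean(s for s, _ in pairs)
    other_mean = statistics.mean(o for _, o in pairs)
    return picked_mean, other_mean

day1, day2 = simulate_two_days()
d1_selected, d2_predicted = means_for_top_scorers(day1, day2)    # forward in time
d2_selected, d1_retrodicted = means_for_top_scorers(day2, day1)  # backward in time

print(f"selected on day 1: {d1_selected:.2f} -> day 2 average {d2_predicted:.2f}")
print(f"selected on day 2: {d2_selected:.2f} -> day 1 average {d1_retrodicted:.2f}")
```

In both directions the selected group's average on the other day should fall between their selected-day average and 72. Nothing in the model lets day 2 influence day 1, so the backward regression cannot be causal; it comes from selecting on a score that mixes talent with luck.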
If you know the scores of two different golfers on day 1, then you know more than if you know the score of only one golfer on day 1. You can’t predict the direction in which regression to the mean will occur if your data set is a single point.
The following all have different answers:
I play a certain video game a lot. The last time I played it, my score was 39700. What’s your best guess for my score the next time I play it?
(The answer is 39700; I’m probably not going to improve with practice, and you have no way to know if 39700 is unusually good or unusually bad.)
My friend and I both play a certain video game a lot. The last time I played it, my score was 39700. The last time my friend played it, his score was 32100. What’s your best guess for my score the next time I play it?
(The answer is some number less than 39700; knowing that my friend got a lower score gives you a reason to believe that 39700 might be higher than normal.)
I played a video game for the first time yesterday. My score was 39700, and higher scores are better than lower ones. What’s your best guess for my score the next time I play it?
(The answer is some number higher than 39700, because I’m no longer an absolute beginner.)
True, a single data point can’t give you knowledge of regression effects. In the context of the original problem, Kahneman assumed that you had access to the average score of all the golfers on the first day.
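To make the dependence on the group average concrete, here is a rough sketch of a regressive prediction. The function name, the `reliability` parameter (an assumed day-to-day correlation between 0 and 1) and the value 0.5 are all hypothetical; neither the quote nor the comments give such a number.

```python
def regressive_prediction(score, group_mean, reliability):
    """Shrink an observed score toward the group mean.

    reliability is an assumed correlation between one performance and the
    next: 1.0 means no luck (repeat the score), 0.0 means all luck
    (predict the group mean).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return group_mean + reliability * (score - group_mean)

# Golf, with a known field average of 72 (assumed): a 66 is predicted to worsen.
print(regressive_prediction(66, 72, 0.5))        # 69.0

# Single data point: the only available estimate of the mean is the score
# itself, so the prediction does not move in either direction.
print(regressive_prediction(39700, 39700, 0.5))  # 39700.0

# Two players: crudely taking the average of 39700 and 32100 as the mean
# pulls the prediction below 39700.
print(regressive_prediction(39700, (39700 + 32100) / 2, 0.5))  # 37800.0
```

The direction of the adjustment comes entirely from where the score sits relative to the mean you assume, which is why the single-score case gives no regression at all.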
I played a video game for the first time yesterday. My score was 39700, and higher scores are better than lower ones. What’s your best guess for my score the next time I play it?
(The answer is some number higher than 39700, because I’m no longer an absolute beginner.)
I’m not sure it’s true that the answer is higher than 39700 in this case. It depends on whether you know how people generally improve, and on whether your score is higher than average for an absolute beginner. Since unknown factors could adjust the score either up or down, I would probably just guess that it will be the same the next day.
The existence of factors which could adjust the score either up or down does not indicate which factors dominate. In this case, you have no information which suggests that 39700 is either above or below the median, and therefore these two cases must be assigned equal probability—canceling out any “regression to the mean” effects you could have predicted. Similar arguments apply to other effects which change the score.
So you estimate “regression to the mean” effects as zero, and base your estimate on any other effects you know about and how strong you think they are. That makes sense. Thanks for the correction!
In this case, you have no information which suggests that 39700 is either above or below the median, and therefore these two cases must be assigned equal probability
Not quite: you have some background information about the range of scores video games usually employ.
And, I suppose, information about the probability of people mentioning average scores. I concede that either factor could justify arguing that the score should decrease.
It reminds me of E.T. Jaynes’ explanation of why time-reversible dynamic laws for (say) sugar molecules in water lead to a time-irreversible diffusion equation.
Daniel Kahneman, Thinking, Fast and Slow