jimmy comments on Simpson’s paradox and the tyranny of strata

jimmy 20 Nov 2020 17:21 UTC
3 points
(If you’re worried about the difference being due to random chance, feel free to multiply the number of animals by a million.)
[...]
They vary from these patterns, but never enough that they are flying the same route on the same day, at the same time of day, at the same time of year. If you want to compare, you can group flights by cities or day or time or season, but not all of them.
The problem you’re using Simpson’s paradox to point at does not have this same property of “multiplying the size of the data set by arbitrarily large numbers doesn’t help”. If you can keep taking data until random chance is no issue, then they will end up having sufficient data in all the same subgroups, and you can just read the correct answer off the last million times they both flew in the same city/day/time/season simultaneously.
The problem you’re pointing at fundamentally boils down to not having enough data to force your conclusions, and therefore needing to make a judgement about how important season is compared to time of day, so that you can determine when conditioning on more factors will help relevance more than it will hurt by adding more noise.
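To make the "read the answer off the shared subgroups" idea concrete, here is a minimal sketch. The function name `compare_in_shared_strata`, the group labels `"A"` and `"B"`, and the toy records are all hypothetical, made up for illustration. It compares two groups only within strata where both have at least `min_count` observations, and silently drops strata where one group never appears.

```python
from collections import defaultdict

def compare_in_shared_strata(records, min_count=1):
    """records: iterable of (group, stratum, outcome) with group "A" or "B".

    Returns {stratum: (mean_A, mean_B)} for every stratum in which both
    groups have at least `min_count` observations.
    """
    # Per stratum, per group: [running sum of outcomes, observation count].
    cells = defaultdict(lambda: {"A": [0.0, 0], "B": [0.0, 0]})
    for group, stratum, outcome in records:
        cell = cells[stratum][group]
        cell[0] += outcome
        cell[1] += 1

    shared = {}
    for stratum, by_group in cells.items():
        (sum_a, n_a), (sum_b, n_b) = by_group["A"], by_group["B"]
        if n_a >= min_count and n_b >= min_count:
            shared[stratum] = (sum_a / n_a, sum_b / n_b)
    return shared

# Hypothetical toy data: outcome 1 = on time, 0 = late.
records = [
    ("A", ("NYC", "winter"), 1),
    ("A", ("NYC", "winter"), 0),
    ("B", ("NYC", "winter"), 1),
    ("A", ("LA", "summer"), 1),  # B has no data here, so no comparison is possible
]
shared = compare_in_shared_strata(records)
print(shared)
```

With more data, more strata clear the `min_count` bar for both groups; a stratum that one group never visits stays excluded no matter how much data is collected.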
- Pattern 20 Nov 2020 18:16 UTC
  2 points
  Parent
  So you just need enough data that the number of events involving entities is much greater than the number of parameters.
  - dynomight 20 Nov 2020 19:04 UTC
    1 point
    Parent
    You definitely need an amount of data at least exponential in the number of parameters, since the number of “bins” is exponential. (It’s not so simple as to say that exponential is enough, because it depends on the distributional overlap. If there are cases where one group never hits a given bin, then even an infinite amount of data doesn’t save you.)
- dynomight 20 Nov 2020 18:58 UTC
  1 point
  Parent
  I see what you’re saying, but I was thinking of a case where there is zero probability of having overlap among all features. While that technically restores the property that you can multiply the dataset by arbitrarily large numbers, it feels a little like “cheating” and I agree with your larger point.
  
  I guess Simpson’s paradox does always have a right answer in “stratify along all features”, it’s just that the amount of data you need increases exponentially in the number of relevant features. So I think that in the real world you can multiply the amount of data by a very, very large number and it won’t solve the problem, even though a large enough number will.
  
  In the real world it’s often also sort of an open question if the number of “features” is finite or not.
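As a rough illustration of the exponential growth in strata discussed in this thread, the sketch below counts the bins and gives a best-case lower bound on the data needed to fill them. The function names, the default of 2 levels per feature, and the target of 30 observations per cell are assumptions chosen for the example, not figures from the discussion. The bound assumes observations fall perfectly evenly across bins, which real data will not do.

```python
def n_strata(n_features, levels_per_feature=2):
    """Number of bins when every feature is crossed with every other."""
    return levels_per_feature ** n_features

def min_samples_for_full_overlap(n_features, per_cell=30,
                                 levels_per_feature=2, n_groups=2):
    """Best-case sample count so each group has `per_cell` points in each bin.

    Assumes a perfectly even spread across bins; uneven or non-overlapping
    distributions need more data, or can never be filled at all.
    """
    return n_groups * per_cell * n_strata(n_features, levels_per_feature)

for d in (2, 10, 20, 40):
    print(d, n_strata(d), min_samples_for_full_overlap(d))
```

Even under these optimistic assumptions the requirement doubles with each added binary feature, which is why multiplying a real data set by a very large constant can still leave most bins empty.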