It really is an important, well-written post, and I very much enjoyed it. I especially appreciate the twin studies example. I even think that something like that should maybe go into the wikitags, because of how often the title sentence appears everywhere? I’m relatively new to LessWrong though, so I’m not sure about the posts/wikitags distinction; maybe that’s not how it’s done here.
I have a pitch for how to make it even better, though. I think the part about “when you have lots of data” vs. “when you have less data” would be cleaner and more intuitive if it were rewritten as “when $X$ is discrete vs. continuous”. Right now the first example (the “more data” one) uses a continuous $X$; thus, the sentence “define $y_i$ as the sample mean of $Y$ taken over all $y_j$ for which $x_j = x_i$” creates confusion, since it is literally impossible to get the same value from a truly continuous random variable twice; it requires some sort of binning, which, yes, you do explain later. So it doesn’t really flow as a “when you have lots of data” case: nobody does that in practice with a truly continuous $X$, no matter how much data they have (at least as far as I know).
Now say we have a discrete $X$: e.g., an observation can come from classes A, B, or C. We have a total of $n$ observations, $n_j$ from class $j$. Turning the main spiel into numbers becomes straightforward:
On average, over all different values of $X$ weighted by their probability, the remaining variance in $Y$ is $1-p$ times the total variance in $Y$.
“Over all different values of $X$” → which we have three of;
“weighted by their probability” → we approximate the true probability of belonging to class $j$ as $n_j/n$, obviously;
“the remaining variance in $Y$” for class $j$ is $\widehat{\mathrm{Var}}_j = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(y_{ij} - \bar{y}_j\right)^2$, also obviously. And we are done, no excuses or caveats needed! The final formula becomes:
$$1 - p = \frac{\frac{1}{n}\sum_{j=1}^{3} n_j \widehat{\mathrm{Var}}_j}{\widehat{\mathrm{Var}}_{\mathrm{tot}}}$$
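For concreteness, here is a minimal sketch of this plug-in estimator in Python with NumPy; the function name and interface are mine, just for illustration, and it assumes every class has at least two observations:

```python
import numpy as np

def estimated_unexplained_variance(y, labels):
    """Plug-in estimate of 1 - p for a discrete X.

    y      : 1-D array of observations of Y
    labels : same-length array giving each observation's class of X
    """
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    n = len(y)
    var_tot = np.var(y, ddof=1)  # Var_hat_tot, the total sample variance of Y

    # (1/n) * sum_j n_j * Var_hat_j: probability-weighted within-class variance
    weighted_within = 0.0
    for cls in np.unique(labels):
        y_j = y[labels == cls]
        n_j = len(y_j)                                # needs n_j >= 2
        weighted_within += n_j * np.var(y_j, ddof=1)  # ddof=1 divides by n_j - 1
    weighted_within /= n

    return weighted_within / var_tot  # this is 1 - p; explained variance is 1 minus it
```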
An example: $(Y \mid X) \sim N(\mu_X, \sigma_X)$. Since we are creating the model, we know the true “platonic” explained variance. In this example, it’s about 0.386. An estimated explained variance on an $n = 200$ sample came out as 0.345 (code).
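A simulation in this spirit (with illustrative class probabilities, means, and standard deviations that are not necessarily the ones behind the 0.386 / 0.345 above, so the outputs will differ) might look like this, reusing the helper sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for the three classes (not the ones from the linked code)
classes = ["A", "B", "C"]
probs   = np.array([0.3, 0.3, 0.4])   # P(X = j)
mus     = np.array([0.0, 1.0, 3.0])   # mu_X per class
sigmas  = np.array([1.0, 1.5, 1.0])   # sigma_X per class

# True ("platonic") explained variance via the law of total variance:
# Var(Y) = E[Var(Y|X)] + Var(E[Y|X]),  p = Var(E[Y|X]) / Var(Y)
e_var  = probs @ sigmas**2                   # E[Var(Y|X)]
var_e  = probs @ mus**2 - (probs @ mus)**2   # Var(E[Y|X])
true_p = var_e / (e_var + var_e)

# Plug-in estimate on an n = 200 sample
n = 200
idx    = rng.choice(len(classes), size=n, p=probs)
y      = rng.normal(mus[idx], sigmas[idx])
labels = np.array(classes)[idx]
est_p  = 1 - estimated_unexplained_variance(y, labels)

print(f"true explained variance: {true_p:.3f}, estimate: {est_p:.3f}")
```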
After that, we can say that directly approximating the variance of $Y \mid X$ for every value of a continuous $X$ is impossible, so we need a regression model.
That ordering also prepares the reader for the twin study example, which can then be introduced as a discrete case with each “class” being a unique set of genes, where $n_j$ always equals two.
If you do decide that it’s a good idea but don’t feel like rewriting it, I guess we can collab on the post and I can write that part. Anyway, please let me know your thoughts if you feel like it.
Thanks for the comment Stepan!
I think it’s right that the distinction between “lots of data” and “less data” doesn’t really carve reality at its natural joints. But I feel like your distinction between “discrete” and “continuous” $X$ also doesn’t fully do this, since you could imagine a case of a discrete $X$ where we have only one $y$ for each $x$ in the dataset, and thus need regression too (at least in principle).
I think the real distinction is probably whether or not we have “several $y$’s for each $x$” in the dataset. The twin dataset case has that, and so even though it’s not a lot of data (only 32 pairs, or 64 total samples), we can essentially apply what I called the “lots of data” case.
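For what it’s worth, in the pairs case ($n_j = 2$ for every group) the within-group variance $\widehat{\mathrm{Var}}_j$ reduces to half the squared within-pair difference, so the grouped estimator above collapses to a few lines; a sketch, assuming the pairs arrive as two aligned arrays (my own illustrative framing):

```python
import numpy as np

def unexplained_variance_from_pairs(y1, y2):
    """1 - p when every 'class' has exactly two observations (e.g. twin pairs)."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    # With n_j = 2, Var_hat_j = (y_1j - y_2j)^2 / 2, so the weighted average of
    # within-class variances is just the mean of these per-pair values.
    within  = np.mean((y1 - y2) ** 2 / 2)
    var_tot = np.var(np.concatenate([y1, y2]), ddof=1)
    return within / var_tot
```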
Now, I have to admit that by this point I’m somewhat attached to the imperfect state of this post and won’t edit it anymore. But I’ve strongly upvoted your comment and weakly agreed with it, and I hope some confused readers will find it.