A better methodology would have been to use piecewise (or “hockey-stick”) regression, which assumes the data falls into two sections (typically one sloping downwards and one sloping upwards), tries to find the right breakpoint, and performs a separate linear regression on each side of the break, with the two fits meeting at the break. (I almost called this “The case of the missing hockey-stick”, but thought that would give the answer away.)
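For concreteness, here is a minimal sketch of hockey-stick fitting on made-up data. The breakpoint is found by a brute-force grid search over the observed x values, rather than by any particular published algorithm; the hinge term max(x − t, 0) is what forces the two linear pieces to meet at the break:

```python
import numpy as np

def fit_hockey_stick(x, y):
    """Fit y = a + b*x + c*max(x - t, 0) by grid-searching the breakpoint t.

    The hinge term is zero left of t, so the two linear pieces share the
    value at t and therefore meet at the break. Returns (a, b, c, t) for
    the breakpoint with the smallest sum of squared errors.
    """
    best = None
    # Candidate breakpoints: interior x values only, so both pieces get data.
    for t in np.unique(x)[1:-1]:
        hinge = np.maximum(x - t, 0.0)
        A = np.column_stack([np.ones_like(x), x, hinge])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        sse = np.sum((A @ coef - y) ** 2)
        if best is None or sse < best[0]:
            best = (sse, coef, t)
    (a, b, c), t = best[1], best[2]
    return a, b, c, t

# A noiseless V shape with its vertex at x = 5: slope -1, then slope +1.
x = np.arange(11, dtype=float)
y = np.abs(x - 5)
a, b, c, t = fit_hockey_stick(x, y)
# Left slope is b; right slope is b + c.
```

On this toy V the search recovers the breakpoint at 5 with a left slope of −1 and a right slope of +1 (i.e., c = 2).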
An even better methodology would be to allow for higher order terms in the regression model. Adding squared terms, the model would look like this:
Y = a_1 X + b_1 X^2 + c
or
Y = a_1 X_1 + b_1 X_1^2 + a_2 X_2 + b_2 X_2^2 + ... + a_n X_n + b_n X_n^2 + c
This would allow for those nice-looking curves you were talking about. And it can be combined with logistic regression. Really, regression is very flexible; there’s no excuse for what they did.
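To make the single-variable quadratic model above concrete, here is a minimal sketch (with made-up data) of fitting Y = a·X + b·X² + c by ordinary least squares on the design matrix [X, X², 1]:

```python
import numpy as np

# Hypothetical U-shaped data generated from a known quadratic,
# so we can check that least squares recovers the coefficients.
x = np.linspace(-3, 3, 25)
y = 2.0 * x ** 2 - 4.0 * x + 1.0      # true curve: b = 2, a = -4, c = 1

# Design matrix [X, X^2, 1]; the model is still *linear in the coefficients*.
A = np.column_stack([x, x ** 2, np.ones_like(x)])
a, b, c = np.linalg.lstsq(A, y, rcond=None)[0]
```

The squared column is just another regressor as far as the solver is concerned, which is exactly why this still counts as linear regression.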
Also, the scientists could have done a little model checking. If what Phil says about the U/J-shaped response curve is true, the first-order model would have been rejected by some sensible model selection criterion (AIC, BIC, stepwise selection, a lack-of-fit F test, etc.).
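As a sketch of that model check: on U-shaped data, AIC (computed here in its least-squares form, n·log(SSE/n) + 2k) decisively prefers the quadratic model over the first-order one. The data below is simulated, not from the study under discussion:

```python
import numpy as np

def aic_ls(cols, y):
    """AIC for a Gaussian least-squares fit: n * log(SSE / n) + 2 * k."""
    A = np.column_stack(cols)
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    sse = np.sum((A @ coef - y) ** 2)
    n, k = len(y), A.shape[1]
    return n * np.log(sse / n) + 2 * k

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = (x - 1) ** 2 + rng.normal(0, 0.3, x.size)   # U-shaped response + noise

ones = np.ones_like(x)
aic_linear = aic_ls([ones, x], y)               # first-order model
aic_quad = aic_ls([ones, x, x ** 2], y)         # adds the squared term
# Lower AIC wins; the extra coefficient only costs 2 in the penalty term,
# while the squared term removes almost all of the lack of fit.
```

BIC works the same way with a log(n)·k penalty instead of 2k; on data this clearly curved, the two criteria agree.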
related side note: In my grad stat classes, “Linear Regression” usually includes things like my example above—i.e. linear functions of the (potentially transformed) explanatory variables, including higher order terms. Is this different from how the term is widely used?
unrelated side note: is there a way to type pretty math in the comments?
followup question: are scientists outside of the field of statistics really this dumb when it comes to statistics? It seems like they see their standard methods (i.e., regression) as black boxes that take data as an input and then output answers. Maybe my impression is skewed by the examples popping up here on LW.
related side note: In my grad stat classes, “Linear Regression” usually includes things like my example above—i.e. linear functions of the (potentially transformed) explanatory variables, including higher order terms. Is this different from how the term is widely used?
I don’t think it is. I seem to remember reading in Wonnacott & Wonnacott’s textbook that you can still call it ‘linear regression’ whether or not one of those regressors is a nonlinear function of another.
That makes sense intuitively, because a linear regression algorithm doesn’t care where your regressors come from, so conceptually it’s irrelevant whether they all turn out to be different functions of the same variable (for example). (Barring obvious exceptions like your regressors all being linear functions of the same variable, which would of course mess up your regression.)
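That caveat about regressors being linear functions of the same variable can be checked directly via the rank of the design matrix. A tiny sketch with hypothetical numbers:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ones = np.ones_like(x)

# x and x^2 are different *nonlinear* functions of the same variable:
# the columns are independent and the regression is well-posed.
A_ok = np.column_stack([ones, x, x ** 2])

# x and 2x + 3 are both *linear* functions of the same variable:
# the third column equals 2*x + 3*ones, so the columns are linearly
# dependent and the normal equations are singular.
A_bad = np.column_stack([ones, x, 2 * x + 3])

rank_ok = np.linalg.matrix_rank(A_ok)
rank_bad = np.linalg.matrix_rank(A_bad)
```

The first matrix has full column rank (3); the second is rank-deficient (2), which is exactly the “mess up your regression” case.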
unrelated side note: is there a way to type pretty math in the comments?
I don’t know of one, but I haven’t been here long!
followup question: are scientists outside of the field of statistics really this dumb when it comes to statistics?
My understanding is, a lot of them aren’t...but a lot of them are.
Yes. Quadratic regression is often better. The problem is that the number of coefficients to fit grows quadratically with the number of variables, which goes against Ockham’s razor. This is precisely the problem I am working on these days, though in the context of the oil industry.
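To see the quadratic growth concretely, assume the full quadratic model (intercept, linear terms, squared terms, and pairwise cross terms; the model quoted earlier in the thread omits the cross terms):

```python
# Coefficient count for a full quadratic model in n variables:
# 1 intercept + n linear + n squared + n*(n-1)/2 cross terms.
def n_quadratic_coefs(n):
    return 1 + n + n + n * (n - 1) // 2

counts = [n_quadratic_coefs(n) for n in (1, 5, 20)]
# 1 variable -> 3 coefficients; 5 -> 21; 20 -> 231.
```

At 20 variables the model already has 231 free coefficients, which is where the Ockham’s-razor objection bites.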
It’s not too difficult to check to see if adding the extra terms improves the regression. In my original comment, I listed AIC and BIC among others. On the other hand, different diagnostics will give different answers, so there’s the question of which diagnostic to trust if they disagree. I haven’t learned much about regression diagnostics yet, but at the moment they all seem ad hoc (maybe because I haven’t seen the theory behind them yet).
In my grad stat classes, “Linear Regression” usually includes things like my example above—i.e. linear functions of the (potentially transformed) explanatory variables, including higher order terms. Is this different from how the term is widely used?
If you say W = X×X and then fit a model that’s linear in W, it’s a linear model. If you use both X and X×X, I don’t think there was a definitive answer… until Wikipedia, of course. Which says no.
unrelated side note: is there a way to type pretty math in the comments?
Yes: Comment formatting
thanks!
If you say W = XxX, then make a model that’s linear in W, it’s a linear model. If you use both X and XxX, I don’t think there was a definitive answer… until Wikipedia, of course. Which says no.
Er, what? It says yes.