However, there is one important difference between the software world and the statistical modelling world. While it is sometimes possible to produce a “bug-free” piece of software, it is never possible to formulate a statistical model that captures reality exactly; as Box said, “all models are wrong.” The challenge in statistical modelling is to find the model that makes the best trade-off between convenience (conceptual, mathematical, or computational) and verisimilitude. “Model checking” of some form or another is essential to this process, but it doesn’t necessarily have to be standardized in a form analogous to a unit test. An alternative means to the same end is an increased emphasis on model selection among different models of the same data, which can be put into a formalized statistical framework, although this is difficult to do in practice and hence not very commonly done at present.
While I think your comment is generally true, I feel that it’s almost a disservice to emphasize this point. A huge number of problems in the statistical sciences could be overcome by just a tiny bit of uniformity in model checking procedures. If it were seen as “bad form” to submit a journal article without doing some model expansion checks, or without providing test statistic analysis that goes beyond classical p-values, then the quality of publications would jump. Even uniformity in classical p-value testing would help. I don’t really like classical p-values and test statistics, but they do say something about model validity. Yet even in that domain, the test statistics are not always computed correctly; the way they were computed is rarely reported; and there are tons of systematic errors made by folks unfamiliar with the theory behind the tests. Even if we had to continue using classical hypothesis testing, just getting people to apply the tests in a correct, systematic way would be a huge improvement. I would happily wager eating a stick of butter to get a world in which I didn’t have to read statistical results while thinking, “Okay, how did these authors mess this up? Are they reporting the right thing? Did they just keep gathering data until they reached the significance level they wanted? Etc...”
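To make the “beyond classical p-values” idea concrete, here is a minimal sketch of a predictive-check-style diagnostic: simulate replicated data under the fitted model and compare a test statistic on the observed data to its reference distribution. The model, data, and chosen statistic here are all illustrative assumptions, not a prescribed procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data (purely illustrative).
y = rng.normal(loc=1.0, scale=2.5, size=100)

# Fit a simple normal model by the usual plug-in estimates.
mu_hat, sigma_hat = y.mean(), y.std(ddof=1)

def t_stat(data):
    # Test statistic: maximum absolute value, a tail feature that
    # a classical test on the mean alone would not examine.
    return np.max(np.abs(data))

# Simulate replicated datasets under the fitted model.
reps = np.array([
    t_stat(rng.normal(mu_hat, sigma_hat, size=y.size))
    for _ in range(1000)
])

# Predictive-check p-value: how often replicated data look at
# least as extreme as the observed data on this statistic.
p_check = np.mean(reps >= t_stat(y))
# Values near 0 or 1 flag misfit on this aspect of the model.
```

The point of the sketch is the workflow, not the particular statistic: any feature of the data the analyst cares about can be checked the same way, and reporting which statistic was used (and how) addresses exactly the reproducibility complaint above.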
Essentially, I think your comparison breaks down in one important way. While it may be possible to write software that is bug-free, it’s not as easy to prove that your code is as efficient as it needs to be, or that it will generalize to new use cases. Unit testing definitely focuses on proving correctness and the absence of bugs. But another, less directly objective part of it is showing that your code is well suited to the computational task. Why did you pick the algorithm, design pattern, or language that you did? If you design unit tests well, some of them will also address these slightly higher-level issues, which are closer to the model checking issues.
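As a sketch of that distinction, the tests below check both plain correctness and a crude “suited to the task” property (speed on large input, as a rough proxy for the algorithmic choice). The function and thresholds are illustrative assumptions, not a recommended testing standard.

```python
import time
import unittest

def running_mean(xs):
    """One-pass incremental mean: O(n) time, O(1) extra memory."""
    mean = 0.0
    for i, x in enumerate(xs, start=1):
        mean += (x - mean) / i
    return mean

class TestRunningMean(unittest.TestCase):
    def test_matches_direct_formula(self):
        # Correctness: the classic target of a unit test.
        xs = [1.0, 2.0, 3.0, 4.0]
        self.assertAlmostEqual(running_mean(xs), sum(xs) / len(xs))

    def test_scales_to_large_input(self):
        # Higher-level check: does the chosen algorithm stay fast
        # at realistic scale? (A crude guard against an accidental
        # quadratic rewrite, not a rigorous benchmark.)
        xs = range(1_000_000)
        start = time.perf_counter()
        running_mean(xs)
        self.assertLess(time.perf_counter() - start, 5.0)
```

The second test is exactly the kind of “why this algorithm?” question raised above, encoded as an executable check rather than a code-review comment.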
Also, I think the flip-side to the Box quote is just as important: “All models are right; most are useless.” This is discussed here.