In software engineering (I’m speaking only as someone who writes software as-needed and has friends in professional software development, not as an expert myself), one of the problems is that an engineer or analyst will prematurely believe they have solved a particular software problem. Just because their code compiles and gives the result they expected on the simple inputs they can think of off the top of their head doesn’t mean it is ready to be shipped to the customer. For that, one needs to design suites of tests that check for bugs and mistakes systematically, at various levels of resolution. You need a game plan for designing these tests long before you finish writing the code, and you need to apply the standards of those tests ruthlessly, across the board, to your finished products.
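To make that concrete, here is a toy sketch (nothing from a real project, just Python’s standard unittest, with an invented function) of what I mean by checking at more than one level of resolution: the obvious happy-path input, the edge cases nobody thinks of off the top of their head, and a property that should hold for any input.

```python
# Toy example: a function plus a small test suite that probes it at
# several "levels of resolution" -- typical inputs, edge cases, and a
# simple property that should hold for any input.
import math
import unittest


def sample_variance(xs):
    """Unbiased sample variance of a list of numbers."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


class TestSampleVariance(unittest.TestCase):
    def test_typical_input(self):
        # The "it compiles and gives what I expected" check.
        self.assertAlmostEqual(sample_variance([1.0, 2.0, 3.0]), 1.0)

    def test_edge_cases(self):
        # Inputs nobody thinks of while writing the code.
        with self.assertRaises(ValueError):
            sample_variance([])
        with self.assertRaises(ValueError):
            sample_variance([42.0])
        self.assertAlmostEqual(sample_variance([5.0, 5.0, 5.0]), 0.0)

    def test_shift_invariance(self):
        # A property-style check: variance is unchanged by adding a constant.
        xs = [0.3, 1.7, -2.2, 4.0]
        shifted = [x + 100.0 for x in xs]
        self.assertTrue(math.isclose(sample_variance(xs),
                                     sample_variance(shifted)))


if __name__ == "__main__":
    unittest.main()
```

The point is not this particular function; it’s that the three tests were planned as a family, each aimed at a different way the code could quietly be wrong.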
In many places (academia is a large one; government labs, where I have also worked, are another) this sort of unit testing is just ignored. Analysts write software until they personally feel like it’s done. One other human being might scan their eyes over it before it is declared “check-in ready” and is being used to tell Air Force policy experts how to interpret radar results. I think we should all understand that this is really bad and happens many thousands of times every single day. I can’t tell you how many times I have discovered bugs in academic software upon which award-winning research papers were based. Journals rarely require you to submit the code, and often you only have to “describe your algorithms mathematically” in the papers, so a ton of important subjective choices the researcher made when analyzing the data get lost in translation.
I’m merely drawing a comparison between this and Bayesian data analysis. A lot of researchers tend to automatically believe their analysis is unassailably “rational” just because they had the foresight to use Bayesian methods rather than standard hypothesis testing. But this isn’t so. Extremely implausible prior distributions can, to a large extent, be detected. Independence assumptions in the model can also be checked by bootstrapping useful test statistics from the posterior. These are simple things that almost everyone should be doing in a regular, ruthlessly systematic way any time they want to declare a statistical success in their research.
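As a rough sketch of the kind of check I have in mind (the data and the “posterior draws” below are invented purely for illustration; in practice they would come from your actual fit), you can simulate replicated datasets from the posterior, compute a test statistic that is sensitive to the assumption you are worried about, and see where the observed data fall:

```python
# Rough sketch of a posterior predictive check for an independence
# assumption, assuming you already have posterior draws of (mu, sigma)
# from fitting an iid normal model to the data `y`.
import numpy as np


def lag1_autocorrelation(x):
    """Test statistic: lag-1 autocorrelation (should be near 0 under iid)."""
    x = np.asarray(x)
    xc = x - x.mean()
    return np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc)


def posterior_predictive_pvalue(y, mu_draws, sigma_draws, seed=None):
    """Fraction of replicated datasets whose statistic exceeds the observed one."""
    rng = np.random.default_rng(seed)
    t_obs = lag1_autocorrelation(y)
    t_rep = np.array([
        lag1_autocorrelation(rng.normal(mu, sigma, size=len(y)))
        for mu, sigma in zip(mu_draws, sigma_draws)
    ])
    return np.mean(t_rep >= t_obs)


# Made-up illustration: data with serial dependence that the iid model ignores.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200)) * 0.1 + rng.normal(size=200)
mu_draws = rng.normal(y.mean(), y.std() / np.sqrt(len(y)), size=1000)
sigma_draws = np.full(1000, y.std())  # crude stand-in for real posterior draws
print(posterior_predictive_pvalue(y, mu_draws, sigma_draws, seed=1))
# A value near 0 or 1 flags a misfit worth investigating.
```

None of this is sophisticated; it’s the statistical equivalent of the edge-case tests above, done every time rather than when someone happens to feel suspicious.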
As far as I’m concerned, model checking and unit testing are the “hygiene” of computational research.
However, there is one important difference between the software world and the statistical modelling world. While it is sometimes possible to produce a “bug-free” piece of software, it is never possible to formulate a statistical model that captures reality exactly; as Box said, “all models are wrong.” The challenge in statistical modelling is to find the model that strikes the best trade-off between convenience (conceptual, mathematical, or computational) and verisimilitude. “Model checking” of some form or another is essential to this process, but it doesn’t necessarily have to be standardized in a form analogous to a unit test. An alternative means to the same end is an increased emphasis on model selection among different models of the same data, which can be put into a formalized statistical framework, although this is difficult to do in practice and hence is not very commonly done at present.
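To sketch what I mean by a formalized framework (this is only one crude version, using AIC on invented data; cross-validation or fully Bayesian comparison are alternatives), you fit competing models to the same data and compare them on a common criterion:

```python
# Minimal sketch of formalized model comparison on the same data:
# fit two candidate models and compare them with AIC.  The data here
# are invented, and AIC is just one convenient criterion among many.
import numpy as np
from scipy import stats


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


rng = np.random.default_rng(1)
y = rng.standard_t(df=3, size=500)  # heavy-tailed data

# Model 1: normal with MLE mean and sd (2 parameters).
mu, sd = y.mean(), y.std()
ll_normal = stats.norm.logpdf(y, mu, sd).sum()

# Model 2: Student-t with df, loc, scale fit by maximum likelihood (3 parameters).
df, loc, scale = stats.t.fit(y)
ll_t = stats.t.logpdf(y, df, loc, scale).sum()

print("AIC normal:", aic(ll_normal, 2))
print("AIC t:     ", aic(ll_t, 3))
# Lower AIC suggests a better convenience/fit trade-off for these data.
```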
While I think your comment is generally true, I feel that it’s almost a disservice to emphasize this point. A huge number of problems in the statistical sciences could be overcome by just a tiny bit of uniformity in model checking procedures. If it were seen as “bad form” to submit a journal article without doing some model expansion checks, or without providing test statistic analysis that goes beyond classical p-values, then the quality of publications would jump. Even uniformity in classical p-value testing would be helpful. I don’t really like the use of classical p-values and test statistics, but they do say something about model validity. Even in that domain, though, the test statistics are not always computed correctly; the way in which they were computed is rarely reported; and there are tons of systematic errors made by folks unfamiliar with the theory behind the tests. Even if we had to keep using classical hypothesis testing, just getting people to apply the tests in a correct, systematic way would be a huge improvement. I would happily wager eating a stick of butter to get a world in which I didn’t have to read statistical results while thinking, “Okay, how did these authors mess this up? Are they reporting the right thing? Did they just keep gathering data until they reached the significance level they wanted? Etc...”
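On that last point, a quick simulation (toy numbers, nothing more) shows why “keep gathering data until significant” is by itself enough to wreck the nominal error rate, which is exactly the kind of thing a systematic checking culture would catch:

```python
# Quick simulation of the "keep gathering data until significant" problem:
# under the null, peeking after every batch and stopping at p < 0.05
# inflates the false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats


def false_positive_rate(peeking, n_sims=2000, batch=20, max_batches=10, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        data = np.empty(0)
        significant = False
        for _ in range(max_batches):
            data = np.concatenate([data, rng.normal(size=batch)])  # null is true
            p = stats.ttest_1samp(data, 0.0).pvalue
            if p < 0.05:
                significant = True
                if peeking:
                    break  # stop as soon as we "win"
        if peeking:
            hits += significant
        else:
            hits += (p < 0.05)  # only the final, fixed-n test counts
    return hits / n_sims


print("fixed-n test:     ", false_positive_rate(peeking=False))
print("optional stopping:", false_positive_rate(peeking=True))
```

The fixed-n rate sits near the nominal 5%, while the peeking rate climbs well above it, even though every individual t-test was computed “correctly.”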
Essentially, I think your comparison breaks down in one important way. While it may be possible to write software that is bug-free, it’s not as easy to prove that your code is as efficient as it needs to be, or that it will generalize to new use cases. Unit testing definitely focuses on proving correctness and the absence of bugs. But another, less directly objective part of it is showing that your code is well-suited to the computational task: why did you pick that particular algorithm, design pattern, or language? If you truly design unit tests well, then some of them will also address slightly higher-level issues like these, which are closer to the model checking issues.
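For instance, a suite might look something like this rough Python sketch (the function and thresholds are invented for illustration), where one test checks correctness against a brute-force reference and another asks whether the chosen algorithm is actually adequate for realistic input sizes:

```python
# Sketch of a test suite that goes beyond plain correctness: the fast
# implementation is checked against a brute-force reference, AND there
# is a coarse check that the chosen algorithm handles realistic input
# sizes -- which is where "did you pick the right algorithm?" shows up.
import bisect
import random
import unittest


def count_pairs_within(xs, d):
    """Count pairs (i < j) with |xs[i] - xs[j]| <= d, via sorting + bisect."""
    xs = sorted(xs)
    total = 0
    for i, x in enumerate(xs):
        # Elements after index i that lie within distance d of x.
        j = bisect.bisect_right(xs, x + d, lo=i + 1)
        total += j - (i + 1)
    return total


def count_pairs_within_bruteforce(xs, d):
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= d)


class TestCountPairs(unittest.TestCase):
    def test_matches_reference(self):
        # Correctness: agree with the slow, obviously-right version.
        random.seed(0)
        for _ in range(50):
            xs = [random.uniform(-10, 10) for _ in range(random.randint(0, 40))]
            self.assertEqual(count_pairs_within(xs, 1.5),
                             count_pairs_within_bruteforce(xs, 1.5))

    def test_handles_large_input(self):
        # Adequacy: an O(n log n) approach should shrug off n = 200,000,
        # which the brute-force version could not.
        xs = [random.uniform(0, 1) for _ in range(200_000)]
        self.assertGreaterEqual(count_pairs_within(xs, 0.0), 0)


if __name__ == "__main__":
    unittest.main()
```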
Also, I think the flip-side to the Box quote is just as important: “All models are right; most are useless.” This is discussed here.