Causal diagrams and software engineering

Fake explanations don’t feel fake. That’s what makes them dangerous. -- EY

Let’s look at “A Handbook of Software and Systems Engineering”, which purports to examine the insights from software engineering that are solidly grounded in empirical evidence. Published by the prestigious Fraunhofer Institut, this book’s subtitle is in fact “Empirical Observations, Laws and Theories”.

Now “law” is a strong word to use: the highest level to which an explanation can aspire, as it were. Sometimes it’s used in a jokey manner, as in “Hofstadter’s Law” (which certainly seems often to apply to software projects). But this definitely isn’t a jokey kind of book, that much we get from the appeal to “empirical observations” and the “handbook” denomination.

Here is the very first “law” listed in the Handbook:

Requirement deficiencies are the prime source of project failures.

Previously, we observed that in the field of software engineering, a last name followed by a year, surrounded by parentheses, seems to be a magic formula for suspending critical judgment in readers.

Another such formula, it seems, is the invocation of statistical results. Brandish the word “percentage”, assert that you have surveyed a largish population, and whatever it is you claim, some people will start believing. Do it often enough and some will start repeating your claim—without bothering to check it—starting a potentially viral cycle.

As a case in point, one of the most often cited pieces of “evidence” in support of the above “law” is the well-known Chaos Report, according to which the first cause of project failure is “Incomplete Requirements”. (The Chaos Report isn’t cited as evidence by the Handbook, but it’s representative enough to serve in the following discussion. A Google Search readily attests to the wide spread of the verbatim claim in the Chaos Report; various derivatives of the claim are harder to track, but easily verified to be quite pervasive.)

Some elementary reasoning about causal inference is enough to show that the same evidence supporting the above “law” can equally well be suggested as evidence supporting this alternative conclusion:

Project failures are the primary source of requirements deficiencies.

“Wait”, you may be thinking. “Requirements are written at the start of a project, and the outcome (success or failure) happens at the end. The latter cannot be the cause of the former!”

Your thinking is correct! As the descendant of a long line of forebears who, by virtue of observing causes and effects, avoided various dangers such as getting eaten by predators, you have internalized a number of constraints on causal inference. Without necessarily having an explicit representation of these constraints, you know that at a minimum, showing a cause-effect relationship requires the following:

  • a relationship between the variable labeled “cause” and the variable labeled “effect”; it need not be deterministic (as in “Y always happens after X”): a merely probabilistic relationship (“association” or “correlation”) will do

  • the cause must have happened before the effect

  • other causes which could also explain the effect are ruled out by reasoning or observation

Yet, notoriously, we often fall prey to the failure mode of only requiring the first of these conditions to be met:

Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.

One of the more recent conceptual tools for avoiding this trap is to base one’s reasoning on formal representations of cause-effect inferences, which can then serve to suggest the quantitative relationships that will confirm (or invalidate) a causal claim. Formalizing helps us bring to bear all that we know about the structure of reliable causal inferences.
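To make that concrete, here is a minimal sketch of what “formalizing” buys you, using toy variables of my own (X, Y and a common cause Z) and networkx purely as a convenient way to hold a directed graph. Once the assumed structure is written down explicitly, “could X be causing Y here?” becomes a mechanical check rather than a vibe:

import networkx as nx

# Hypothesized structure: a common cause Z drives both X and Y.
g = nx.DiGraph([("Z", "X"), ("Z", "Y")])

print(nx.has_path(g, "X", "Y"))   # False: no directed path, so X cannot be causing Y
print(nx.ancestors(g, "Y"))       # {'Z'}: under this structure, only Z can cause Y

# Data generated from this structure would still show X and Y correlated,
# because they share the ancestor Z -- exactly the eyebrow-waggling trap above.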

The “ruling out alternate explanations” bit turns out to be kind of a big deal. It is, in fact, a large part of the difference between “research” and “anecdote”. The reason you can’t stick the “science” label on everyday observations isn’t, by and large, that they are imprecise and merely qualitative; and conversely, dressing them up with percentages or averages isn’t enough to somehow turn them into science.

There are no mathematical operations which magically transform observations into valid inferences. Rather, doing “science” consists in good part of eliminating the various ways you could be fooling yourself. The hard part isn’t collecting the data; the hard part is designing the data collection, so that the data actually tells you something useful.

Here is an elementary practical application, in the context of the above “law”; let’s look at the study design. What is the Chaos Report’s methodology for establishing the widely circulated list of “top causes of software project failure”?

The Standish Group surveyed IT executive managers for their opinions about why projects succeed. (source)

The respondents to the Standish Group survey were IT executive managers. The sample included large, medium, and small companies across major industry segments: banking, securities, manufacturing, retail, wholesale, health care, insurance, services, local, state, and federal organizations. The total sample size was 365 respondents representing 8,380 applications.

The key terms here are “survey” and “opinion”. IT executives are being interviewed, after the relevant projects have been conducted and assessed, on what they think best explains the outcomes. We can formalize this with a causal diagram. We need to show four variables, and there are some obvious causal relationships:

digraph { "Actual\n requirements quality" -> "Reported\n requirements quality" "Actual\n project outcome" -> "Reported\n project outcome" }

Note that in a survey situation, the values of the “actual” variables are not measured directly; they are only “measured” indirectly via their effects on the “reported” variables.

As surmised above, we may rule out any effect of the reported results on the actual results, since the survey takes place after the projects. However, we may not rule out effects of the actual variables on the reported ones, including the “crossed” effects: actual requirements quality coloring the reported outcome, and the actual outcome coloring the reported requirements quality. This is reflected in our diagram as follows:

digraph { "Actual\n requirements quality" -> "Reported\n requirements quality" "Actual\n requirements quality" -> "Reported\n project outcome" "Actual\n requirements quality" -> "Actual\n project outcome" "Actual\n project outcome" -> "Reported\n project outcome" "Actual\n project outcome" -> "Reported\n requirements quality" [color=red, penwidth=2] }

An argument for the arrow in red could be formulated this way: “An IT executive being interviewed about the reason for a project failure is less likely to implicate their own competence, and more likely to implicate the competence of some part of the organization outside their scope of responsibility, for instance by blaming their non-IT interlocutors for poor requirements definition or insufficient involvement.” This isn’t just possible, it’s also plausible (based on what we know of human nature and corporate politics).

Hence my claim above: we can equally well argue from the evidence that “(actual) project outcomes are the primary source of (reported) requirements deficiencies”.
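If that still sounds abstract, here is a minimal simulation sketch, with made-up probabilities (nothing here comes from the Standish data): a world in which actual requirements quality has, by construction, no effect whatsoever on project outcomes, but in which executives debriefed after a failure tend to blame the requirements. The retrospective survey dutifully “finds” a strong link between reported requirements problems and failure anyway:

import random

random.seed(0)
N = 100_000
blamed_after_failure = []   # reports of poor requirements, among failed projects
blamed_after_success = []   # reports of poor requirements, among successful projects

for _ in range(N):
    good_requirements = random.random() < 0.5   # actual requirements quality: a coin flip
    failed = random.random() < 0.3              # actual outcome: independent of requirements!
    if failed:
        # The red arrow: after a failure, most executives report poor requirements,
        # regardless of what the requirements actually looked like.
        reported_poor_requirements = random.random() < 0.8
    else:
        # With a success to point to, reports track reality more honestly.
        reported_poor_requirements = (not good_requirements) and random.random() < 0.2
    (blamed_after_failure if failed else blamed_after_success).append(reported_poor_requirements)

print(f"P(reported poor requirements | failure) ≈ {sum(blamed_after_failure) / len(blamed_after_failure):.2f}")   # about 0.80
print(f"P(reported poor requirements | success) ≈ {sum(blamed_after_success) / len(blamed_after_success):.2f}")   # about 0.10

The association is entirely manufactured by the reporting pathway; asking people after the fact cannot distinguish this world from one where poor requirements genuinely doom projects.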

This is another piece of critical kit that a rationalist plying their trade in the software development business (and indeed, a rationalist anywhere) cannot afford to be without: correctly generating and labeling (if only mentally) the nodes and edges in an implied causal diagram, whenever reading about the results of an experimental or observational study. (The diagrams for this post were created using the nifty GraphViz Workspace.)

We might even see it as a candidate 5-second skill, one which unlocks the really powerful habit: asking the question “which causal pathways between cause and effect, offering an alternative explanation to the hypothesis under consideration, could be ruled out by a different experimental design?”
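One way to practice that question, sticking with networkx as before and using the node names from the diagrams (minus the line breaks): write the survey’s assumed structure down and list every directed path that can leave a trace in the answer to the “requirements” question. A candidate design can then be judged by which of those paths it severs.

import networkx as nx

survey = nx.DiGraph([
    ("Actual requirements quality", "Reported requirements quality"),
    ("Actual requirements quality", "Reported project outcome"),
    ("Actual requirements quality", "Actual project outcome"),
    ("Actual project outcome", "Reported project outcome"),
    ("Actual project outcome", "Reported requirements quality"),   # the red, blame-shifting edge
])

# All directed paths from the "actual" variables into the reported requirements answer:
for source in ("Actual requirements quality", "Actual project outcome"):
    for path in nx.all_simple_paths(survey, source, "Reported requirements quality"):
        print(" -> ".join(path))

# Any design that makes the last edge impossible (a report that cannot sit downstream
# of the outcome) also removes every path running through "Actual project outcome".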

Sometimes it’s hard to come up with a suitable experimental design, and in such cases sophisticated mathematical techniques are emerging that still let you extract good causal inferences from the imperfect data. This… isn’t one of those cases.

For instance, a differently designed survey would interview IT executives at the start of every project, and ask them the question: “do you think your user-supplied requirements are complete enough that this will not adversely impact your project?” As before, you would debrief the same executives at the end of the project. This study design yields the following causal diagram:

digraph { "Actual\n requirements quality" -> "Reported\n requirements quality" "Actual\n requirements quality" -> "Actual\n project outcome" "Actual\n project outcome" -> "Reported\n project outcome" }

Observe that we have now ruled out the argument from CYA, the self-protective reporting bias described above. This is a simple enough fix. Yet the industry for the most part (and despite sporadic outbreaks of common sense) persists in conducting, and uncritically quoting, surveys that do not block enough causal pathways to firmly establish the conclusions they report. Those conclusions have been floating around, largely uncontested, for decades now.