When programs have to work—lessons from NASA

“They Write the Right Stuff” is about software which “never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have achieved. Consider these stats: the last three versions of the program—each 420,000 lines long—had just one error each. The last 11 versions of this software had a total of 17 errors. Commercial programs of equivalent complexity would have 5,000 errors.”
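To make those numbers concrete, here is my own back-of-the-envelope arithmetic (not the article's) putting the two defect rates side by side:

```python
# Rough defect-rate comparison using the figures quoted above.
# These are my calculations, not the article's.
loc = 420_000              # lines of code per version, as quoted
shuttle_errors = 1         # errors in each of the last three versions
commercial_errors = 5_000  # errors the article attributes to comparable commercial code

shuttle_rate = shuttle_errors / (loc / 1_000)        # errors per thousand lines (KLOC)
commercial_rate = commercial_errors / (loc / 1_000)

print(f"shuttle:    {shuttle_rate:.4f} errors/KLOC")     # ~0.0024
print(f"commercial: {commercial_rate:.1f} errors/KLOC")  # ~11.9
print(f"ratio:      ~{commercial_rate / shuttle_rate:,.0f}x")  # ~5,000x
```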

The programmers work from 8 to 5, with occasional late nights. They wear dressy clothes, not flashy or grungy. I assume there’s a dress code, but I have no idea whether conventional clothes are actually an important part of the process. I’m sure that working a reasonable number of hours is crucial, though I also wonder whether those hours need to be standard office hours.

“And the culture is equally intolerant of creativity, the individual coding flourishes and styles that are the signature of the all-night software world. ‘People ask, doesn’t this process stifle creativity? You have to do exactly what the manual says, and you’ve got someone looking over your shoulder,’ says Keller. ‘The answer is, yes, the process does stifle creativity.’” I have no idea what’s in the manual, or if there can be a manual for something as new as self-optimizing AI. I assume there could be a manual for some aspects.

What follows are the main points, quoted from the article:

1. The important thing is the process: the product is only as good as the plan for the product. About one-third of the process of writing software happens before anyone writes a line of code.

2. The best teamwork is a healthy rivalry. The central group breaks down into two key teams: the coders—the people who sit and write code—and the verifiers—the people who try to find flaws in the code. The two outfits report to separate bosses and function under opposing marching orders. The development group is supposed to deliver completely error-free code, so perfect that the testers find no flaws at all. The testing group is supposed to pummel away at the code with flight scenarios and simulations that reveal as many flaws as possible. The result is what Tom Peterson calls “a friendly adversarial relationship.”

I note that it’s rivalry between people who are doing different things, not people competing to get control of a project.

3. The database is the software base.

One database is the history of the code itself—with every line annotated, showing every time it was changed, why it was changed, when it was changed, what the purpose of the change was, what specifications documents detail the change. Everything that happens to the program is recorded in its master history. The genealogy of every line of code—the reason it is the way it is—is instantly available to everyone.
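The article doesn’t describe how that history is stored. If I had to sketch it, I’d picture each annotated change as a record roughly like the following; the field names and the example entry are entirely my own invention:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ChangeRecord:
    """One annotation in a hypothetical code-history database."""
    source_file: str     # file containing the changed line
    line_number: int     # which line was changed
    changed_on: date     # when it was changed
    reason: str          # why it was changed
    purpose: str         # what the change was meant to accomplish
    spec_documents: list[str] = field(default_factory=list)  # specs detailing the change

# A made-up example entry; the point is that the "genealogy" of any line
# can be looked up by anyone on the team.
example = ChangeRecord(
    source_file="ascent_guidance.src",
    line_number=1042,
    changed_on=date(1995, 3, 14),
    reason="guidance spec revision",
    purpose="tighten the attitude-error bound during the roll maneuver",
    spec_documents=["CR-90412"],
)
```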

The other database—the error database—stands as a kind of monument to the way the on-board shuttle group goes about its work. Here is recorded every single error ever made while writing or working on the software, going back almost 20 years. For every one of those errors, the database records when the error was discovered; what set of commands revealed the error; who discovered it; what activity was going on when it was discovered—testing, training, or flight. It tracks how the error was introduced into the program; how the error managed to slip past the filters set up at every stage to catch errors—why wasn’t it caught during design? during development inspections? during verification? Finally, the database records how the error was corrected, and whether similar errors might have slipped through the same holes.
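Again, the article gives no schema, but the fields it lists map naturally onto a record like this hypothetical sketch (in the same spirit as the one above):

```python
from dataclasses import dataclass
from enum import Enum

class Activity(Enum):
    TESTING = "testing"
    TRAINING = "training"
    FLIGHT = "flight"

@dataclass
class ErrorRecord:
    """One entry in a hypothetical error database, mirroring the fields the article lists."""
    discovered_on: str          # when the error was discovered
    revealing_commands: str     # what set of commands revealed it
    discovered_by: str          # who discovered it
    activity: Activity          # what was going on at the time
    introduced_how: str         # how the error got into the program
    escaped_design: str         # why design review didn't catch it
    escaped_inspection: str     # why development inspections didn't catch it
    escaped_verification: str   # why verification didn't catch it
    correction: str             # how the error was corrected
    similar_escapes: str        # whether similar errors might have slipped through the same holes
```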

The group has so much data accumulated about how it does its work that it has written software programs that model the code-writing process. Like computer models predicting the weather, the coding models predict how many errors the group should make in writing each new version of the software. True to form, if the coders and testers find too few errors, everyone works the process until reality and the predictions match.
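The article doesn’t say how those models actually work. A toy version, under the naive assumption that the expected error count scales with the amount of code changed at a rate estimated from history, might look like this:

```python
def predict_errors(changed_lines: int,
                   historical_errors: int,
                   historical_changed_lines: int) -> float:
    """Expected error count for a new version, extrapolated from past versions.

    A toy illustration, not the shuttle group's actual model.
    """
    rate_per_line = historical_errors / historical_changed_lines
    return changed_lines * rate_per_line

# Made-up numbers: 17 errors across the changes in the last 11 versions,
# and a new version that touches 25,000 lines.
expected = predict_errors(changed_lines=25_000,
                          historical_errors=17,
                          historical_changed_lines=440_000)
print(f"expected errors: {expected:.1f}")  # ~1.0
```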

4. Don’t just fix the mistakes—fix whatever permitted the mistake in the first place.

The process is so pervasive, it gets the blame for any error—if there is a flaw in the software, there must be something wrong with the way it’s being written, something that can be corrected. Any error not found at the planning stage has slipped through at least some checks. Why? Is there something wrong with the inspection process? Does a question need to be added to a checklist?

Importantly, the group avoids blaming people for errors. The process assumes blame—and it’s the process that is analyzed to discover why and how an error got through. At the same time, accountability is a team concept: no one person is ever solely responsible for writing or inspecting code. “You don’t get punished for making errors,” says Marjorie Seiter, a senior member of the technical staff. “If I make a mistake, and others reviewed my work, then I’m not alone. I’m not being blamed for this.”