“They Write the Right Stuff” is about software which “never crashes. It never needs to be re-booted....”
This is incorrect in an interesting way.
There’s a famous story, among people who study Apollo history, of the 1201 and 1202 program alarms that occurred during Apollo 11, as described here and here. Those links are short and well worth reading in their entirety, but here’s a summary:
Apollo 11’s guidance computer had incredibly limited hardware by modern standards. When you read its specs, if you know anything about computers, you will not believe that people’s lives were trusted to something so primitive. As Neil and Buzz were performing their powered descent to the Moon, the guidance computer started emitting obscure “1201” and “1202” program alarms that they had never seen before. Significant computer problems at this stage, hovering over the Moon with only minutes of fuel to spare, would normally mean the astronauts should abort and return to orbit rather than attempt a landing and crash due to broken software. The program experts quickly determined that the alarms were ignorable, and the mission proceeded. As it turned out, the astronauts had been incorrectly trained to leave a switch on, which fed the computer radar data it shouldn’t have been getting (the switch wasn’t connected to a real computer during training, so this wasn’t noticed). This overloaded the computer, which had too much data to process given its hard real-time constraints. Then it did something that would be amazing in this era, much less 1969:
On Apollo 11, each time a 1201 or 1202 alarm appeared, the computer rebooted, restarted the important stuff, like steering the descent engine and running the DSKY to let the crew know what was going on, but did not restart all the erroneously-scheduled rendezvous radar jobs. The NASA guys in the MOCR knew—because MIT had extensively tested the restart capability—that the mission could go forward.
This auto-restart ability, combined with prioritization, allowed the computer to (literally) reboot every 10 seconds, while continuing to handle tasks whose failure would kill the astronauts, and dropping less important tasks.
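Here’s a minimal sketch of that restart pattern, in C. It is emphatically not how the AGC actually did it (the real machine used restart tables and phase counters at a much lower level); the job names and the RESTART_CUTOFF constant are my own illustrative inventions. The point is just the shape of the idea: on a restart, rebuild the schedule from scratch, keeping only the jobs that matter.

```c
#include <stdio.h>

#define RESTART_CUTOFF 3  /* jobs at this priority or better survive a restart */

typedef struct {
    const char *name;
    int priority;         /* lower number = more important */
    void (*run)(void);
} Job;

static void steer_engine(void)   { puts("steering descent engine"); }
static void update_display(void) { puts("updating DSKY"); }
static void radar_job(void)      { puts("processing rendezvous radar data"); }

static Job jobs[] = {
    { "engine steering",  0, steer_engine },
    { "crew display",     1, update_display },
    { "rendezvous radar", 7, radar_job },
};

/* Called after every reboot: re-schedule only the critical jobs;
 * erroneously scheduled low-priority work simply disappears. */
static void restart(void) {
    for (size_t i = 0; i < sizeof jobs / sizeof jobs[0]; i++) {
        if (jobs[i].priority <= RESTART_CUTOFF)
            jobs[i].run();
        else
            printf("dropped after restart: %s\n", jobs[i].name);
    }
}

int main(void) {
    restart();  /* simulate one 1201/1202-style recovery */
    return 0;
}
```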
The thing about space software is that it’s enormously, insanely expensive in real terms (i.e. it requires lots of time from lots of skilled people). Ordinary software (desktop, server, phone, console, you name it) is cheaper, bigger, and evolves more rapidly. It’s also buggier, but its bugs typically don’t kill people and rarely cost a billion dollars. NASA has done things wrong, but their approach to software is perfectly suited to their requirements.
Then it did something that would be amazing in this era, much less 1969: [snip description of reboot]
That’s not really amazing. It’s par for the course for modern microcontrollers, of the sort that litter the innards of modern cars and tractors and such. They usually keep their programs in NOR Flash memory, so they don’t need to be read from a hard drive on start-up, and don’t need to keep much state in volatile memory. And they are usually designed to be able to start up in the blink of an eye. There are fairly cheap microcontrollers with better specs than the Apollo Guidance Computer, and they’re common in applications that need reliable embedded software. It’s a safe bet that the private space industry uses quite a lot of them. And the job prioritization is typical for any system designed to be hard realtime.
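To make that concrete, here’s a generic sketch of the watchdog-timer pattern those microcontrollers use. The two macros are hypothetical stand-ins; every chip family has its own registers for this (AVR’s WDTCSR, STM32’s IWDG, and so on), but the shape is always the same.

```c
/* Hypothetical stand-ins for chip-specific register writes. */
#define WDT_ENABLE() ((void)0)  /* arm the watchdog with, say, a 100 ms timeout */
#define WDT_KICK()   ((void)0)  /* reset the watchdog countdown */

static void read_sensors(void)   { /* ... */ }
static void update_outputs(void) { /* ... */ }

int main(void) {
    WDT_ENABLE();
    for (;;) {
        read_sensors();
        update_outputs();
        /* We finished a cycle in time, so postpone the reset. If a bug
         * ever hangs this loop, the kicks stop, the watchdog expires,
         * and the hardware reboots the chip, which restarts from flash
         * in a few milliseconds. */
        WDT_KICK();
    }
}
```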
Even in big computers like the one on your desk, failing really quickly and well can help with reliability. There’s a school of thought in server design which says that servers should consist of large numbers of isolated parts, which crash if anything goes wrong, and can be rebooted very quickly. This is how most web sites stay up despite bugs, random crashes, and server failures.
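A bare-bones illustration of that school of thought is the classic Unix supervisor loop below: fork a worker, wait for it to die (for any reason, including a crash), and immediately start a fresh one. Production systems use Erlang/OTP supervisors, process managers, and load balancers instead of twenty lines of C, but the principle is the same.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Stand-in for real request handling; it crashes after a second
 * purely so the restart is visible. A bug here kills only this
 * process, never the supervisor. */
static void worker(void) {
    sleep(1);
    abort();
}

int main(void) {
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {          /* child: become the worker */
            worker();
            _exit(0);
        }
        waitpid(pid, NULL, 0);   /* block until the worker dies, however it dies */
        fprintf(stderr, "worker %d exited; restarting\n", (int)pid);
    }
}
```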
I think what is interesting is not the reboot but the fact that every task was prioritized and unimportant ones were inherently discarded. I do not think this is a feature typical of embedded programming.
That’s actually a very common realtime scheduling algorithm: execute the highest-priority task ready to run at any time, and discard the lower-priority tasks if you don’t have time for them. It’s popular because of situations exactly like the one the Apollo Guidance Computer ran into.
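Here’s a toy version of that algorithm, just to make it concrete; real RTOS executives (and the AGC’s) add preemption and far more bookkeeping, and the task names here are invented for illustration. Each frame, the scheduler runs the highest-priority ready task until the time budget is spent; whatever is left over simply doesn’t run.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_TASKS 3

typedef struct {
    const char *name;
    int priority;       /* lower number = higher priority */
    bool ready;
    void (*run)(void);
} Task;

static void guidance(void) { puts("guidance"); }
static void display(void)  { puts("display"); }
static void radar(void)    { puts("radar"); }

static Task tasks[NUM_TASKS] = {
    { "guidance", 0, true, guidance },
    { "display",  1, true, display },
    { "radar",    2, true, radar },
};

/* One scheduling decision: of all ready tasks, pick the most important. */
static Task *pick(void) {
    Task *best = NULL;
    for (int i = 0; i < NUM_TASKS; i++)
        if (tasks[i].ready && (!best || tasks[i].priority < best->priority))
            best = &tasks[i];
    return best;
}

int main(void) {
    int budget = 2;  /* simulate overload: CPU time for only two tasks this frame */
    while (budget-- > 0) {
        Task *t = pick();
        if (!t) break;
        t->run();
        t->ready = false;  /* done until the next frame */
    }
    /* The radar task, being lowest priority, is the one that silently
     * never runs: exactly the behavior that saved the Apollo 11 landing. */
    return 0;
}
```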
This wasn’t actually a computer software failure; it was a failure of procedure development. It also suggests their training should have been a high-fidelity simulation, which would have caught this problem on the ground right away. So it’s maybe a testing failure, but even then not a testing failure of the software alone: it was a failure to test the entire landing system (hardware, software, and human procedures together).
I didn’t say it was.