I have run large, complex software systems in production. And if I could give people one piece of advice, it’s that there are no easy answers. Here are a handful of examples of things that won’t solve all your problems:
Staging environments are great, but no matter how you try, they’ll always differ from production in some small way: different hostnames, much simpler user data in the staging environment, disabled email subsystems, different 3rd party auth system config, something. And sooner or later, you’ll get burnt.
Restarting failed subsystems sounds like a clever idea. But when subsystems are failing because load is too high, restarting them usually pushes internal load even higher, often catastrophically so (see the backoff sketch after these examples).
“We have a backup system Y that acts as a placeholder when system X goes down.” Yes, but since nobody ever uses system Y in production under real load, it turns out that the config bit-rotted 6 months ago, and nobody noticed.
I could go on for a while.
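Since “just restart it” comes up so often, here’s a minimal sketch of the alternative to the restart storm described above. Everything here is invented for illustration (the worker, the delays, the retry cap): the point is simply that a supervisor should back off instead of hammering an already-overloaded dependency in a tight loop.

```rust
use std::thread::sleep;
use std::time::Duration;

// Stand-in for spawning a subsystem that is currently failing under load.
fn start_worker() -> Result<(), &'static str> {
    Err("dependency overloaded")
}

fn main() {
    let mut delay = Duration::from_millis(100);
    let max_delay = Duration::from_secs(30);

    // Bounded retries with exponential backoff: a supervisor that loops
    // forever with no delay is itself extra load on whatever is already
    // falling over.
    for attempt in 1..=5 {
        match start_worker() {
            Ok(()) => {
                println!("worker running after {attempt} attempt(s)");
                return;
            }
            Err(e) => {
                eprintln!("attempt {attempt} failed: {e}; backing off {delay:?}");
                sleep(delay);
                delay = (delay * 2).min(max_delay);
            }
        }
    }
    eprintln!("giving up and paging a human instead of retrying forever");
}
```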
But then, once you have become deeply pessimistic and paranoid, you finally build a system that runs flawlessly for 4 years. It never needs bug fixes. It never needs attention. It just sits in a corner, doing its job perfectly. In fact, people forget how it works. Then they forget that it’s there. People move on. Compiler versions get upgraded. Your CI system gets replaced. Management goes through two different wiki initiatives, losing information each time.
And then one day, someone deprecates an old version of TLS, and the ancient system stops being able to talk to some API. And then the world burns.
So if your model is “a single unwrap shouldn’t bring you down, because you should obviously have been doing A, B and C elsewhere”, then you’re probably just trading off different kinds of disasters.
A better model is “We’ll fix potential failures at every possible level. And hopefully, when the shit finally hits the fan, at least one of those levels might hold.” So you have staging and monitoring and fallback systems and extensive testing and a chaos monkey and documentation and API “fuses” and back-pressure and load-shedding and proofs and paranoid code reviews and incremental rollout and root cause analysis. And so you fail less, and less, and less. But one day, the fact that you wrote an actual, sensible behavior for 21 items and tested it? That will be what prevents some ludicrous cascading failure.
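To make that last sentence concrete, here’s a tiny sketch of what “a sensible behavior for every item” can look like (the enum, its variants, and the fallback are invented for illustration): an exhaustive match with no catch-all arm means the compiler flags any new case before production does, and falling back instead of calling unwrap() turns a process-killing panic into a degraded-but-alive response.

```rust
#[derive(Debug)]
enum PaymentState {
    Pending,
    Authorized,
    Captured,
    Refunded,
    Chargeback,
}

fn user_facing_label(state: &PaymentState) -> &'static str {
    // No `_ => ...` arm: adding a variant fails compilation until someone
    // writes down what the system should actually do with it.
    match state {
        PaymentState::Pending => "Processing",
        PaymentState::Authorized => "Approved",
        PaymentState::Captured => "Paid",
        PaymentState::Refunded => "Refunded",
        PaymentState::Chargeback => "Under review",
    }
}

fn parse_state(raw: &str) -> PaymentState {
    // Instead of parsing with `.unwrap()` and panicking on bad input,
    // fall back to a safe default and log, so one malformed record
    // can't take the whole service down.
    match raw {
        "pending" => PaymentState::Pending,
        "authorized" => PaymentState::Authorized,
        "captured" => PaymentState::Captured,
        "refunded" => PaymentState::Refunded,
        "chargeback" => PaymentState::Chargeback,
        other => {
            eprintln!("unknown payment state {other:?}; treating as Pending");
            PaymentState::Pending
        }
    }
}

fn main() {
    let state = parse_state("chargeback");
    println!("{}", user_facing_label(&state));
}
```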
TL;DR: It’s nice that you have multiply redundant horseback messengers. But still, check the nails in their horseshoes regularly, because for want of a nail, an exciting new complex failure mode was discovered, and the battle was still lost.
This has multiple implications for AI alignment, and few of them are good news.