It’s really useful to ask the simple question “what tests could have caught the most costly bugs we’ve had?”
At one job, our code had a lot of math, and the worst bugs were the ones where our data pipelines ran without crashing but produced the wrong numbers, sometimes for weird reasons like “a bug in our vendor’s code caused them to send us numbers denominated in pounds instead of dollars.” That kind of failure is very hard to catch with unit tests. We ended up adding a layer of statistical checks that ran every hour or so and raised an alert if anything looked anomalous, and those alerts probably saved us more money than all the other tests combined.
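A minimal sketch of what such a statistical check might look like (the function name, threshold, and data are all illustrative, not from my actual system): compare the latest pipeline output against a trailing window of historical values and alert when it deviates by too many standard deviations.

```python
# Hypothetical sketch of an hourly statistical sanity check: flag the
# latest value if it sits too many standard deviations from the mean
# of recent history. All names and numbers here are illustrative.
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=4.0):
    """Return True if `latest` is more than `z_threshold` standard
    deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is suspicious
    return abs(latest - mu) / sigma > z_threshold

# A pounds-vs-dollars style mix-up (~1.3x scaling) on otherwise stable
# numbers stands out immediately, even though nothing crashed:
daily_totals = [100.0, 101.5, 99.2, 100.8, 99.9, 100.4]
print(is_anomalous(daily_totals, 100.6))  # → False (normal value)
print(is_anomalous(daily_totals, 130.0))  # → True (suspicious scaling)
```

In practice you would run something like this on each key metric every hour and wire the result into your alerting system; the point is that it checks the *outputs* for plausibility rather than the code paths for correctness.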
I really enjoyed this study. I wish it weren’t so darn expensive, because I would love to see a dozen variations of this.
I still think I’m more productive with LLMs since Claude Code + Opus 4.0 (and I have reasonably strong data points for that), but this does push me further toward using LLMs only surgically rather than for everything, and toward recommending relatively restricted LLM use at my company.