Humans are very reliable agents

This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.


Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google’s self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn’t deployed at large scale, thirteen years later. Why are these fields getting such different results?

Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. For MNIST, an easy handwriting recognition task, performance tops out at around 99.9% even for top models; it’s not very practical to design for or measure higher reliability than that, because the test set is just 10,000 images and a handful are ambiguous. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models.

By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.99999999%. That’s around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn’t pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I’m sure future AI will get there, but each additional “nine” of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.)
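As a rough sketch of the arithmetic behind this comparison, the Python below reproduces the estimate. It assumes, as the text does, 100 equally risky decisions per mile, and it reads the ~99.997% figure as a failure rate of 3 in 100,000 per model output; the function and variable names are purely illustrative.

```python
import math

# Figures from the text; 100 decisions per mile is the post's own guess.
FATALITIES_PER_MILE = 1 / 100_000_000
DECISIONS_PER_MILE = 100

# ASSUMPTION: ~99.997% reliability is read as 3 failures per 100,000 outputs.
MODEL_FAILURE_RATE = 3e-5


def failure_rate_per_decision(fatalities_per_mile: float,
                              decisions_per_mile: float) -> float:
    """Chance that one decision is fatal, if every decision is equally risky."""
    return fatalities_per_mile / decisions_per_mile


human = failure_rate_per_decision(FATALITIES_PER_MILE, DECISIONS_PER_MILE)

# The gap between the two failure rates, in orders of magnitude.
gap = math.log10(MODEL_FAILURE_RATE / human)

print(f"human failure rate per decision: {human:.0e}")
print(f"model failure rate per output:   {MODEL_FAILURE_RATE:.0e}")
print(f"gap: about {gap:.1f} orders of magnitude")
```

Under these assumptions the gap lands between five and six orders of magnitude, in line with the "around five" quoted above. Changing the decisions-per-mile guess by a factor of ten shifts the gap by exactly one order of magnitude, so the conclusion is not very sensitive to it.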

(The numbers here are only rough Fermi estimates. I’m sure one could nitpick them by going into pre-pandemic vs. post-pandemic crash rates, laws in the US vs. other countries, what percentage of crashes are drunk drivers, do drunk drivers count, how often would a really bad decision be fatal, etc. But I’m confident that whichever way you do the math, you’ll still find that humans are many orders of magnitude more reliable.)

Other types of accidents are similarly rare. E.g., pre-pandemic, there were around 40 million commercial flights per year, but only a handful of fatal crashes. If each flight involves 100 chances for the pilot to crash the plane by screwing up, then that would get you a reliability rate around 1:1,000,000,000, or ~99.9999999%.

Even obviously dangerous activities can have very low critical failure rates. For example, shooting is a popular hobby in the US; the US market buys around 10 billion rounds of ammunition per year. There are around 500 accidental gun deaths per year, so shooting a gun has a reliability rate against accidental death of ~1:20,000,000, or 99.999995%. In a military context, the accidental death rate was around ten per year against ~1 billion rounds fired, for a reliability rate of ~1:100,000,000, or ~99.999999%. Deaths by fire are very rare compared to how often humans use candles, stoves, and so on; New York subway deaths are rare compared to several billion annual rides; out of hundreds of millions of hikers, only a tiny percentage fall off of cliffs; and so forth.
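Long strings of nines are easy to miscount, so it can help to express each estimate as a number of "nines", that is, the negative base-10 logarithm of the failure rate. The sketch below does this for the flight and shooting figures above. It assumes that "a handful" of fatal crashes means 4 per year (the value implied by the 1:1,000,000,000 ratio); the `nines` helper is a hypothetical name for illustration.

```python
import math


def nines(failures: float, opportunities: float) -> float:
    """Number of 'nines' of reliability: -log10 of the failure rate."""
    return -math.log10(failures / opportunities)


# (failures per year, opportunities per year), using figures from the text.
# ASSUMPTION: "a handful" of fatal crashes is taken to be 4 per year.
activities = {
    "commercial flights": (4, 40_000_000 * 100),  # 100 chances per flight
    "civilian shooting": (500, 10_000_000_000),   # deaths vs. rounds bought
    "military shooting": (10, 1_000_000_000),     # deaths vs. rounds fired
}

for name, (failures, opportunities) in activities.items():
    ratio = opportunities // failures
    print(f"{name:20s} 1:{ratio:,}  ~{nines(failures, opportunities):.1f} nines")
```

A failure rate of 1 in 10^k corresponds to k nines when the reliability is written as a decimal fraction, so nine nines means 0.999999999, or 99.9999999%.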

The 2016 AI Impacts survey asked hundreds of AI researchers when they thought AI would be capable of doing certain tasks, such as playing poker, proving theorems, and so on. Some tasks have been solved or have a solution “in sight”, but right now, we’re nowhere close to an AI that can replace human surgeons; robot-assisted surgeries still have manual control by human operators. Cosmetic surgeries on healthy patients have a fatality rate around 1:300,000, even before excluding unpredictable problems like blood clots. If a typical procedure involves two hundred chances to kill the patient by messing up, then an AI surgeon would need a reliability rate of at least 1:60,000,000, or ~99.999998%.

One concern with GPT-3 has been that it might accidentally be racist or offensive. Humans are, of course, sometimes racist or offensive, but in a tightly controlled Western professional context, it’s pretty rare. E.g., one McDonald’s employee was fired for yelling racial slurs at a customer. But McDonald’s serves 70 million people a day, ~1% of the world’s population. Assuming that 10% of such incidents get a news story and there’s about one story per year, a similar language model would need a reliability rate of around 1:2,500,000,000, or 99.99999996%, to match McDonald’s workers. When I did AI for the McDonald’s drive-thru, the language model wasn’t allowed to generate text at all. All spoken dialog had to be pre-approved and then manually engineered in. Reliability is hard!
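The McDonald’s estimate chains a few more steps together than the earlier ones, so here is a minimal sketch of it in Python. It uses only the figures from the paragraph above, plus one assumption that the text leaves implicit: each customer served counts as a single interaction that could go wrong.

```python
# Figures from the text.
CUSTOMERS_PER_DAY = 70_000_000
STORIES_PER_YEAR = 1
REPORTING_RATE = 0.10  # share of incidents that end up as a news story

# ASSUMPTION: one customer served equals one interaction.
interactions_per_year = CUSTOMERS_PER_DAY * 365

# If only 10% of incidents are reported, one story implies about ten incidents.
incidents_per_year = STORIES_PER_YEAR / REPORTING_RATE

interactions_per_incident = interactions_per_year / incidents_per_year
reliability = 1 - incidents_per_year / interactions_per_year

print(f"about 1 incident per {interactions_per_incident:,.0f} interactions")
print(f"reliability: {reliability:.8%}")
```

With these inputs the ratio comes out slightly above 1 in 2.5 billion, which the text rounds down. The reporting rate is the least certain input: if only 1% of incidents made the news, the required reliability would drop by one nine.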

On the one hand, this might seem slightly optimistic for AI alignment research, since commercial AI teams will have to get better worst-case bounds on AI behavior for immediate economic reasons. On the other hand, because so much of the risk of AI is concentrated into a small number of very bad outcomes, it seems like such engineering might get us AIs that appear safe, and almost always are safe, but will still cause catastrophic failure in conditions that weren’t anticipated. That seems bad.