Beware of black boxes in AI alignment research

Over the course of the AI Alignment Prize I’ve sent out lots of feedback emails. Some of the threads were really exciting and taught me a lot. But mostly it was me saying pretty much the same thing over and over with small variations. I’ve gotten pretty good at saying that thing, so it makes sense to post it here.

Working on AI alignment is like building a bridge across a river. Our building blocks are ideas from mathematics, physics, and computer science. We understand how they work and how they can be used to build things.

Meanwhile, on the far side of the river, we can glimpse other building blocks that we imagine we understand. Desire, empathy, comprehension, respect… Unfortunately we don’t know how these work inside, so from a distance they look like black boxes. They would be very useful for building the bridge, but to reach them we must build the bridge first, starting from our side of the river.

What about machine learning? Could the math of neural networks free us from the need to understand the building blocks we use? If we could create a behavioral imitation of that black box over there (like “human values”), then building the bridge would be easy! Unfortunately, ideas like adversarial examples, treacherous turns, or nonsentient uploads show that we shouldn’t bet our future on something that imitates a particular black box, even if the imitation passes many tests. We need to understand how the black box works inside, to make sure our version’s behavior is not only similar but also arises for the right reasons.
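To make the “passes many tests” worry concrete, here is a minimal toy sketch (my own illustration, not anything from the prize entries): we fit an imitation of a black-box function from samples, it agrees with the black box on every test point we check, and it still diverges badly in a region we never probed. The function, the polynomial model, and the numbers are arbitrary and only for illustration.

```python
# Toy illustration: an imitation that passes many tests and still diverges.
import numpy as np

def black_box(x):
    # Stand-in for something we don't understand from the inside.
    return np.abs(x)

# All of our observations happen to come from one region (x >= 0).
x_train = np.linspace(0.0, 3.0, 50)
imitation = np.poly1d(np.polyfit(x_train, black_box(x_train), deg=3))

# The imitation passes every test we draw from the same region...
x_test = np.linspace(0.0, 3.0, 17)
print("max test error:", np.max(np.abs(imitation(x_test) - black_box(x_test))))  # ~0

# ...but it copied the behavior, not the reason behind it, so outside the
# tested region it confidently does something else entirely.
print("black box at x = -2:", black_box(-2.0))   # 2.0
print("imitation at x = -2:", imitation(-2.0))   # about -2.0
```

Adversarial examples and treacherous turns are higher-dimensional, more adversarial versions of the same gap: matching behavior where we happened to look is not the same as sharing the underlying mechanism.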

(Eliezer made the same point more eloquently a decade ago in Artificial Mysterious Intelligence. Still, with the second round of our prize now open, I feel it’s worth saying again.)