The E-Coli Test for AI Alignment
Let’s say you have an idea in mind for how to align an AI with human values.
Go prep a slide with some e-coli, put it under a microscope, and zoom in until you can see four or five cells. Your mission: satisfy the values of those particular e-coli. In particular, walk through whatever method you have in mind for AI alignment. You get to play the role of the AI; with your sophisticated brain, massive computing power, and large-scale resources, hopefully you can satisfy the values of a few simple e-coli cells.
Perhaps you say “this is simple, they just want to maximize reproduction rate.” Ah, but that’s not quite right. That’s optimizing for the goals of the process of evolution, not optimizing for the goals of the godshatter itself. The e-coli has some frozen-in values which have evolved to approximate evolutionary fitness maximization in some environments; your job is optimize for the frozen-in approximation, even in new environments. After all, we don’t want a strong AI optimizing for the reproductive fitness of humans—we want it optimizing for humans’ own values.
On the other hand, perhaps you say “these cells don’t have any consistent values, they’re just executing a few simple hardcoded algorithms.” Well, you know what else doesn’t have consistent values? Humans. Better be able to deal with that somehow.
Perhaps you say “these cells are too simple, they can’t learn/reflect/etc.” Well, chances are humans will have the same issue once the computational burden gets large enough.
This is the problem of AI alignment: we need to both define and optimize for the values of things with limited computational resources and inconsistent values. To see the problem from the AI’s point of view, look through a microscope.