Aligned AI, The Scientist

The problem with constructing an aligned AI is that any active utility function or attempt at world optimization is likely to succumb to Goodhart’s law in one of its many forms, as discussed here and elsewhere by the good people of MIRI. I wonder whether a more passive approach is worth considering, or has perhaps been considered already.

Humanity is a part of the Universe, and building and organizing accurate knowledge about the Universe is what science is: not using that knowledge to advance specific interests or goals, such as technological advancement or personal gain, but pursuing it for its own sake. Such a scientifically minded agent would not be interested in modifying the Universe, and would limit any effects on it to the minimum needed to understand it. Part of this scientific research would be to understand humanity as deeply as possible, including what we humans imagine an aligned AI would look like, even though we do not fully understand this ourselves at this point.

Presumably, at some point such an AI would understand the Universe and the humans in it well enough to serve as a safe DWIM (do what I mean) genie. It would be inherently safe because doing, or agreeing to do, anything unsafe would mean that the genie does not understand the part of the Universe that is humanity. After all, we would not want to do anything that has unsafe and unintended consequences. “Unsafe” includes doing nothing at all: an AI that prevented humans from doing anything would not understand humans, and so would not understand the Universe. In other words:

Aligned AI is AI the scientist, not AI the engineer.

This is, of course, easier said than done. Learning all about the world while actively minimizing one's impact on it is something we humans often strive for when trying to understand the Earth's ecosystems, with mixed results. Still, sometimes we succeed, and, odds are, so could an agent smarter than us.