The steering problem

Most work on AI safety starts with a broad, vague problem (“How can we make an AI do good things?”) and relatively quickly moves to a narrow, precise problem (e.g. “What kind of reasoning process trusts itself?”).

Precision facilitates progress, and many serious thinkers are skeptical of imprecision. But in the move from a broad problem to a narrow one, we do much of the work (and have much of the opportunity for error).

I am interested in more precise discussion of the big-picture problem of AI control. Such discussion could improve our understanding of AI control, help us choose the right narrow questions, and be a better starting point for engaging other researchers. To that end, consider the following problem:

The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
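To make the “black-box access” framing concrete, here is a minimal sketch in Python. It is not from the linked document, and the names (`HumanLevelBlackBox`, `useful_agent`) are hypothetical, introduced only to illustrate the setup: we can query the box but not inspect or modify its internals, and the question is whether some program built on top of it can be as useful as a well-motivated human with the same abilities.

```python
from typing import Protocol


class HumanLevelBlackBox(Protocol):
    """Black-box access to human-level cognitive abilities:
    we can ask it questions, but we cannot look inside or change how it works."""

    def answer(self, question: str) -> str: ...


def useful_agent(box: HumanLevelBlackBox, task: str) -> str:
    """The steering problem asks whether a program like this can exist:
    using only calls to `box`, act on `task` as usefully as a
    well-motivated human with those abilities would.

    The trivial delegation below is NOT a solution -- the box need not be
    well-motivated, so simply doing whatever it says could go badly.
    """
    return box.answer(f"What should be done to accomplish: {task}?")
```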

I recently wrote this document, which defines this problem much more precisely (in section 2) and considers a few possible approaches (in section 4). As usual, I appreciate thoughts and criticism. I apologize for the proliferation of nomenclature, but I couldn’t get by without a new name.


I think the steering problem captures a large part of what most people think of as “the AI safety problem.” It certainly does not capture the entire problem; in particular, we might well introduce undesired goal-directed behavior in the process of implementing human-level capabilities (either inadvertently or because it’s the easiest way to produce human-level abilities).

Since I’ve started thinking more explicitly about the steering problem, I’ve reduced my estimate of its difficulty. This leads me to be more optimistic about AI safety, but also to suspect that the steering problem is a smaller share of the whole problem than I’d originally thought. It would be great to see a more precise statement of the rest of the problem (which would probably subsume the steering problem). I’m afraid that the rest of the problem is more closely tied to the particular techniques used to produce AI, so that we probably can’t state it precisely without making some additional assumptions.

I have recently been thinking about the situation for deep learning: under the (very improbable) assumption that various deep learning architectures could yield human-level AI, could they also yield a system that is as useful as a well-motivated human? I’m optimistic that we can find at least one natural assumption for which the answer is “yes,” which I would consider significant further progress.
