Assuming we’ve solved X, could we do Y...

The year is 1933. Leó Szilárd has just hypothesised the nuclear chain reaction. Worried researchers from proto-MIRI or proto-FHI ask themselves: “assuming we’ve solved the issue of nuclear chain reactions in practice, could we build a nuclear bomb out of it?”

Well, what do we mean by “assuming we’ve solved the issue of nuclear chain reactions”? Does it mean that “we have some detailed plans for viable nuclear bombs, including all the calculations needed to make them work, and everything in the plans is doable by a rich industrial state”? In that case, the answer to “could we build a nuclear bomb out of it?” is a simple and trivial yes.

Alternatively, are we simply assuming “there exists a collection of matter that supports a chain reaction”? In which case, note that the assumption is (almost) completely useless. In order to figure out whether a nuclear bomb is buildable, we still need to figure out all the details of chain reactions—that assumption has bought us nothing.

Assuming human values...

At the recent AI safety unconference, David Krueger wanted to test empirically whether debate methods could be used to create aligned AIs. At some point in the discussion, he said “let’s assume the question of defining human values is solved”, wanting to move on to whether a debate-based AI could then safely implement those values.

But, as above, when we assume that an underdefined definition problem (human values) is solved, we have to be very careful about what we mean: the assumption might be useless, or it might be so strong that it solves the implementation problem entirely.

In the conversation with David, we were imagining a definition of human values related to what humans would answer if we could reflectively ponder specific questions for thousands of years. One could object to that definition on the grounds that people can be coerced or tricked into giving the answers that the AI might want; hence the circumstances of that pondering are critical.

If we assume X=”human values are defined in this way”, could an AI safely implement X via debate methods? Well, what about coercion and trickery by the AI during the debate process? It could be that X doesn’t help at all, because we still have to resolve all of the same issues.

Or, conversely, X might be too strong: it might define what trickery is, which solves a lot of the implementation problem for free. Or, in the extreme case, maybe X is expressed in computer code, and already resolves all the contradictions within humans, deals with ontology issues, population changes, what counts as an agent, and all the other subtleties. Then the question “given X, could an AI safely implement it?” reduces to “can the AI run code?”

In summary, when the issue is underdefined, the boundary between definition and implementation is very unclear, and so is what it means to assume that one of them is solved.

How to assume (for the good of all of us)

The obvious way around this issue is to be careful and precise about what we’re assuming. So, for example, we might assume “we have an algorithm A that, if run for a decade, would compute what humans would decide after a thousand years of debate”. Then we have two practical and well-defined subproblems to work on: can we approximate the output of A within a reasonable time, and is “what humans would decide after a thousand years of debate” a good definition of human values?
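To make that split concrete, here is a minimal Python sketch, under heavy assumptions: ideal_deliberation, fast_approximation and is_plausible_value_definition are hypothetical placeholders introduced purely for illustration, not real algorithms. The only point is that the two subproblems become separate, nameable things to work on.

```python
# Hypothetical sketch only: splitting "assume human values are solved"
# into two well-defined subproblems. All names are illustrative stand-ins.
from typing import List


def ideal_deliberation(question: str) -> str:
    """Stand-in for algorithm A: what humans would decide after a
    thousand years of debate. Assumed to exist, far too slow to run."""
    raise NotImplementedError("assumed to take a decade to compute")


def fast_approximation(question: str) -> str:
    """Subproblem 1: approximate the output of A in reasonable time.
    Here it is just a placeholder, not a real approximation scheme."""
    return f"best current guess for: {question}"


def is_plausible_value_definition(sample_answers: List[str]) -> bool:
    """Subproblem 2: is 'what humans would decide after a thousand years
    of debate' a good definition of human values? That cannot be settled
    by code; the trivial check below just marks it as an open question."""
    return all(isinstance(answer, str) for answer in sample_answers)


if __name__ == "__main__":
    print(fast_approximation("Should the AI ever deceive its users?"))
```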

Another option, when we lack a full definition, is to focus on some properties of that definition that we feel are certain or likely. For example, we can assume that “the total extinction of all intelligent beings throughout the cosmos” is not a desirable outcome according to most human values, and ask whether debate methods would lead to that outcome. Or, at a smaller scale, we might assume that telling us informative truths is compatible with our values, and check whether the debate AI would do that.
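As a toy illustration of that property-based approach (the constraint names and string matching below are hypothetical placeholders, far cruder than anything usable in practice), partial constraints can be checked against a debate AI’s proposed outcome without ever writing down a full definition of human values:

```python
# Hypothetical sketch only: checking a debate AI's proposed outcome
# against partial value constraints. The string matching is a toy;
# real checks would need far richer models of outcomes and values.
from typing import Callable, Dict, List

# Partial properties we are fairly confident most human values share.
# Each maps a described outcome to True if the constraint is satisfied.
PARTIAL_CONSTRAINTS: Dict[str, Callable[[str], bool]] = {
    "no_total_extinction":
        lambda outcome: "extinction of all intelligent beings" not in outcome,
    "allows_informative_truths":
        lambda outcome: "suppresses all informative truths" not in outcome,
}


def violated_constraints(proposed_outcome: str) -> List[str]:
    """Return the names of the partial constraints the outcome violates."""
    return [name for name, satisfied in PARTIAL_CONSTRAINTS.items()
            if not satisfied(proposed_outcome)]


if __name__ == "__main__":
    outcome = "the winning debater argues for the extinction of all intelligent beings"
    print(violated_constraints(outcome))  # expected: ['no_total_extinction']
```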