Assuming we’ve solved X, could we do Y...

The year is 1933. Leó Szilárd has just hypothesised the nuclear chain reaction. Worried researchers from proto-MIRI or proto-FHI ask themselves: “assuming we’ve solved the issue of nuclear chain reactions in practice, could we build a nuclear bomb out of it?”

Well, what do we mean by “assuming we’ve solved the issue of nuclear chain reactions”? Does it mean that “we have some detailed plans for viable nuclear bombs, including all the calculations needed to make them work, and everything in the plans is doable by a rich industrial state”? In that case, the answer to “could we build a nuclear bomb out of it?” is a simple and trivial yes.

Alternatively, are we simply assuming “there exists a collection of matter that supports a chain reaction”? In which case, note that the assumption is (almost) completely useless. In order to figure out whether a nuclear bomb is buildable, we still need to work out all the details of chain reactions; that assumption has bought us nothing.

Assuming human values...

At the recent AI safety unconference, David Krueger wanted to test, empirically, whether debate methods could be used for creating aligned AIs. At some point in the discussion, he said “let’s assume the question of defining human values is solved”, wanting to move on to whether a debate-based AI could then safely implement it.

But, as above, when we assume that an underdefined definition problem (human values) is solved, we have to be very careful about what we mean: the assumption might be useless, or it might be too strong and end up solving the implementation problem entirely.

In the conversation with David, we were imagining a definition of human values related to what humans would answer if we could reflectively ponder specific questions for thousands of years. One could object to that definition on the grounds that people can be coerced or tricked into giving the answers that the AI might want; hence the circumstances of that pondering are critical.

If we assume X = “human values are defined in this way”, could an AI safely implement X via debate methods? Well, what about coercion and trickery by the AI during the debate process? It could be that X doesn’t help at all, because we still have to resolve all of the same issues.

Or, conversely, X might be too strong: it might define what trickery is, which solves a lot of the implementation problem for free. Or, in the extreme case, maybe X is expressed in computer code, and resolves all the contradictions within humans, deals with ontology issues, population changes, what an agent is, and all the other subtleties. Then the question “given X, could an AI safely implement it?” reduces to “can the AI run code?”

In summary, when the issue is underdefined, the boundary between definition and implementation is blurry, and so is the meaning of assuming that one of them is solved.

How to assume (for the good of all of us)

The obvious way around this issue is to be careful and precise about what we’re assuming. So, for example, we might assume “we have an algorithm A that, if run for a decade, would compute what humans would decide after a thousand years of debate”. Then we have two practical and well-defined subproblems to work on: can we approximate the output of A within a reasonable time, and is “what humans would decide after a thousand years of debate” a good definition of human values?
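To make that split concrete, here is a minimal, purely illustrative Python sketch. The names `deliberation_oracle`, `debate_approximation`, and `spot_check` are hypothetical stand-ins for algorithm A, a fast debate-based process meant to match it, and the approximation subproblem; nothing here is an actual proposal.

```python
# Purely illustrative sketch: all names and interfaces are hypothetical.

from typing import Callable

def deliberation_oracle(question: str) -> str:
    """Stand-in for algorithm A: what humans would decide after a
    thousand years of debate. Assumed to be *defined*, but far too slow
    to run routinely; here it just returns a placeholder string."""
    return f"considered answer to: {question}"

def debate_approximation(question: str) -> str:
    """Stand-in for the implementation side: a fast process (e.g. an AI
    debate) intended to reproduce the oracle's answers cheaply."""
    return f"considered answer to: {question}"

def spot_check(question: str,
               oracle: Callable[[str], str],
               approx: Callable[[str], str]) -> bool:
    """Subproblem 1: does the approximation agree with the definition?
    (Subproblem 2 -- whether the oracle is a *good* definition of human
    values -- lives entirely outside this code.)"""
    return oracle(question) == approx(question)

if __name__ == "__main__":
    print(spot_check("Should we do Y?", deliberation_oracle, debate_approximation))
```

The point of the sketch is only that the assumption is now crisp: the oracle fixes the definition side, and everything left to argue about is whether the cheap process tracks it.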

Another option, when we lack a full definition, is to focus on some of the properties of that definition that we feel are certain or likely. For example, we can assume that “the total extinction of all intelligent beings throughout the cosmos” is not a desirable outcome according to most human values, and argue about whether debate methods will lead to that outcome. Or, at a smaller scale, we might assume that telling us informative truths is compatible with our values, and check whether the debate AI would do that.
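As a hedged illustration of this second option, such partial properties could be written as predicates over predicted outcomes and checked one by one. The property names and the toy `Outcome` representation below are invented for the example, not part of any existing proposal.

```python
# Purely illustrative sketch: property names and the toy Outcome type
# are invented for this example.

from typing import Callable, Dict, List

Outcome = Dict[str, bool]  # toy representation of a predicted outcome

def no_total_extinction(outcome: Outcome) -> bool:
    """Assumed property: the extinction of all intelligent beings is
    not desirable under (almost) any human values."""
    return not outcome.get("extinction_of_all_intelligent_beings", False)

def tells_informative_truths(outcome: Outcome) -> bool:
    """Assumed property: telling us informative truths is compatible
    with our values."""
    return outcome.get("tells_informative_truths", False)

ASSUMED_PROPERTIES: List[Callable[[Outcome], bool]] = [
    no_total_extinction,
    tells_informative_truths,
]

def violated_properties(predicted_outcome: Outcome) -> List[str]:
    """List the assumed properties that a predicted outcome of a debate
    method would violate, without needing a full definition of values."""
    return [p.__name__ for p in ASSUMED_PROPERTIES if not p(predicted_outcome)]

if __name__ == "__main__":
    outcome: Outcome = {"extinction_of_all_intelligent_beings": False,
                        "tells_informative_truths": True}
    print(violated_properties(outcome))  # prints [] -- no violations
```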