Research Agenda in reverse: what *would* a solution look like?

I constructed my AI alignment research agenda piece by piece, stumbling around in the dark and going down many false and true avenues.

But now it increasingly feels natural to me, and indeed somewhat inevitable.

What do I mean by that? Well, let's look at the problem in reverse. Suppose we had an AI that was aligned with human values/preferences. How would you expect that to have been developed? I see four natural paths:

  1. Effective proxy methods. For example, Paul's amplification and distillation, or variants of revealed preferences, or a similar approach. The point of this is that it reaches alignment without defining what a preference fundamentally is; instead it uses some proxy for the preference to do the job.

  2. Corrigibility: the AI is safe and corrigible, and, along with active human guidance, manages to reach a tolerable outcome.

  3. Something new: a bold new method that works, for reasons we haven't thought of today (this includes most strains of moral realism).

  4. An actual grounded definition of human preferences.

So, if we focus on scenario 4, we need a few things. We need a fundamental definition of what a human preference is (since we know this can't be defined purely from behaviour). We need a method of combining contradictory and underdefined human preferences. We also need a method for taking into account human meta-preferences. And both these methods have to actually reach an output, and not get caught in loops.
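To make the shape of those requirements concrete, here is a minimal toy sketch in Python. It is purely my own illustration, not anything from the agenda: the `Preference` type, the `synthesise` and `apply_meta_preferences` functions, and the `drop_noise` meta-preference are all invented placeholders. What it does show is the overall pipeline, and in particular the last requirement: a fixed-point check plus a hard cap on meta-preference rounds guarantees the process reaches an output instead of looping.

```python
from dataclasses import dataclass
from typing import Callable

# Toy stand-in: in the real problem, grounding what a preference *is*
# is the hard, unsolved part. Here it is just a weighted judgement.
@dataclass
class Preference:
    option: str      # what is preferred
    weight: float    # how strongly (may conflict with other preferences)

def synthesise(prefs: list[Preference]) -> dict[str, float]:
    """Combine contradictory/underdefined preferences by summing weights.
    (A placeholder rule; the agenda leaves the real rule open.)"""
    scores: dict[str, float] = {}
    for p in prefs:
        scores[p.option] = scores.get(p.option, 0.0) + p.weight
    return scores

def apply_meta_preferences(
    prefs: list[Preference],
    meta: Callable[[list[Preference]], list[Preference]],
    max_rounds: int = 10,
) -> list[Preference]:
    """Let a meta-preference rewrite the object-level preferences, with
    a cap on rounds so the process always reaches an output (no loops)."""
    for _ in range(max_rounds):
        new_prefs = meta(prefs)
        if new_prefs == prefs:   # reached a fixed point
            break
        prefs = new_prefs
    return prefs

# Example meta-preference: discount weak, noisy preferences.
drop_noise = lambda ps: [p for p in ps if abs(p.weight) >= 0.1]

prefs = [Preference("tea", 1.0), Preference("coffee", 0.8),
         Preference("tea", -0.05)]           # contradictory data
final = synthesise(apply_meta_preferences(prefs, drop_noise))
print(max(final, key=final.get))             # -> "tea"
```

The substance of the agenda is in filling in those placeholders; the point here is only that any solution of type 4 seems to need components occupying these slots.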

If those are the requirements, then it's obvious why we need most of the elements of my research agenda, or something similar. We don't need the exact methods sketched out there; there may be other ways of synthesising preferences and meta-preferences together. But the overall structure (a way of defining preferences, and ways of combining them that produce an output) seems, in retrospect, inevitable. The rest is, to some extent, just implementation details.