Models of preferences in distant situations

Note: working on a research agenda, hence the large number of small individual posts, to have things to link to in the main documents.

For X, consider three different partial preferences:

  1. If X were poor, they would prioritise consumption over saving.

  2. X: If I were poor, I would prioritise saving over consumption.

  3. X: If I were poor, I'd get my personal accountant to advise me on the best saving/consumption plan for poor people.

1 is what X's judgement would be in a different, distant situation. 2 is X's current judgement about what their judgement would be in that situation. 3 is similar, but is based on a factually wrong model of what that distant situation is.

So what are we to make of these in terms of X's preferences? 3 can be discounted as factually incorrect. 2 is a correct interpretation of X's current (meta-)preferences over that distant situation, but we know that these will change if they actually reach that situation. It might be tempting to see 1 as the genuine preference, but that's tricky. It's a preference that X doesn't have, and may never have. Even if X were certain to end up poor, their preference may depend on the path that they took to get there: medical bankruptcy, alcoholism, or one dubious investment could each result in different preferences. And that's without considering the different ways the AI could put X in that situation; we don't want the AI to influence its own learning process by indirectly determining the preferences it will maximise.

So, essentially, using 1 is a problem because the preference is many steps removed and can be influenced by the AI (though that last issue may have solutions). Using 2 is a problem because the current (meta-)preferences are projected into a situation where they would be wrong. This can end up with someone railing against the preferences of their past self, even as those preferences now constrain them. This is, in essence, a partial version of the Gödel-like problem mentioned here, where the human rebels against the preferences the AI has determined them to have.

So, what is the best way of figuring out X's "true" preferences? This is one of the things that we expect the system to be robust to: whether type 1 or type 2 preferences are prioritised, the synthesis should still reach an acceptable outcome. And the rebellion against the synthesised values is a general problem with these methods, and should be solved in some way or another, possibly by the human agreeing to freeze their preferences under the partial guidance of the AI.
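As a toy illustration of that robustness check (not the actual synthesis procedure; the names, weights, and the "acceptable band" are all invented for the example), here is a minimal Python sketch that discards the factually incorrect type 3 preference and checks that the blend of type 1 and type 2 stays acceptable whichever is prioritised:

```python
from dataclasses import dataclass

# Toy encoding of the three kinds of partial preference discussed above.
# "weight_on_saving" stands in for the content of the preference:
# 0.0 = pure consumption, 1.0 = pure saving. All numbers are invented.

@dataclass
class PartialPreference:
    kind: str                 # "in_situation" (type 1) or "current_meta" (types 2 and 3)
    weight_on_saving: float   # what the preference actually recommends
    factually_correct: bool   # is its model of the distant situation right?

prefs = [
    PartialPreference("in_situation", 0.2, True),   # type 1: consumption over saving
    PartialPreference("current_meta", 0.8, True),   # type 2: saving over consumption
    PartialPreference("current_meta", 0.5, False),  # type 3: assumes a personal accountant
]

def synthesise(prefs, priority_on_in_situation):
    """Blend type 1 and type 2 preferences, discarding factually incorrect ones.

    priority_on_in_situation in [0, 1] is how much weight goes to what X would
    prefer in the situation (type 1) versus X's current meta-preference (type 2).
    """
    usable = [p for p in prefs if p.factually_correct]  # drop type 3
    in_situ = [p.weight_on_saving for p in usable if p.kind == "in_situation"]
    meta = [p.weight_on_saving for p in usable if p.kind == "current_meta"]
    avg = lambda xs: sum(xs) / len(xs)
    return (priority_on_in_situation * avg(in_situ)
            + (1 - priority_on_in_situation) * avg(meta))

# Robustness check: whichever type is prioritised, the synthesis should stay
# inside some acceptable band rather than collapsing to either extreme.
for priority in (0.8, 0.2):
    outcome = synthesise(prefs, priority)
    assert 0.1 < outcome < 0.9, "synthesis fell outside the acceptable band"
    print(f"priority on type 1 = {priority}: synthesised saving weight = {outcome:.2f}")
```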

Avoid ambiguous distant situations

If the synthesis of X's preferences in situation S is ambiguous, that might be an argument to avoid situation S entirely. For example, suppose S involves very lossy uploads of current humans, so that the uploads seem pretty similar to the original human but not identical. Rather than sorting out whether or not human preferences apply here, it might be best to reason "there is a chance that human flourishing has been lost entirely here, so we shouldn't pay too much attention to what human preferences actually are in S, and just avoid S entirely".

Note that this means avoiding morally ambiguous distant situations, not distant situations per se. Worlds with voluntary human slaves may be worth avoiding, while worlds with spaceships, uploads, but same-as-now morality are basically just "today's world, with lasers!" and are not morally ambiguous.
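To sketch how that ambiguity-avoidance might be operationalised (purely illustratively: the situations, candidate values, threshold, and penalty below are all made up), one could score each situation by the mean value that the candidate interpretations of X's preferences assign to it, and subtract a large penalty whenever those interpretations disagree too much:

```python
import statistics

# Values that different candidate interpretations of X's preferences assign
# to each situation (all numbers invented for illustration).
candidate_values = {
    "todays_world_with_lasers": [0.70, 0.75, 0.72],  # interpretations roughly agree
    "very_lossy_uploads":       [0.90, 0.10, 0.50],  # interpretations disagree badly
}

AMBIGUITY_THRESHOLD = 0.2  # arbitrary: how much disagreement is tolerated
AMBIGUITY_PENALTY = 1.0    # arbitrary: how hard to steer away from ambiguous S

def planning_value(situation):
    """Mean candidate value, minus a large penalty when the candidate
    interpretations disagree too much (i.e. S is morally ambiguous)."""
    values = candidate_values[situation]
    score = statistics.mean(values)
    if statistics.pstdev(values) > AMBIGUITY_THRESHOLD:
        score -= AMBIGUITY_PENALTY  # avoid S rather than adjudicate the preferences in it
    return score

for s in candidate_values:
    print(s, round(planning_value(s), 3))
```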
