# One-step hypothetical preferences

Human preferences are time-inconsistent, and also contradictory.

That, by itself, is not a huge problem, but it’s also the case that few human preferences are present at any given moment. At the moment, I’m focused on finding the best explanation to get my ideas through to you, the reader; I’m not focused on my moral preferences, personal safety desires, political beliefs, or taste in music.

If anyone asked me about those, I could immediately bring them to mind. My answers to standard questions are kinda in the background, accessible but not accessed. Wei Dai made a similar point about translators: they have a lot of trained knowledge that is not immediately accessible to their introspection. And only by giving them the inputs they were trained on (eg words, sentences,...) can you bring that knowledge to the fore.

In this post, I’ll try and formalise these accessible preferences, starting with formalising preferences in general.

## Basic preferences setup

This section will formalise the setup presented in Alice’s example. Let W be the set of all possible worlds. A human makes use of a model M. This model contains a lot of variables v_i, called properties. These take values in a domain D_i.

A basic set of states in M is a set of possible values for some of the v_i. Thus s = S_1 × S_2 × … × S_n, with S_i ⊆ D_i. The property v_i is unconstrained in s if S_i = D_i. A general set of states is a union of basic s; let S be the collection of all these sets of states.
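The construction above can be sketched in code. This is a toy illustration only: the property names and domains (the D_i) are invented for the example, not taken from the post.

```python
# A minimal sketch of basic and general sets of states.
# Property names and domains are invented for illustration.

DOMAINS = {
    "sleeping_XY": {True, False},
    "belief_A_XY": {"together", "not_together", "unknown"},
}

def basic_set(**constraints):
    """A basic set of states: allowed values S_i for each property.
    Unmentioned properties are unconstrained (S_i = D_i)."""
    s = {p: set(dom) for p, dom in DOMAINS.items()}
    for p, allowed in constraints.items():
        s[p] = set(allowed) & DOMAINS[p]
    return s

def contains(basic, assignment):
    """Is a complete assignment of property values inside the basic set?"""
    return all(assignment[p] in allowed for p, allowed in basic.items())

def in_union(basics, assignment):
    """A general set of states is a union of basic sets."""
    return any(contains(b, assignment) for b in basics)
```

Here a basic set is a dictionary of allowed values per property, membership is checked componentwise, and a general set is simply a list of basic sets tested in turn.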

For example, a human could be imagining four of their friends, and the v_ij could be whether friend i is sleeping with friend j (6 different Boolean v_ij), and also whether a third friend believes two others are sleeping together (12 different w^k_ij, taking values in {sleeping together, not sleeping together, don’t know}).

Then a statement of human gossip like “X is sleeping with Y, but A doesn’t realise it; in fact, A thinks that Y is sleeping with Z, which is totally not true!” is encoded as:

• v_XY = true, w^A_XY ≠ sleeping together, w^A_YZ = sleeping together, v_YZ = false, with the other v’s and w’s unconstrained.

It’s interesting how unintuitive that formulation is, compared with how our brains instinctively parse gossip.
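To make the encoding concrete, the gossip statement can be written out as a constraint dictionary. The property names here are invented stand-ins for the v’s and w’s, not notation from the post.

```python
# The gossip statement "X is sleeping with Y, but A doesn't realise it;
# A thinks Y is sleeping with Z, which is totally not true!", written
# as constraints on invented property names. Every property not listed
# is left unconstrained.

BELIEFS = {"together", "not_together", "unknown"}

gossip = {
    "sleeping_XY": {True},                  # X is sleeping with Y...
    "belief_A_XY": BELIEFS - {"together"},  # ...but A doesn't realise it
    "belief_A_YZ": {"together"},            # A thinks Y sleeps with Z...
    "sleeping_YZ": {False},                 # ...which is totally not true
}
```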

To make use of these, these symbols need to be grounded. This is achieved via a function G that takes a set of states and maps it to a set of worlds: G: S → 2^W.

Finally, the human expresses a judgement about the states of M, mentally categorising a set of states as better than another. This is an anti-symmetric partial function J: S × S → ℝ (anti-symmetric meaning J(s, s′) = −J(s′, s)), a partial function that is non-trivial on at least one pair of inputs.

For example, if s is the gossip set above, and s′ is the same statement with v_YZ = true, then a human that values honesty might judge J(s, s′) < 0; ie it is worse if A believes a lie about Y and Z.

The sign of J informs which set the human prefers; the magnitude |J| is the difficult-to-define weight or intensity of the preference.
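One way to picture such a judgement is as a lookup table on ordered pairs in which anti-symmetry holds by construction. A minimal sketch, with plain string labels standing in for sets of states:

```python
# A judgement J as an anti-symmetric partial function: storing
# J(a, b) automatically fixes J(b, a) = -J(a, b). Labels are
# illustrative stand-ins for sets of states.

class Judgement:
    def __init__(self):
        self._j = {}

    def set(self, a, b, value):
        self._j[(a, b)] = value
        self._j[(b, a)] = -value   # anti-symmetry by construction

    def get(self, a, b):
        return self._j.get((a, b))  # None where J is undefined (partial)

    def prefers(self, a, b):
        v = self.get(a, b)
        if v is None:
            return None             # no judgement on this pair
        return a if v > 0 else b    # the sign picks the preferred set
```

For the honesty example, `j.set("A_believes_lie", "A_believes_truth", -1)` records that the truthful situation is preferred, with the magnitude 1 as the (hard-to-define) intensity.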

## Hy­po­thet­i­cals posed to the human

Let P be the set of possible pairs (M, J) defined in the previous section. Humans rarely consider many (M, J) at the same time. We often only consider one, or zero.

A hypothetical is some possible short intervention (a friend asks them a question, they get an email, a TV in the background shows something salient) that will cause a human to mentally use a model M and pass judgement J within it. Note that this is not the same as Paul Christiano’s definition of ascription: we don’t actually need the human to answer anything, just to think.

So if I_t is the set of possible hypothetical interventions at time t, we have a (counterfactual) map h from I_t to P.

Now, not all moments are ideal for a human to do much reflection (though a lot of instinctive reactions are also very informative). So it might be good to expand the time a bit, to, say, a week, and consider all the models that a human could hypothetically be made to consider in that time.

So let I_{t,t′} be the set of hypothetical short interventions from time t to t′, given that this intervention is the first in that time period. Then there is a natural map

• h_{t,t′}: I_{t,t′} → P.

## Idealised object

The map h_{t,t′} is a highly idealised and counterfactual object: there is no way we can actually test a human on the vast number of possible interventions. So the AI would not be tasked with “use h_{t,t′} to establish human preferences”, but “estimate h_{t,t′} to estimate human preferences”.

The h_{t,t′} will also reveal a lot of contradictions, since humans often have different opinions on the same subject, depending on how the intervention or question is phrased. Different phrasings may trigger different internal models of the same issue, or even different judgements within the same model. And, of course, the same intervention at different times (or by different agents) may trigger different reactions.
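A rough sketch of what estimating this map and surfacing its contradictions might look like, with entirely invented sample data: each sampled intervention yields a judgement sign on some pair of state-sets, and a pair that receives both signs under different phrasings is flagged as contradictory.

```python
# Group sampled judgement signs by the pair of state-sets they
# concern, and flag pairs that received contradictory signs under
# different interventions. All data is invented for illustration.
from collections import defaultdict

def summarise(h_samples):
    """h_samples: list of (intervention, pair, sign) triples.
    Returns (consistently judged pairs, contradictory pairs)."""
    signs = defaultdict(set)
    for _intervention, pair, sign in h_samples:
        signs[pair].add(sign)
    contradictory = {p for p, s in signs.items() if {+1, -1} <= s}
    consistent = set(signs) - contradictory
    return consistent, contradictory
```

This only detects sign-level contradictions on identical pairs; contradictions across different models of the same issue would need more structure.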

But dealing with contradictions is just one of the things that we have to sort out with human preferences.

## Minimum modification

I mentioned that the interventions should be short; that the period from t to t′ should be short; and that the interventions in I_{t,t′} should be the first in that time period. The whole idea is to avoid “modifying” the human too much, or giving the AI too much power to change, rather than reflect, the human’s values. The human’s reaction should be as close as possible to an unvarnished initial reaction.

There may be other ways of reducing the AI’s influence, but it is still useful to get these initial reactions.

## One-step hypotheticals

In slight contrast with the previous section, it is very valuable to get the human to reflect on new issues they hadn’t considered before. For example, we could introduce them to philosophical thought experiments they hadn’t seen before (maybe the trolley problem or the repugnant conclusion, or unusual variants of these), or present ideas that cut across their usual political boundaries, or the boundaries of their categories (eg whether Neanderthals should have human rights if a tribe of them were suddenly discovered today).

This is, in a sense, a minimum extrapolation, the very first tentative step of CEV. We are not asking what the human would think if they were smarter, but instead what they would think if they encountered a novel problem for the first time.

These “one-step hypotheticals” are thus different from the human’s everyday current judgement, yet don’t involve transforming the human into something else.

EDIT: Avturchin asks whether I expect these one-step hypotheticals to reveal hidden preferences, or to force humans to make a choice, knowing that they might have made a different choice in different circumstances.

The answer is… a bit of both. I expect the hypotheticals to sometimes contradict each other, depending on the phrasing and the timing. I expect them to contradict each other more than more usual questions (“zero-step hypotheticals”) do.

But I don’t expect the answers to be completely random, either. There will be a lot of information there. And the pattern of different interventions leading to different or contradictory judgements (M, J) is relevant, and not random.

• Finally, the human expresses a judgement about the states of M, mentally categorising a set of states as better than another. This is an anti-symmetric partial function J: S×S→ℝ, a partial function that is non-trivial on at least one pair of inputs.

I continue to be unsure if we can even claim anti-symmetry of the preference relation. For example, let s_a be the state “I eat an apple” and s_o the state “I eat an orange”, with J(s_a, s_o) = 1 today but J(s_a, s_o) = −1 tomorrow, seemingly violating anti-symmetry. Now, of course, maybe I misunderstood my own understanding of s_a and s_o, and they actually included a hidden-to-my-awareness property conditioning them on time or something else, such that anti-symmetry is not violated. But the fact that there may be some property on the states that I didn’t think about at first, and that salvages anti-symmetry, makes me worry that this model is confused in this and other ways: it was so easy to think of and construct something that seemingly violated the property, but then on further reflection it seems like it doesn’t.

That’s not a slam-dunk argument against this formalization. This is more me sharing some thoughts on my reservations about using this type of model. If we can so easily fail to notice something relevant about how we formalize some simple preferences, what else may we be failing to notice? And if so, what happens if we build an AI based in part on this formalization? Will it also fail to account for relevant aspects of how human preferences are calculated because they are not easily visible to us in the model, or is that a failure of humans to understand themselves rather than the model? These are the things I’m wrestling with lately.

I also have some reservations about whether we can even really model humans as having discrete preferences that we can reason about in this way without getting ourselves into trouble and confused. Not to say that I doubt that this model often works, only that I worry that it’s missing some important details that are relevant for alignment, and that without accounting for them we will fail to produce aligned AI. I worry about this because there doesn’t seem to be anything in the human mind that actually is a preference; preferences are more like reifications of a pattern of action that appears in humans. Getting closer to understanding the mechanism that produces the pattern we interpret as preferences seems valuable to me in this work, because I worry we’re missing crucial details when we reason about preferences at the level of detail you pursue here.

• I see the orange-apple preference reversal as another example of conditional preferences.

• I agree that viewing preferences as conditioned on the environment, up to and including the entire history of the observable universe, is a sensible improvement over many more simplistic models: it eliminates many clear violations of preference normativity. My concern is that this is not so obvious as to be the normal way of thinking about preferences in all fields, and was non-obvious enough that you had to write a post about the point; this makes me cautious about updating towards thinking the current value abstraction you use is sufficient for the purposes of AI alignment. I basically view conditionality of preferences as neutral evidence about the explanatory power of the theory (for the purpose of AI alignment).

• Valid point, though conditional meta-preferences are things I’ve already written about, and the issue of being wrong now about what your own preferences would be in the future is also something I’ve addressed multiple times in different forms. Your example is particularly crisp, though.

• Do these “one-step hypotheticals” reveal hidden preferences, or force a human to make a choice, to which she will later stick to preserve her consistency? For example, I could make a random answer to a question about Neanderthal tribe rights, but later rationalise why it should be true. I think I have heard of some psychological research which demonstrated such behaviour.

• Added an addendum to the post to address this issue. The “later rationalise” is not really relevant here, because we’re not thinking of actually doing all these hypothetical interventions.