One-step hypothetical preferences

Human preferences are time-inconsistent, and also contradictory.

That, by itself, is not a huge problem, but it’s also the case that few human preferences are present at any given moment. Right now, I’m focused on finding the best explanation to get my ideas through to you, the reader; I’m not focused on my moral preferences, personal safety desires, political beliefs, or taste in music.

If anyone asked me about those, I could immediately bring them to mind. My answers to standard questions are kinda in the background, accessible but not accessed. Wei Dai made a similar point about translators: they have a lot of trained knowledge that is not immediately accessible to their introspection. And only by giving them the inputs they were trained on (eg words, sentences, ...) can you bring that knowledge to the fore.

In this post, I’ll try to formalise these accessible preferences, starting with formalising preferences in general.

Basic preferences setup

This section will formalise the setup presented in Alice’s example. Let $W$ be a set of all possible worlds. A human makes use of a model $M$. This model contains a lot of variables $v_1, \ldots, v_n$, called properties. These take values in domains $D_1, \ldots, D_n$.

A basic set of states in $M$ is a set of possible values for some of the $v_i$. Thus $S = \prod_i S_i$, with $S_i \subseteq D_i$. The property $v_i$ is unconstrained in $S$ if $S_i = D_i$. A general set of states is a union of basic $S$; let $\mathcal{S}$ be the set of all these sets of states.
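For concreteness, here is a minimal Python sketch of how a model, its properties, and basic/general sets of states might be represented. All the names (`Model`, `BasicStateSet`, and so on) are my own illustrative choices, not anything defined in the post.

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    """A model M: named properties v_i, each with a finite domain D_i."""
    domains: dict[str, frozenset]  # property name -> its domain D_i

@dataclass
class BasicStateSet:
    """A basic set of states S = prod_i S_i, with S_i a subset of D_i.

    Properties not listed in `constraints` are unconstrained (S_i = D_i).
    """
    model: Model
    constraints: dict[str, frozenset] = field(default_factory=dict)

    def allowed(self, prop: str) -> frozenset:
        """S_i: the values property `prop` may take in this set of states."""
        return self.constraints.get(prop, self.model.domains[prop])

    def is_unconstrained(self, prop: str) -> bool:
        """A property is unconstrained in S if S_i = D_i."""
        return self.allowed(prop) == self.model.domains[prop]

# A general set of states is a union of basic ones.
GeneralStateSet = list[BasicStateSet]
```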

For example, a human could be imagining four of their friends, and the properties could be whether friend $i$ is sleeping with friend $j$ ($6$ different Boolean $v_{ij}$), and also whether a third friend $k$ believes $i$ and $j$ are sleeping together ($12$ different $b_{kij}$, taking values in {sleeping together, not sleeping together, don’t know}).

Then a statement of human gossip like “X is sleeping with Y, but A doesn’t realise it; in fact, A thinks that Y is sleeping with Z, which is totally not true!” is encoded as:

  • $v_{XY} = \text{true}$, $b_{AXY} \neq \text{sleeping together}$, $b_{AYZ} = \text{sleeping together}$, and $v_{YZ} = \text{false}$, with the other $v$s and $b$s unconstrained.

It’s interesting how unintuitive that formulation is, compared with how our brains instinctively parse gossip.
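To make the encoding concrete, here is the gossip statement written out with the illustrative classes from the sketch above. The property names, and the reading of “A doesn’t realise it” as “anything except believing they are sleeping together”, are my own assumptions.

```python
BOOL = frozenset({True, False})
BELIEF = frozenset({"sleeping together", "not sleeping together", "don't know"})

# Four friends: 6 Boolean pair-properties v_ij, plus 12 belief-properties b_kij
# (friend k's belief about whether i and j are sleeping together).
friends = ["A", "X", "Y", "Z"]
domains: dict[str, frozenset] = {}
for n, i in enumerate(friends):
    for j in friends[n + 1:]:
        domains[f"v_{i}{j}"] = BOOL
        for k in friends:
            if k not in (i, j):
                domains[f"b_{k}{i}{j}"] = BELIEF

gossip_model = Model(domains=domains)

# "X is sleeping with Y, but A doesn't realise it; in fact, A thinks that
#  Y is sleeping with Z, which is totally not true!"
gossip = BasicStateSet(
    model=gossip_model,
    constraints={
        "v_XY": frozenset({True}),
        "b_AXY": frozenset({"not sleeping together", "don't know"}),
        "b_AYZ": frozenset({"sleeping together"}),
        "v_YZ": frozenset({False}),
    },
)
# Every other property is left unconstrained.
```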

To make use of these, the symbols need to be grounded. This is achieved via a function $G$ that takes a set of states and maps it to a set of worlds: $G : \mathcal{S} \to 2^W$.
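A grounding function might then be sketched as a filter over candidate worlds. How a possible world exposes its property values is an assumption of this sketch (a `properties` dict); real grounding is of course far harder than this.

```python
@dataclass
class World:
    """A possible world, illustratively tagged with the values it gives to the
    model's properties (real worlds do not come so conveniently labelled)."""
    properties: dict[str, object]

def ground(state_set: GeneralStateSet, candidate_worlds: list[World]) -> list[World]:
    """G: map a (general) set of states to the worlds consistent with it."""
    return [
        w for w in candidate_worlds
        if any(
            all(w.properties.get(p) in basic.allowed(p)
                for p in basic.model.domains)
            for basic in state_set
        )
    ]
```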

Finally, the human expresses a judgement about the states of $M$, mentally categorising one set of states as better than another. This is an antisymmetric partial function $J : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ (so $J(S, S') = -J(S', S)$ where defined), a partial function that is non-trivial on at least one pair of inputs.

For example, if $S$ is the gossip set above, and $S'$ is the same statement with $v_{YZ} = \text{true}$, then a human that values honesty might judge $J(S, S') < 0$; ie it is worse if $A$ believes a lie about $Y$ and $Z$.

The sign of $J(S, S')$ informs which of the two sets the human prefers (positive meaning $S$ is preferred); the magnitude is the difficult-to-define weight or intensity of the preference.
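Here is one way the judgement $J$ could be stored, again purely as illustration: a partial, antisymmetric table of real-valued comparisons, keyed (crudely) by object identity, with the honesty example filled in. The sign convention (positive means the first argument is preferred) and the value $-1$ are my own choices.

```python
class Judgement:
    """J: a partial, antisymmetric map from pairs of state sets to reals.

    J(S, S') > 0 is read as "S is judged better than S'"; the magnitude is the
    (hard-to-define) intensity of the preference; None means "no judgement".
    """

    def __init__(self):
        # Keyed by object identity of the state sets -- crude, but enough
        # for a sketch.
        self._table: dict[tuple[int, int], float] = {}

    def judge(self, s, s_prime, value: float) -> None:
        self._table[(id(s), id(s_prime))] = value
        self._table[(id(s_prime), id(s))] = -value  # enforce antisymmetry

    def compare(self, s, s_prime):
        return self._table.get((id(s), id(s_prime)))  # None where J is undefined

# S': the same gossip statement, except that Y and Z really are sleeping together.
gossip_true = BasicStateSet(
    model=gossip_model,
    constraints={**gossip.constraints, "v_YZ": frozenset({True})},
)

honesty_J = Judgement()
honesty_J.judge(gossip, gossip_true, -1.0)  # worse if A believes a lie about Y and Z
```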

Hypotheticals posed to the human

Let $\mathcal{MJ}$ be the set of possible model-judgement pairs $(M, J)$ defined in the previous section. Humans rarely consider many at the same time. We often only consider one, or zero.

A hypothetical is some possible short intervention (a friend asks them a question, they get an email, a TV in the background shows something salient) that will cause a human to mentally use a model $M$ and pass a judgement $J$ within it. Note that this is not the same as Paul Christiano’s definition of ascription: we don’t actually need the human to answer anything, just to think.

So if $H_t$ is the set of possible hypothetical interventions at time $t$, we have a (counterfactual) map $h_t$ from $H_t$ to $\mathcal{MJ}$.

Now, not all moments are ideal for a human to do much reflection (though a lot of instinctive reactions are also very informative). So it might be good to expand the time a bit, to, say, a week, and consider all the models that a human could hypothetically be made to consider in that time.

So let $H_{[t, t']}$ be the set of hypothetical short interventions from time $t$ to $t'$, given that this intervention is the first in that time period. Then there is a natural map

  • $h_{[t, t']} : H_{[t, t']} \to \mathcal{MJ}$.
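In code, $h_{[t, t']}$ is just “intervention in, model-and-judgement pair out”. The sketch below tabulates it from a placeholder `elicit` callback standing in for the (entirely counterfactual) process of the human actually reacting; both names, and the use of plain strings for interventions, are my own assumptions.

```python
from typing import Callable

Intervention = str  # a short prompt or event, e.g. a question from a friend
ModelAndJudgement = tuple[Model, Judgement]

def hypothetical_map(
    elicit: Callable[[Intervention, float], ModelAndJudgement],
    first_interventions: dict[Intervention, float],  # intervention -> time in [t, t']
) -> dict[Intervention, ModelAndJudgement]:
    """Tabulate h_[t, t']: what each candidate first-intervention would elicit.

    `elicit(h, time)` stands in for the idealised counterfactual "pose h to the
    human at that time and record the (M, J) they mentally use".
    """
    return {h: elicit(h, time) for h, time in first_interventions.items()}
```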

Idealised object

The map $h_{[t, t']}$ is a highly idealised and counterfactual object: there is no way we can actually test a human on the vast number of possible interventions. So the AI would not be tasked with “use $h_{[t, t']}$ to establish human preferences”, but “estimate $h_{[t, t']}$ to estimate human preferences”.
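A crude sketch of what “estimate $h_{[t, t']}$” might look like at its very simplest: evaluate the map on a manageable sample of interventions, and leave the generalisation step to whatever learning machinery the AI has. Everything here (the random sampling, the budget) is a placeholder, not a proposal from the post.

```python
import random

def sample_hypothetical_map(
    candidate_interventions: list[Intervention],
    elicit: Callable[[Intervention, float], ModelAndJudgement],
    t_start: float,
    t_end: float,
    budget: int = 100,
) -> dict[Intervention, ModelAndJudgement]:
    """Evaluate h_[t, t'] on a small random sample of interventions.

    A real estimator would then generalise from this sample (and from
    everything else it knows about the human) to the rest of the map.
    """
    sample = random.sample(candidate_interventions,
                           k=min(budget, len(candidate_interventions)))
    return {h: elicit(h, random.uniform(t_start, t_end)) for h in sample}
```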

The $h_{[t, t']}$ will also reveal a lot of contradictions, since humans often have different opinions on the same subject, depending on how the intervention or question is phrased. Different phrasings may trigger different internal models of the same issue, or even different judgements within the same model. And, of course, the same intervention at different times (or by different agents) may trigger different reactions.
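Those contradictions are easy to state in the sketch’s terms: two interventions contradict each other whenever their elicited judgements disagree in sign on some pair of state sets they both cover. Something like the check below (the identity-based keying inherited from the earlier sketch is, again, an assumption).

```python
def find_contradictions(
    elicited: dict[Intervention, ModelAndJudgement],
    shared_pairs: list[tuple[BasicStateSet, BasicStateSet]],
) -> list[tuple[Intervention, Intervention]]:
    """Pairs of interventions whose judgements disagree in sign somewhere."""
    contradictions = []
    items = list(elicited.items())
    for n, (h1, (_, j1)) in enumerate(items):
        for h2, (_, j2) in items[n + 1:]:
            for s, s_prime in shared_pairs:
                a = j1.compare(s, s_prime)
                b = j2.compare(s, s_prime)
                if a is not None and b is not None and a * b < 0:
                    contradictions.append((h1, h2))
                    break
    return contradictions
```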

But dealing with contradictions is just one of the things that we have to sort out with human preferences.

Minimum modification

I mentioned that the interventions should be short; that $t' - t$ should be a short period; and that the interventions in $H_{[t, t']}$ should be the first in that time period. The whole idea is to avoid “modifying” the human too much, or giving the AI too much power to change, rather than reflect, the human’s values. The human’s reaction should be as close as possible to an unvarnished initial reaction.

There may be other ways of reducing the AI’s influence, but it is still useful to get these initial reactions.

One-step hypotheticals

In slight contrast with the previous section, it is very valuable to get the human to reflect on new issues they hadn’t considered before. For example, we could introduce them to philosophical thought experiments they hadn’t seen before (maybe the trolley problem or the repugnant conclusion, or unusual variants of these), or present ideas that cut across their usual political boundaries, or the boundaries of their categories (eg whether Neanderthals should have human rights if a tribe of them were suddenly discovered today).

This is, in a sense, a minimum extrapolation, the very first tentative step of CEV. We are not asking what the human would think if they were smarter, but instead what they would think if they encountered a novel problem for the first time.

These “one-step hypotheticals” are thus different from the human’s everyday current judgement, yet don’t involve transforming the human into something else.

EDIT: Avturchin asks whether I expect these one-step hypotheticals to reveal hidden preferences, or to force humans to make a choice, knowing that they might have made a different choice in different circumstances.

The answer is… a bit of both. I expect the hypotheticals to sometimes contradict each other, depending on the phrasing and the timing. I expect them to contradict each other more than more usual questions (“zero-step hypotheticals”) do.

But I don’t expect the answers to be completely random, either. There will be a lot of information there. And the pattern of different interventions $h$ leading to different or contradictory pairs $(M, J)$ is relevant, and not random.