Meta-preferences are weird

(Talk given at an event on Sunday 28th of June. Charlie Steiner is responsible for the talk; Jacob Lagerros and David Lambert edited the transcript.

If you’re a cu­rated au­thor and in­ter­ested in giv­ing a 5-min talk, which will then be tran­scribed and ed­ited, sign up here.)

Charlie Steiner: My talk is called Meta-Preferences Are Weird. It is something I have had on my mind recently, so I figured I would share my thoughts on describing preferences, and on trying to imagine what it would be like to have something that not only learns our preferences but also respects this idea of meta-preferences: that is, how we would like to learn and grow, and what we want to be modelled after.

Smoking is a very clear-cut example of meta-preferences. If we think of a smoker as having a meta-preference against smoking, we can cash this out in a couple of different ways, because preferences aren’t really written in Fortran on the inside of our skulls. They’re just a pattern that we use to predict things and talk about humans, including ourselves.

So when we say a smoker wants to quit smok­ing, it might mean that they would take a pill that stops them from smok­ing if you offered it to them. Or it might mean that they have cer­tain emo­tions and ways of talk­ing that are re­lated to want­ing to quit smok­ing. Or it might mean that, if you are train­ing a su­per­hu­man AI, the smoker would try to change the AI’s code so that it doesn’t con­sider smok­ing as one of the real prefer­ences.

Th­ese ac­tions don’t have a shared essence, but they have a shared pat­tern. So meta-prefer­ences are just a sub­set of prefer­ences which are, in turn, this pat­tern in our be­havi­our and our thoughts.

If I take an ac­tion that changes my prefer­ences (like grow­ing as a hu­man be­ing and de­vel­op­ing em­pa­thy) this will also prob­a­bly change my meta-prefer­ences, and what sort of ac­tions I will take that will im­pact my fu­ture prefer­ences.

So maybe there’s a slippery slope effect where once you go part of the way, you’re more likely to go further. But there could also be non-linear effects where, for example, I make a change that increases my intelligence, like reading a textbook, and this changes my meta-preferences in a relatively unrelated way, which I won’t go into because that’s a long digression.

The metaphor that I find really compelling, but I’m not sure if it’s the right metaphor, is of an optimization landscape. If we imagine this process of growth and self-reflection and self-improvement and possibly self-destruction as flowing in an optimization landscape, then it’s really easy to visualize things in two dimensions.

You can visualize this question: if I am flowing downhill like water, do I end up at a nearby lake? This is what we might imagine as something still human, but perhaps with some updated preferences and maybe a little more consistency.

Or does this pro­cess di­verge? Do I even­tu­ally flow down to the Mis­sis­sippi River and then out into the ocean? In this case, the ocean is some bor­ing state that doesn’t re­flect much of my ori­gin. It loses the in­for­ma­tion that I cur­rently con­tain and in­stead is a basin where lots of differ­ent agents might end up.

It’s an at­trac­tor that at­tracts too much of the space to be in­ter­est­ing. This no­tion of flow is re­ally in­ter­est­ing, and it’s com­pel­ling to try to imag­ine do­ing value learn­ing, and do­ing this ar­tifi­cial or pro­gram­matic up­dat­ing on meta-prefer­ences in a way that looks like this flow.

But in re­al­ity, our ac­tions that af­fect our own prefer­ences might be dis­con­tin­u­ous.

If we have discontinuous jumps, then it messes up this notion of nearby attractors being good and far away attractors being bad. But it’s plausible that there is some way of thinking about this using a high-dimensional representation, like the abstract features in deep learning, where you can smoothly transition one face into another face, with all points in between still being recognizable faces.

And using a high-dimensional representation might sew the jumps back together so that it restores some notion of flow under meta-preferences. That’s the weird thing.
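
(Editors' sketch, not part of the talk. The landscape picture can be made concrete with a toy example. The one-dimensional landscape below, the positions of the "lake", "ridge" and "ocean", the step size, and the size of the jump are all assumptions chosen for illustration. It shows smooth downhill flow settling into whichever basin the starting point sits in, and a single discontinuous jump carrying a point that had settled in the lake over the ridge and into the ocean's basin.

```python
# Toy 1-D landscape. Every number here is an illustrative assumption.
LAKE, RIDGE, OCEAN = -1.5, -1.0, 1.0  # two resting points and the ridge between them


def slope(x):
    """Derivative of the landscape height: zero at the lake, the ridge and the ocean."""
    return (x - LAKE) * (x - RIDGE) * (x - OCEAN)


def flow(x, step=0.01, n_steps=3000):
    """Follow the downhill direction in small steps (plain gradient descent)."""
    for _ in range(n_steps):
        x -= step * slope(x)
    return x


def attractor(x):
    """Name the resting point that a starting position x flows to."""
    return "lake" if abs(flow(x) - LAKE) < 1e-3 else "ocean"


def basin_shares(n=200, lo=-2.0, hi=2.0):
    """Fraction of evenly spaced starting points captured by each attractor."""
    starts = [lo + (hi - lo) * (i + 0.5) / n for i in range(n)]
    labels = [attractor(x) for x in starts]
    return {name: labels.count(name) / n for name in ("lake", "ocean")}


shares = basin_shares()

# Settle into the lake first, then make one discontinuous jump of +1.0
# and let the flow continue from wherever the jump lands.
after_jump = attractor(flow(-1.4) + 1.0)

print(shares, after_jump)
```

In this toy setup the ocean captures most of the starting points, which is the sense in which it "attracts too much of the space", and one jump is enough to move from the nearby lake into the ocean's basin.)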


Ben Pace: Thank you very much. That’s a brilli­ant vi­su­al­iza­tion of chang­ing your hu­man­ity while keep­ing your hu­man­ity. There’s a deep neu­ral net­work chang­ing a face to be a to­tally differ­ent per­son whilst, at each point, still be­ing a per­son. You should use that in a blog post at some point.

I’m trying to get a more concrete handle on times when I have meta-preferences. Other than very direct ones, like addictions, where I have straightforward desires that I don’t want, I feel like there are also a lot of times where I want to change the sort of person that I am.

I’m like, “Oh, I wish I was more of that sort of per­son”, and that can have a lot of sur­pris­ing knock-on effects in ways that I didn’t ex­pect. You can try to be more of an agree­able per­son or you can try and be more of a dis­agree­able per­son. And I think this has a lot of sur­pris­ing effects on your val­ues.

Some­times, if you don’t want it, you no­tice and course-cor­rect in time. But some­times you don’t. But I don’t have a good the­ory of ex­actly how to no­tice when these sorts of things will have sur­pris­ing knock-on effects. Ruben Bloom, would you like to ask your ques­tion?

Ruben Bloom: Yes. I’m curious if any of these problems of meta-preferences could be approached from a multi-agent angle, the way that Kaj Sotala tends to write about this. You’d say, “Okay, I’ve got this smaller sub-agent of me who likes smoking, and this other sub-agent who doesn’t like it,” and so forth.

Have you thought about that lens and whether that’s com­pa­rable with what you’re think­ing about? And if so, whether it is a good way to think about it?

Charlie Steiner: No, I haven’t thought about it like that. You could imagine a parliamentary model where you think of this as continuously changing the vote share allocated to different sub-agents. But again, in reality, there might be discontinuous jumps. So it is probably not going to be the way I end up thinking about it.

Ben Pace: Thanks a lot. Ruby, did you want to follow up on that or are you good?

Ruben Bloom: The only thing is that I think I would expect it to end up being an interesting equivalence between the different formulations you can put on it. One would be like, “Everything is like multiple agents being piled together.” And the other would be like, “No, we just have a single entity.” And either frame works.

Ben Pace: Thanks Ruby. Daniel, do you want to ask your ques­tion?

Daniel Kokotajlo: The issue of discontinuous jumps doesn’t seem like it’s going to ultimately be a problem for you. It seems like it messes with a nice technical definition of distance that you were hoping for, but surely there are more sophisticated definitions that get around that problem. Do you share that sense, or do you think this really undermines something significant?

Charlie Steiner: To go into a little more detail, one thing that I might think of trying to do is, rather than sticking with reality, where people are chaotic (and may, for example, commit suicide, as a very discontinuous sort of jump), to put a distribution on what sort of self-altering actions they could take, and then average everything together and try to remove this stochastic behaviour.

But I think that the non-lin­ear­i­ties in what ac­tions you take kill that, un­less you are do­ing this av­er­ag­ing in a very high-di­men­sional ab­stract space where you’re rep­re­sent­ing all the jumps as lin­ear any­how. So yes, I don’t know. I think it’s plau­si­ble, but I think I’ll keep com­ing up with prob­lems for a while.

Does that make sense?

Daniel Kokotajlo: What I meant was: suppose you admit that it’s discontinuous sometimes. It seems like you could still say things like, “This big basin of attraction is probably bad and we maybe don’t want to end up there.” Or you could be like, “Well, actually, in this particular case, I jump straight into that basin in one particular jump.”

And maybe that’s a reason to think that it’s actually not so bad, at least from my perspective. Because in some sense, it’s a very short distance from me, since there was only one jump. It doesn’t seem to me that accepting discontinuities into your system necessarily ruins the overall picture.

Charlie Steiner: Yes, I would more or less agree. I don’t know; I think it’s hard to say. Once you accept that sort of thing, it’s hard to say, “Well, it’s not so bad because it’s a small distance, because it’s just one jump.” Because distance is exactly the problem. But I don’t know. There are more problems I didn’t go into.