Now that I’ve written Learning Normativity, I have some more clarity around the concept of “normativity” I was trying to get at, and want to write about it more directly. Whereas that post was more oriented toward the machine learning side of things, this post is more oriented toward the philosophical side. However, it is still relevant to the research direction, and I’ll mention some issues relevant to value learning and other alignment approaches.

How can we talk about what you “should” do?

A Highly Dependent Concept

Now, obviously, what you should do depends on your goals. We can (at least as a rough first model) encode this as a utility function (but see my objection).

What you should do also depends on what’s the case. Or, really, it depends on what you believe is the case, since that’s what you have to go on.

Since we also have uncertainty about values (and we’re interested in building machines which should have value uncertainty as well, in order to do value learning), we have to talk about beliefs-about-goals, too. (Or beliefs about utility functions, or however it ends up getting formalized.) This includes moral uncertainty.

Even worse, we have a lot of uncertainty about decision theory—that is, we have uncertainty about how to take all of this uncertainty we have, and make it into decisions. Now, ideally, decision theory is not something the normatively correct thing depends on, like all the previous points, but rather is a framework for finding the normatively correct thing given all of those things. However, as long as we’re uncertain about decision theory, we have to take that uncertainty as input too—so, if decision theory is to give advice to realistic agents who are themselves uncertain about decision theory, decision theory also takes decision-theoretic uncertainty as an input. (In the best case, this makes bad decision theories capable of self-improvement.)
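One very simple way to picture “decision theory taking decision-theoretic uncertainty as input” is a credence-weighted vote among candidate decision theories. This is purely an illustrative sketch, not a formalism from the post: the two toy “decision theories” and all names here are hypothetical.

```python
# Hypothetical sketch: choosing an action under uncertainty about which
# decision theory is correct. Each candidate theory scores actions given
# beliefs and values; we take a credence-weighted vote across theories.

def choose(action_set, beliefs, values, dt_credences):
    """dt_credences maps decision-theory functions to credences in them."""
    def mixed_score(action):
        return sum(credence * dt(action, beliefs, values)
                   for dt, credence in dt_credences.items())
    return max(action_set, key=mixed_score)

# Two toy "decision theories" that merely weight value differently:
dt_a = lambda a, b, v: v[a]          # cares only about stated value
dt_b = lambda a, b, v: v[a] * b[a]   # discounts by believed feasibility

values = {"help": 10, "rest": 4}
beliefs = {"help": 0.5, "rest": 1.0}
best = choose(["help", "rest"], beliefs, values, {dt_a: 0.6, dt_b: 0.4})
```

A mixture like this is itself a decision-theoretic choice, of course—which is exactly the regress the paragraph above is pointing at.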

Clearly, we can be uncertain about how that is supposed to work.

By now you might get the idea. “Should” depends on some necessary information (let’s call them the “givens”). But for each set of givens you claim is complete, there can be reasonable doubt about how to use those givens to determine the output. So we can create meta-level givens about how to use those givens.

Rather than stopping at some finite level, such as learning the human utility function, I’m claiming that we should learn all the levels. This is what I mean by “normativity”—the information at all the meta-levels, which we would get if we were to unpack “should” forever. I’m putting this out there as my guess at the right type signature for human values.
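To make the “type signature” talk concrete, here is one toy rendering of the idea, under my own assumptions (not the post’s actual formalism): a value is not a single object but an unbounded tower, where each level carries data plus a lazily-built next level describing how to use the level below.

```python
# Illustrative only: "normativity" as an infinite tower of levels,
# each level holding some givens plus a (lazy) meta-level about how
# to interpret and update those givens.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Level:
    data: object                     # this level's givens
    meta: Callable[[], "Level"]      # thunk producing the next level up

def tower(level_data):
    """Build an unbounded tower from a function giving each level's data."""
    def at(n):
        return Level(data=level_data(n), meta=lambda: at(n + 1))
    return at(0)

t = tower(lambda n: f"level-{n} givens")
```

The laziness matters: no finite agent ever materializes the whole tower, but any particular meta-level can be unpacked on demand.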

I’m not mainly excited about this because of the prospect of including moral uncertainty, or uncertainty about the correct decision theory, in a friendly AI—nor because I think those are going to be particularly huge failure modes which we need to avert. Rather, I’m excited about this because it is the first time I’ve felt like I’ve had any handles at all for getting basic alignment problems right (wireheading, human manipulation, goodharting, ontological crisis) without a feeling that things are obviously going to blow up in some other way.

Normative vs Descriptive Reasoning

At this stage you might accuse me of committing the “turtles all the way down” fallacy. In Passing The Recursive Buck, Eliezer describes the error of accidentally positing an infinite hierarchy of explanations:

The general antipattern at work might be called “Passing the Recursive Buck”.


How do you stop a recursive buck from passing?

You use the counter-pattern: The Recursive Buck Stops Here.

But how do you apply this counter-pattern?

You use the recursive buck-stopping trick.

And what does it take to execute this trick?

Recursive buck-stopping talent.

And how do you develop this talent?

Get a lot of practice stopping recursive bucks.


However, in Where Recursive Justification Hits Rock Bottom, Eliezer discusses a kind of infinite-recursion reasoning applied to normative matters. He says:

But I would nonetheless emphasize the difference between saying:

“Here is this assumption I cannot justify, which must be simply taken, and not further examined.”

Versus saying:

“Here the inquiry continues to examine this assumption, with the full force of my present intelligence—as opposed to the full force of something else, like a random number generator or a magic 8-ball—even though my present intelligence happens to be founded on this assumption.”

Still… wouldn’t it be nice if we could examine the problem of how much to trust our brains without using our current intelligence? Wouldn’t it be nice if we could examine the problem of how to think, without using our current grasp of rationality?

When you phrase it that way, it starts looking like the answer might be “No”.

So, with respect to normative questions, such as what to believe, or how to reason, we can and (to some extent) should keep unpacking reasons forever—every assumption is subject to further scrutiny, and as a practical matter we have quite a bit of uncertainty about meta-level things such as our values, how to think about our values, etc.

This is true despite the fact that with respect to descriptive questions the recursive buck must stop somewhere. Taking a descriptive stance, my values and beliefs live in my neurons. From this perspective, “human logic” is not some advanced logic which logicians may discover some day, but rather just the set of arguments humans actually respond to. Again quoting another Eliezer article,

The phrase that once came into my mind to describe this requirement, is that a mind must be created already in motion. There is no argument so compelling that it will give dynamics to a static thing. There is no computer program so persuasive that you can run it on a rock.

So in a descriptive sense the ground truth about your values is just what you would actually do in situations, or some information about the reward systems in your brain, or something resembling that. In a descriptive sense the ground truth about human logic is just the sum total of facts about which arguments humans will accept.

But in a normative sense, there is no ground truth for human values; instead, we have an updating process which can change its mind about any particular thing; and that updating process itself is not the ground truth, but rather has beliefs (which can change) about what makes an updating process legitimate. Quoting from the relevant section of Radical Probabilism:

The radical probabilist does not trust whatever they believe next. Rather, the radical probabilist has a concept of virtuous epistemic process, and is willing to believe the next output of such a process. Disruptions to the epistemic process do not get this sort of trust without reason.

I worry that many approaches to value learning attempt to learn a descriptive notion of human values, rather than the normative notion. This means stopping at some specific proxy, such as what humans say their values are, or what humans reveal their preferences to be through action, rather than leaving the proxy flexible and trying to learn it as well, while also maintaining uncertainty about how to learn, and so on.

I’ve mentioned “uncertainty” a lot while trying to unpack my hierarchical notion of normativity. This is partly because I want to insist that we have “uncertainty at every level of the hierarchy”, but also because uncertainty is itself a notion to which normativity applies, and thus generates new levels of the hierarchy.

Normative Beliefs

Just as one might argue that logic should be based on a specific set of axioms, with specific deduction rules (and a specific sequent calculus, etc.), one might similarly argue that uncertainty should be managed by a specific probability theory (such as the Kolmogorov axioms), with a specific kind of prior (such as a description-length prior), and specific update rules (such as Bayes’ Rule), etc.
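The foundationalist recipe just named can be written down in a few lines. This is a minimal sketch with toy hypotheses and made-up likelihoods, shown only to make the “fixed prior, fixed update rule” picture concrete:

```python
# The foundationalist package: a description-length prior plus Bayes' Rule.
# Hypotheses are represented as strings; their length stands in for
# description length. Likelihoods below are toy numbers.

def length_prior(hypotheses):
    # Prior weight proportional to 2^(-description length), normalized.
    weights = {h: 2 ** (-len(h)) for h in hypotheses}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

def bayes_update(prior, likelihood):
    # P(h | e) is proportional to P(e | h) * P(h).
    posterior = {h: likelihood[h] * p for h, p in prior.items()}
    z = sum(posterior.values())
    return {h: w / z for h, w in posterior.items()}

prior = length_prior(["ab", "abcd"])    # shorter hypothesis starts favored
post = bayes_update(prior, {"ab": 0.1, "abcd": 0.9})
```

The claims below are precisely that each fixed ingredient here—the update rule, the axioms behind it, and the length-based prior—is open to normative challenge.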

This general approach—that we set up our bedrock assumptions from which to proceed—is called “foundationalism”.

I claim that we can’t keep strictly to Bayes’ Rule—not if we want to model highly-capable systems in general, not if we want to describe human reasoning, and not if we want to capture (the normative) human values. Instead, how to update in a specific instance is a more complex matter which agents must figure out.

I claim that the Kolmogorov axioms don’t tell us how to reason—we need more than an uncomputable ideal; we also need advice about what to do in our boundedly-rational situation.

And, finally, I claim that length-based priors such as the Solomonoff prior are malign—description length seems to be a really important heuristic, but there are other criteria by which we want to judge hypotheses.

So, overall, I’m claiming that a normative theory of belief is a lot more complex than Solomonoff would have you believe. Things that once seemed objectively true now look like rules of thumb. This means the question of normatively correct behavior is wide open even in the simple case of trying to predict what comes next in a sequence.

Now, Logical Induction addresses all three of these points (at least, giving us progress on all three fronts). We could take the lesson to be: we just had to go “one level higher”, setting up a system like logical induction which learns how to probabilistically reason. Now we are at the right level for foundationalism. Logical induction, not classical probability theory, is the right principle for codifying correct reasoning.

Or, if not logical induction, perhaps the next meta-level will turn out to be the right one?

But what if we don’t have to find a foundational level?

I’ve updated to a kind of quasi-anti-foundationalist position. I’m not against finding a strong foundation in principle (and indeed, I think it’s a useful project!), but I’m saying that as a matter of fact, we have a lot of uncertainty, and it sure would be nice to have a normative theory which allowed us to account for that (a kind of afoundationalist normative theory—not anti-foundationalist, but not strictly foundationalist, either). This should still be a strong formal theory, but one which requires weaker assumptions than usual (in much the same way reasoning about the world via probability theory requires weaker assumptions than reasoning about the world via pure logic).

Stopping at ω

My main objection to anti-foundationalist positions is that they’re just giving up; they don’t answer questions and offer insight. Perhaps that’s a lack of understanding on my part. (I haven’t tried that hard to understand anti-foundationalist positions.) But I still feel that way.

So, rather than give up, I want to provide a framework which holds across meta-levels (as I discussed in Learning Normativity).

This would be a framework in which an agent can balance uncertainty at all the levels, without dogmatic foundational beliefs at any level.

Doesn’t this just create a new infinite meta-level, above all of the finite meta-levels?

A mathematical analogy would be to say that I’m going for “cardinal infinity” rather than “ordinal infinity”. The first ordinal infinity is ω, which is greater than all finite numbers. But ω is less than ω+1. So building something at “level ω” would indeed be “just another meta-level” which could be surpassed by level ω+1, which could be surpassed by ω+2, and so on.

Cardinal infinities, on the other hand, don’t work like that. The first infinite cardinal is ℵ₀, but ℵ₀ + 1 = ℵ₀ -- we can’t get bigger by adding one. This is the sort of meta-level I want: a meta-level which also oversees itself in some sense, so that we aren’t just creating a new level at which problems can arise.

This is what I meant by “collapsing the meta-levels” in Learning Normativity. The finite levels might still exist, but there’s a level at which everything can be put together.

Still, even so, isn’t this still a “foundation” at some level?

Well, yes and no. It should be a framework in which a very broad range of reasoning could be supported, while also making some rationality assumptions. In this sense it would be a theory of rationality purporting to “explain” (i.e. categorize/organize) all rational reasoning (with a particular, but broad, notion of rationality), and so it seems not so different from other foundational theories.

On the other hand, this would be something more provisional by design—something which would “get out of the way” of a real foundation if one arrived. It would seek to make far fewer claims overall than is usual for a foundationalist theory.

What’s the hierarchy?

So far, I’ve been pretty vague about the actual hierarchy, aside from giving examples and talking about “meta-levels”.

The ordinal analogy brings to mind a linear hierarchy, with a first level and a series of higher and higher levels. Each next level does something like “handling uncertainty about the previous level”.

However, my recursive quantilization proposal created a branching hierarchy. This is because the building block for that hierarchy required several inputs.

I think the exact form of the hierarchy is a matter for specific proposals. But I do think some specific levels ought to exist:

  • Object-level values.

  • Information about value-learning, which helps update the object-level values.

  • Object-level beliefs.

  • Generic information about what distinguishes a good hypothesis. This includes Occam’s razor as well as information about what makes a hypothesis malign.
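The levels above, with their branching advice relationships, could be sketched as a small graph. The node names and the particular edges here are my own illustrative reading (e.g. I have the hypothesis-quality node advising both beliefs and value learning, which is one way a building block could take several inputs), not a worked-out proposal:

```python
# Illustrative data structure for a branching hierarchy of "givens":
# each node holds one kind of information, and "advises" edges say
# which other levels it helps update.

from dataclasses import dataclass, field

@dataclass
class GivensNode:
    name: str
    advises: list = field(default_factory=list)  # nodes this one helps update

object_values = GivensNode("object-level values")
value_learning = GivensNode("value-learning information",
                            advises=[object_values])
object_beliefs = GivensNode("object-level beliefs")
hypothesis_quality = GivensNode("what makes a good hypothesis",
                                advises=[object_beliefs, value_learning])
```

Further meta-levels would just be more nodes with edges into these.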

Normative Values

It’s difficult to believe humans have a utility function.

It’s easier to believe humans have expectations on propositions, but this still falls apart at the seams (e.g., not all propositions are explicitly represented in my head at a given moment, it’ll be difficult to define exactly which neural signals are the expectations, etc.).

We can try to define values as what we would think if we had a really long time to consider the question; but this has its own problems, such as humans going crazy or experiencing value drift if they think for too long.

We can try to define values as what a human would think after an hour, if that human had access to HCH; but this relies on the limited ability of a human to use HCH to accelerate philosophical progress.

Imagine a value-learning system where you don’t have to give any solid definition of what it is for humans to have values, but rather can give a number of proxies, point to flaws in the proxies, give feedback on how to reason about those flaws, and so on. The system would try to generalize all of this reasoning, to figure out what the thing being pointed at could be.

We could describe humans deliberating under ideal conditions, point out issues with humans getting old, discuss what it might mean for those humans to go crazy or experience value drift, examine how the system is reasoning about all of this and give feedback, discuss what it would mean for those humans to reason well or poorly, …

We could never entirely pin down the concept of human values, but at some point, the system would be reasoning so much like us (or rather, so much like we would want to reason) that this wouldn’t be a concern.
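The interaction pattern described in the last few paragraphs—proxies, critiques of proxies, critiques of critiques—can be sketched as a schematic feedback loop. Everything here is a hypothetical stand-in: a real system would *learn* how to generalize the feedback, whereas this toy merely records it.

```python
# Schematic loop for proxy-based value learning: the designer supplies
# proxies for human values and flags flaws, without ever committing to
# a final definition. Purely illustrative structure.

class ProxyValueLearner:
    def __init__(self):
        self.proxies = []      # candidate operationalizations of "values"
        self.critiques = []    # pointed-out flaws, at any meta-level

    def add_proxy(self, description):
        self.proxies.append(description)

    def critique(self, target, flaw):
        # A critique may target a proxy, or a previous critique.
        self.critiques.append((target, flaw))

    def current_view(self):
        # Toy stand-in for generalization: drop proxies with known flaws.
        # A real system would instead learn how to weigh flawed proxies.
        flawed = {target for target, _ in self.critiques}
        return [p for p in self.proxies if p not in flawed]

learner = ProxyValueLearner()
learner.add_proxy("stated preferences")
learner.add_proxy("revealed preferences")
learner.critique("revealed preferences", "akrasia distorts actions")
```

The point of the structure is that nothing in it is final: every proxy, and every critique, remains open to further critique.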

Comparison to Other Approaches

This is most directly an approach for solving meta-philosophy.

Obviously, the direction indicated in this post has a lot in common with Paul-style approaches. My outside view is that this is me reasoning my way around to a Paul-ish position. However, my inside view still has significant differences, which I haven’t fully articulated for myself yet.