Minimization of prediction error as a foundation for human values in AI alignment

I’ve mentioned in posts twice (and previously in several comments) that I’m excited about predictive coding, specifically the idea that the human brain either is or can be modeled as a hierarchical system of (negative feedback) control systems that try to minimize error in predicting their inputs, with some strong (possibly un-updatable) prediction set points (priors). I’m excited because I believe this approach describes a wide range of human behavior, including subjective mental experiences, better than any other theory of how the mind works; it is compatible with many other theories of brain and mind; and it may give us an adequate way to ground human values precisely enough to be useful in AI alignment.
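A minimal sketch of one such control unit may make the idea concrete. It is an illustration under stated assumptions, not a model of actual neurons: the class name `ControlUnit`, the `gain` and `learning_rate` parameters, and the linear update rules are all hypothetical choices made for this example. It shows the two ways a unit can reduce prediction error: acting so the input moves toward the set point, or updating the set point so it moves toward the input. Setting `learning_rate` to zero models an un-updatable prior.

```python
from dataclasses import dataclass


@dataclass
class ControlUnit:
    """Hypothetical negative-feedback unit with a predicted input (set point)."""

    set_point: float            # the prediction (prior) about the input
    gain: float = 0.5           # assumed strength of corrective action
    learning_rate: float = 0.0  # 0.0 models an un-updatable prior

    def error(self, observed: float) -> float:
        """Signed prediction error: prediction minus observation."""
        return self.set_point - observed

    def step(self, observed: float) -> float:
        """Return a corrective push on the world; may also revise the prediction."""
        err = self.error(observed)
        # Perception-side fix: move the prediction toward the input.
        self.set_point -= self.learning_rate * err
        # Action-side fix: push the input toward the prediction.
        return self.gain * err


def simulate(unit: ControlUnit, world: float, steps: int):
    """Run the loop and record the absolute prediction error at each step."""
    errors = []
    for _ in range(steps):
        errors.append(abs(unit.error(world)))
        world += unit.step(world)
    return errors, world


# Fixed prior: only action can reduce error, so the world moves toward 10.
fixed = ControlUnit(set_point=10.0)
fixed_errors, fixed_world = simulate(fixed, world=0.0, steps=20)

# No action: only updating can reduce error, so the prediction moves toward 0.
learner = ControlUnit(set_point=10.0, gain=0.0, learning_rate=0.3)
learner_errors, learner_world = simulate(learner, world=0.0, steps=20)
```

With a fixed prior the error can only shrink through action; with no ability to act it can only shrink by revising the prediction. A hierarchy could be built by letting one unit's output serve as another unit's set point, which this sketch leaves out.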

A predictive coding theory of human values

My general theory of how to ground human values in minimization of prediction error is simple and straightforward:

I’ve thought about this for a while, so I have a fairly robust sense of how this works that allows me to verify it against a wide variety of situations, but I doubt I’ve conveyed that to you already. I think it will help if I give some examples of what this theory predicts in various situations, and of how those predictions account for the behavior people observe and report in themselves and others.

  • Mixed emotions/feelings are the result of a literal mix of different control systems under the same hierarchy receiving positive and negative signals as a result of producing less or more prediction error.

  • Hard-to-predict people are perceived as creepy or, stated with less nuance, bad.

  • Familiar things feel good by definition: they are easy to predict.

    • Similarly, there’s a feeling of loss (bad) when familiar things change.

  • Mental illnesses result from failures of neurons to set good/bad thresholds appropriately or to update set points at a rate that matches current rather than old circumstances, and from sensory input issues that cause either prediction error or internally correct predictions that are poorly correlated with reality (“sensory input” here broadly includes both sight, sound, smell, taste, and touch and mental inputs from long-term memory, short-term memory, and other neurons).

  • Desire and aversion are what it feels like to notice that prediction error is high and for the brain to take actions it predicts will lower it, either by something happening (seeing sensory input) or by something not happening (not seeing sensory input), respectively.

  • Good and bad feel like natural categories because they are, but ones that are the result of a brain interacting with the world rather than features of the externally observed world.

  • Etc.
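The first example above, mixed feelings, can be sketched in a few lines. This is a toy illustration under assumptions of my own choosing: the function names `valence` and `overall_feeling`, the rule that a subsystem signals positive when its prediction error falls and negative when it rises, and the example numbers are all hypothetical.

```python
def valence(prev_error: float, curr_error: float, tolerance: float = 1e-9) -> int:
    """Return +1 if prediction error fell, -1 if it rose, 0 if unchanged."""
    delta = curr_error - prev_error
    if delta < -tolerance:
        return 1
    if delta > tolerance:
        return -1
    return 0


def overall_feeling(signals: dict) -> str:
    """Summarize the signals of subsystems that share one parent."""
    values = set(signals.values())
    if 1 in values and -1 in values:
        return "mixed"
    if 1 in values:
        return "good"
    if -1 in values:
        return "bad"
    return "neutral"


# Hypothetical example: a move to a new city for a better job.
# Each entry is (previous error, current error) for one subsystem.
errors = {
    "career": (8.0, 2.0),                 # expectations now better met
    "familiar_surroundings": (1.0, 6.0),  # surroundings now harder to predict
}
signals = {name: valence(prev, curr) for name, (prev, curr) in errors.items()}
feeling = overall_feeling(signals)
```

Here one subsystem reports less error and another reports more, so the parent receives both a positive and a negative signal at once, which is the sense of “mixed” used in the list.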

Further exploration of these kinds of cases will help verify the theory by testing whether adequate and straightforward applications of it can explain various phenomena (I view it as being in a similar epistemic state to evolutionary psychology, including the threat of misleading ourselves with just-so stories). It does to some extent hinge on questions I’m not situated to evaluate experimentally myself, especially whether or not the brain actually implements hierarchical control systems of the type described, but I’m willing to move forward because even if the brain is not literally made of hierarchical control systems, the theory appears to model what the brain does well enough that whatever theory replaces it will also have to be compatible with many of its predictions. Hence I think we can use it as a provisional grounding even as we keep an eye out for ways in which it may turn out to be an abstraction that we will have to reconsider in the light of future evidence, and I expect that work we do based off of it will be amenable to translation to whatever new, more fundamental grounding we may discover in the future.

Relation to AI alignment

So that’s the theory. How does it relate to AI alignment?

First note that this theory is naturally a foundation of axiology, or the study of values, and by extension a foundation for the study of ethics, to the extent that ethics is about reasoning about how agents, each with their own (possibly identical) values, interact. This is relevant for reasons I and, more recently, Stuart Armstrong have explored:

Stuart has been exploring one approach by grounding human values in an improvement on the abstraction for human values used in inverse reinforcement learning, which I think of as a behavioral economics theory of human values. My main objection to this approach is that it is behaviorist: it appears to me to be grounded in what other agents can observe from external human behavior, and so it has to infer the internal states of agents across a large inferential gap, true values being a kind of hidden and encapsulated variable an agent learns about via observed behavior. To be fair, this has proven an extremely useful approach over the past 100 years or so in a variety of fields, but it also suffers an epistemic problem in that it requires lots of inference to determine values, and I believe this makes it a poor choice given the magnitude of the Goodharting effects we expect to be at risk of under superintelligence-level optimization.
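A toy example of the inferential gap, offered only as an illustration and not as a description of Stuart’s approach: two quite different hidden value assignments can produce exactly the same observed choices, so an observer who sees only behavior cannot tell them apart. The option names, the numbers, and the `choices` helper are all hypothetical.

```python
def choices(reward: dict, rounds: list) -> list:
    """Pick the highest-reward option in each round (a perfectly greedy agent)."""
    return [max(options, key=reward.get) for options in rounds]


# Each round offers the agent two options to choose between.
rounds = [("tea", "coffee"), ("coffee", "water"), ("tea", "water")]

# Two hidden value assignments with the same ordering but very different strengths.
mild_preferences = {"tea": 1.0, "coffee": 2.0, "water": 0.0}
strong_preferences = {"tea": 10.0, "coffee": 500.0, "water": -3.0}

observed_mild = choices(mild_preferences, rounds)
observed_strong = choices(strong_preferences, rounds)
```

Both agents choose coffee, coffee, then tea. Anything beyond the ordering of the options, such as how much more one is valued than another, has to be supplied by further assumptions on the observer’s side.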

In comparison, I view a predictive-coding-like theory of human values as offering a much better method of grounding human preferences. It is

  • parsimonious: the behavioral economics approach to human values allows comparatively complicated value specifications and requires many modifications to make it reflect a wide variety of observed human behavior, whereas this theory lets values be specified in simple terms that become complex by recursive application of the same basic mechanism;

  • requires little inference: if it is totally right, only the inference of measuring neuron activity creates room for epistemic error within the model;

  • captures internal state: true values/internal state is assessed as directly as possible rather than inferred from behavior;

  • broad: works for both rational and non-rational agents without modification;

  • flexible: even if the control theory model is wrong, the general “Bayesian brain” approach is probably right enough for us to make useful progress over what is possible with a behaviorist approach, and we could then translate work that assumes predictive coding to another, better model.

Thus I am quite excited about the possibility that a predictive coding approach may allow us to ground human values precisely enough to enable successfully aligning AI with human values.

This is a first attempt to explain what has been my “big idea” for the last year or so, now that it has finally come together enough in my head that I’m confident presenting it, so I very much welcome feedback, questions, and comments that may help us move towards a more complete evaluation and exploration of this idea.