Towards an Intentional Research Agenda

This post is motivated by research intuitions that better formalisms in consciousness research contribute to agent foundations in more ways than just the value loading problem. Epistemic status: speculative.

David Marr’s levels of analysis is the idea that any analysis of a system involves analyzing it at multiple, distinct levels of abstraction. These levels are the computational, which describes what the system is trying to do; the algorithmic, which describes which algorithms the system instantiates in order to accomplish that goal; and the implementation level, which describes the hardware or substrate on which the system is running. Each level underdetermines the other levels. You can choose lots of different algorithms for a given goal, and algorithms don’t restrict which goals can use them. A concrete example Marr uses is that you’d have a very hard time figuring out what a feather was for if you’d never seen a bird flying, and if you only saw a bird flying you might have a very difficult time coming up with something like the design of a feather.

Imagine a world that had recently invented computers. The early examples are very primitive, but people can extrapolate and see that these things will be very powerful, likely transformative for society. They’re pretty concerned about the potential for these changes to be harmful, maybe even catastrophic. Although people have done a bit of theoretical work on algorithms, it isn’t all that sophisticated. But since the stakes are high, they try their best to start figuring out what it would mean for there to be such a thing as harmful algorithms, or how to bound general-use algorithms such that they can only be used for certain things. They even make some good progress, coming up with the concept of ASICs so that they can maybe hard-code the good algorithms and make it impossible to run the bad ones. They’re still concerned that a sufficiently clever or sufficiently incentivized agent could use ASICs for bad ends somehow.

If this situation seems a bit absurd to you, it’s because you intuitively recognize that the hardware level underdetermines the algorithmic level. I want to raise the possibility that we’re making the same error now. The algorithmic level underdetermines the computational level, and no matter how many combinations of cleverly constructed algorithms you stack on themselves, you won’t be able to bound the space of possible goals in a way that gets you much more than weak guarantees. In particular, a system constructed with the right intentional formalism should actively want to avoid being Goodharted, just like a human does. Such an agent should have Knightian uncertainty and therefore also (potentially) avoid maximizing.

In physics (or the implementation level) there are notions of smallest units, and counting up the different ways these units can be combined creates the notion of thermodynamic entropy. We can also easily define distance functions. In information theory (or the algorithmic level) there are notions of bits, and counting up the different ways these bits could be set creates the notion of information-theoretic entropy. Here too we can define distance functions. I think we need to build a notion of units of intentionality (on the computational level), and measures of the different ways these units can be arranged, to give a notion of intentional (computational) entropy. Along with that would come what could turn out to be a key insight for aligning AI: a distance function between intentions.
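To make the two existing notions concrete, here is a minimal Python sketch of counting-based entropy, Shannon entropy, and two standard distance functions. The function names are my own, and the choice of Hamming and Jensen-Shannon distance is an assumption made for illustration; the post does not say which distance functions it has in mind. Nothing here attempts the intentional analogue, since no formalism for it exists yet.

```python
import math


def counting_entropy_bits(num_states):
    """Entropy of a system with num_states equally likely configurations.

    This is Boltzmann's S = k ln W with the constant dropped and the
    logarithm taken base 2, so the result is in bits.
    """
    return math.log2(num_states)


def shannon_entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (a true metric) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; terms with a_i = 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against rounding below zero


if __name__ == "__main__":
    print(counting_entropy_bits(8))
    print(shannon_entropy_bits([0.5, 0.5]))
    print(hamming_distance("1011", "1001"))
    print(jensen_shannon_distance([1.0, 0.0], [0.0, 1.0]))
```

When every configuration is equally likely, the counting entropy and the Shannon entropy coincide. That is the sense in which both arise from counting up the ways the units can be.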

In the same way that trying to build complex information processing systems without a concrete notion of information would be quite confusing, I claim that trying to build complex intentional systems without a concrete notion of intention is confusing. This may sound a bit far-fetched, but I claim that it is exactly as hard to think about as information theory was before Shannon found a formalism that worked.

I think there are already several beachheads for this problem that are suggestive:

Predictive processing (relation to smallest units of intention).
In particular, one candidate for smallest unit is the smallest unit that a given feedback circuit (like a thermostat) can actually distinguish. We humans get around this by translating from systems in which we can make fewer distinctions (like say heat) into systems in which we can make more (like say our symbolic processing of visual information in the form of numbers).

Convergent instrumental goals (structural invariants in goal systems).
In particular, I think it would be worth investigating differing intuitions about just how strong a forcing function convergent instrumental goals are. Do we expect the universes optimized by a capability-boosted Gandhi and by a capability-boosted Clippy to be 10% similar, 50%, 90% or perhaps 99.9999+% similar?

Modal logic (relation to counterfactuals, and as a semantics for the intentionality of beliefs).

Goodhart’s taxonomy begins to parameterize, and therefore to define distance functions for, divergence of intent.
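The thermostat example under predictive processing can be sketched in a few lines of Python. This is a toy model under stated assumptions: the names `distinguishable_state` and `thermostat_action` and the idea of modeling the sensor as fixed-width bins are hypothetical, chosen only to show what "the smallest unit a feedback circuit can distinguish" might mean operationally.

```python
import math


def distinguishable_state(temperature_c, resolution_c):
    """Index of the bin the (hypothetical) sensor reports for a temperature.

    Two temperatures in the same bin are indistinguishable to the circuit.
    """
    return math.floor(temperature_c / resolution_c)


def thermostat_action(temperature_c, setpoint_c, resolution_c):
    """Bang-bang control that only sees temperatures at sensor resolution."""
    reading = distinguishable_state(temperature_c, resolution_c)
    target = distinguishable_state(setpoint_c, resolution_c)
    if reading < target:
        return "heat"
    if reading > target:
        return "cool"
    return "idle"


if __name__ == "__main__":
    # At 1 degree resolution, 20.2 and 20.7 are the same state to the circuit.
    print(distinguishable_state(20.2, 1.0) == distinguishable_state(20.7, 1.0))
    # At 0.5 degree resolution, the circuit can tell them apart.
    print(distinguishable_state(20.2, 0.5) == distinguishable_state(20.7, 0.5))
    print(thermostat_action(18.0, 20.0, 1.0))
```

Moving to a finer resolution is the toy version of translating into a system where more distinctions are available.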

Some other questions:

How do simple intentions get combined to form more complex intentions? I think this is tractable via experimentation with simple circuits. This could also suggest approaches to pre-rationality via explaining (rigorously) how complex priors arise from homeostatic priors.

In Buddhism, intention is considered synonymous with consciousness, while in the West this is considered a contentious claim. What simple facts, if known, would collapse the seeming complexity here?

Can we consider intentions as a query language? If so, what useful ideas or results can we port over from database science? Is the apparent complexity of human values a side effect of the dimensionality of the space more so than the degree of resolution on any particular dimension?


When I read vague posts like this myself, I sometimes have vague objections but don’t write them up, both because of the effort needed to bridge the inferential distance to the author and because of the sense that the author will interpret attempts to bridge that distance as harsher criticism than I intend. Please feel free to give half-formed criticism and leave me to fill in the blanks. It might poke my own half-formed thoughts in this area in an interesting way.