Some Comments on Stuart Armstrong’s “Research Agenda v0.9”

Subject matter: Stuart Armstrong’s “Research Agenda v0.9” post.

I: Intro

I am extremely sympathetic to the program of AI safety by understanding value learning. Because of that sympathy, I have more thoughts than average prompted by Stuart Armstrong’s post along those same lines.

Stuart’s post mostly deals with “partial preferences,” which are like simple statements of binary preference (A is better than B), but associated with a context—supposedly the “human’s model” the human was using when they exhibited or stated that preference. Then the post says that you should sort these partial preferences according to meta-levels and aggregate them from the top down, updating your procedure after you finish each meta-level, eventually producing a utility function over world-histories.
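
To make that structure concrete, here is a toy sketch of my own (none of the names or the aggregation rule are Stuart’s, and the “update your procedure” step in particular is just a placeholder):

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PartialPreference:
    """Toy stand-in for a partial preference: a binary comparison tagged
    with the human model it was expressed in and its meta-level."""
    better: str          # outcome judged better, in the human's own terms
    worse: str           # outcome judged worse
    context: str         # label for the internal model the human was using
    meta_level: int      # higher = more "meta" (preferences about preferences)
    weight: float = 1.0  # how strongly the preference was held

def aggregate(prefs: List[PartialPreference]) -> Dict[str, float]:
    """Naive top-down aggregation: handle meta-levels from highest to lowest,
    letting each finished level adjust how the next one is scored, and return
    a crude utility over the outcomes mentioned in the preferences."""
    utility: Dict[str, float] = defaultdict(float)
    # Placeholder "procedure": how much a single preference moves the utility.
    procedure: Callable[[PartialPreference], float] = lambda p: p.weight

    by_level: Dict[int, List[PartialPreference]] = defaultdict(list)
    for p in prefs:
        by_level[p.meta_level].append(p)

    for level in sorted(by_level, reverse=True):  # most meta first
        for p in by_level[level]:
            delta = procedure(p)
            utility[p.better] += delta
            utility[p.worse] -= delta
        # "Update your procedure after you finish each meta-level": here the
        # update is a stand-in, and lower levels simply count for less.
        previous = procedure
        procedure = lambda p, previous=previous: 0.5 * previous(p)

    return dict(utility)
```

The real proposal’s procedure update is of course much richer than down-weighting lower levels by half; the sketch is only meant to show where the moving parts sit.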

Broadly, I’d say that my opinion is sort of like the bitter lesson. The bitter lesson in, say, image recognition, is that people wanted to do image recognition with a bunch of human-designed features and formal reasoning and human-understandable internal moving parts, and they tried that for a long time, and what worked was using way bigger models, way more computing power, far fewer human-understandable internal parts, and almost no human-designed features.

I like Stuart’s outline more than most value learning proposals. But it still strikes me as primarily a list of human-designed features and human-understandable internal moving parts. We might be better off throwing away some of the details and abstracting in a way that allows for some of these problems to be solved by big models and computing power.

It’s like the just-so story about ResNets, which is that they’re a fix for humans thinking the insides of neural nets should look too much like human logic[^1]. I think speculating about the human-sized logical relationships between speculative parts inside the AI is easier but less useful than speculating about the algorithm that will connect your inputs to your outputs with a big model and lots of computing power, which may or may not have your logical steps as emergent features.

II: A long analogy about dams

If you want to design a dam, you don’t draw the blueprint of the dam first and figure out what materials it should be made of later—first you learn a lot about hydrology and materials science so you know how steel and concrete and earth and water interact, then you draw the high-level design, then you fill in the details that weren’t dictated either by physics or by your design goals. I’m claiming that we don’t yet know much about the steel and water of value learning.

Here’s a long digression as an example. Suppose you’re trying to work out how to model human values the way humans do, even given lots of computing power and data. If you want to locate values within a model of humans, you can’t just train the model for predictive power, because human values only appear in a narrow zone of abstraction, more abstract than biology and less abstract than population statistics, and an AI scored only on prediction will be pressured to go to a lower level of abstraction.

If you train an AI on a shared input of sensory data and a text channel from humans, will it learn a shared model of the world and the text channel that effectively solves the symbol grounding problem? Can you then activate desired concepts through the text channel, “cheating” a solution to lots of value learning problems?

No. Consider what happens in the limit of lots of resources, particularly if we are training this model for predictive power—it will be pressured towards a lower level of abstraction. Once it starts encoding the world differently than we do, it won’t have the generalization properties we want—we’d be caught cheating, as it were. And if we could solve the training problem for verbal models, it seems like we could just solve the training problem to learn the concepts we want to learn. But maybe there’s still some way to “cheat” in practice.
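
For concreteness, here is the deliberately naive version of the setup I have in mind, as a sketch rather than a proposal; the sizes, the names, and the tiny linear model itself are all illustrative assumptions:

```python
import numpy as np

# One shared model trained only to predict its next inputs, where the inputs
# are a sensory vector concatenated with an embedded text channel.
rng = np.random.default_rng(0)
SENSE_DIM, TEXT_DIM, HIDDEN = 32, 16, 64
W_in = rng.normal(scale=0.1, size=(SENSE_DIM + TEXT_DIM, HIDDEN))
W_out = rng.normal(scale=0.1, size=(HIDDEN, SENSE_DIM + TEXT_DIM))

def step(sense, text):
    """Encode both channels into one shared hidden state, then predict the
    next (sense, text) observation from that shared state."""
    x = np.concatenate([sense, text])
    hidden = np.tanh(x @ W_in)   # the hoped-for "shared model of the world"
    pred = hidden @ W_out        # scored purely on predictive power
    return hidden, pred[:SENSE_DIM], pred[SENSE_DIM:]

# The "cheat" would be: nudge the text channel toward a concept and hope the
# directions that move in `hidden` are that concept, carved the human way.
sense, text = rng.normal(size=SENSE_DIM), rng.normal(size=TEXT_DIM)
h_plain, _, _ = step(sense, text)
h_nudged, _, _ = step(sense, text + 0.1 * rng.normal(size=TEXT_DIM))
concept_direction = h_nudged - h_plain  # nothing forces this to stay human-shaped
```

Nothing in the prediction-only objective forces the shared hidden state to keep carving the world at human joints as the model gets bigger, which is exactly the failure mode described above.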

Another way to think of this problem is as making an “artificial intentional stance.” But we have to remember that the intentional stance is not just a single model (and definitely not the assumption that humans are like homo economicus). It’s a family of strategies used to learn about humans, model humans, and model interacting with humans. Stances aren’t just an assumption about how to model one thing within a fixed model of the world, they’re part of complete languages for talking about the world.

I want to know how to design an AI that not only develops approximate ways of understanding the world, but matches some of those ways of understanding to what it sees humans use. But even to do this, we don’t really know how to talk in a principled way about what it is that it’s supposed to be matching. So we’ve got to think about that.

This is an example of the sort of consideration that I think is urgent and interesting—and you can’t always leave it as a detail to be filled in later, because depending on the base materials, the best design might be quite different.

III: Miscellaneous specific comments

Now some more specific comments about the proposal.

- How much of the hidden detail is in eliciting partial preferences? I’ve sort of been implying that it’s a lot. Does it require a general artificial intentional stance to extract not just binary preferences but also the model the human is using to express those preferences?

- How much of the hidden detail is in doing meta-reasoning? If I don’t trust an AI, more steps of meta-reasoning make me trust it even less—humans often say things about meta-reasoning that would be disastrous if implemented. What kind of amazing faculties would be required for an AI to extract partial preferences about meta-reasoning that actually made things better rather than worse? If I were better at understanding what the details actually are, maybe I’d pick on meta-reasoning more.

I do agree that the meta-reasoning step is necessary for this scheme, but I think that’s because this scheme doesn’t involve the AI building an explicit model of humans to provide consistency—it’s repeatedly outsourcing the modeling job to amnesiac single-shot modules. If humans were reliable sources about meta-reasoning principles for combining binary preferences, this would work great, but since they aren’t, it won’t—a low-level practical concern dictating higher-level design.

- The job of the “symbol grounding module” seems to be to take the partial preferences inside their contextual models and translate them into full preferences in the AI’s native ontology. This seems like it requires the AI to have a really trustworthy grasp on the intentional stance and its variations—maybe I should imagine this as coming from the same process that originates those contextual models for partial preferences in the first place. This is a bit different from the symbol grounding I normally think about (grounding of internal symbols by their causal relationship to reality), but I agree it’s an important part of the artificial intentional stance. (A minimal interface sketch follows below.)
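
Here is the promised interface sketch: a minimal, purely illustrative picture of what the symbol grounding module has to consume and produce. All types and names are mine rather than Stuart’s, and the interesting work is hidden inside the hypothetical interpret_context function.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ContextualPreference:
    """A comparison expressed inside one of the human's contextual models."""
    better: str   # outcome description in the human's terms
    worse: str
    context: str  # which human internal model the comparison lives in

@dataclass
class GroundedPreference:
    """The same comparison translated into the AI's native ontology."""
    better_states: List[str]  # states/histories the AI takes "better" to mean
    worse_states: List[str]
    confidence: float         # how sure we are the translation held up

# The module is essentially a translation function between ontologies.
GroundingModule = Callable[[ContextualPreference], GroundedPreference]

def naive_grounding(
    interpret_context: Callable[[str, str], List[str]],
) -> GroundingModule:
    """Build a grounding module from an interpreter that maps (context, phrase)
    to the AI-ontology states it judges the human to have meant."""
    def ground(pref: ContextualPreference) -> GroundedPreference:
        return GroundedPreference(
            better_states=interpret_context(pref.context, pref.better),
            worse_states=interpret_context(pref.context, pref.worse),
            confidence=0.5,  # placeholder; calibrating this is most of the problem
        )
    return ground

# Dummy usage: an "interpreter" that just tags the phrase with its context.
ground = naive_grounding(lambda context, phrase: [f"{context}::{phrase}"])
print(ground(ContextualPreference("more free time", "more overtime", "career model")))
```

Writing it this way makes the point of the bullet above explicit: everything rides on how trustworthy interpret_context is, which is to say, on the artificial intentional stance.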



[^1]: The story goes something like this: When people first thought of neural networks, they thought of each neuron as if it were a logical node making a human-sized step in reasoning. And so they optimized the initialization of weights and the nonlinearity for each individual neuron functioning like a discriminator. But after many years of work, people realized that the “neurons are doing human-sized logical steps” model wasn’t the best, and a better picture is that the neural network is massaging the input manifold around in a higher-dimensional space until eventually the input space gets transformed into something that’s easy to classify. And so people developed ResNets that were specialized for this gradual massaging of the input into the output, and they worked great.
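
For concreteness on the footnote’s picture, here is a minimal residual block in plain numpy (illustrative only; the sizes and initialization are arbitrary). Each block computes a small update F(x) and adds it back onto its input, so the network nudges the representation step by step instead of replacing it wholesale.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

def make_block():
    """One residual block: a small learned update F(x) added back onto x."""
    W1 = rng.normal(scale=0.1, size=(DIM, DIM))
    W2 = rng.normal(scale=0.1, size=(DIM, DIM))
    def block(x):
        return x + np.maximum(x @ W1, 0.0) @ W2  # x + F(x): the residual step
    return block

blocks = [make_block() for _ in range(10)]
x = rng.normal(size=DIM)
for block in blocks:
    x = block(x)  # many small transformations of the input, not wholesale rewrites
```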