Notes from a conversation on act-based and goal-directed systems

I had a conversation with Nate about different possible goal systems for agents, and I think some people will be interested in reading this summary.

Goal specification

I started by stating my skepticism about approaches to goal specification that rely on inspecting an AI’s world model and identifying some concept in it (e.g. paperclips) to specify the goal in terms of. To me, this seems fairly doomed: it is difficult to imagine a kind of language for describing concepts, such that I could specify some concept I cared about (e.g. paperclips) in this language, and I could trust a system to correctly carry out a goal specified in terms of this concept. Even if we had a nicer theory of multi-level models, it still seems unlikely that this theory would match human concepts well enough that it would be possible to specify things we care about in this theory. See also Paul’s comment on this subject and his post on unsupervised learning.

Nate responded that it seems like humans can learn a concept from fairly few examples. To the extent that we expect AIs to learn “natural categories”, and we expect to be able to point at natural categories with a few examples or views of the concept, this might work.

Nate argued that corrigibility might be a natural concept, and one that is useful for specifying some proxy for what we care about. This is partially due to introspection on the concept of corrigibility (“knowing that you’re flawed and that the goal you were given is not an accurate reflection of your purpose”), and partially due to the fact that superintelligences might want to build corrigible subagents.

This didn’t seem completely implausible to me, but it didn’t seem very likely that this would end up saving the goal-directed approach. Then we started getting into the details of alternative proposals that specify goals in terms of short-term predictions (specifically, human-imitation and other act-based approaches).

I argued that there’s an important advantage to systems whose goals are grounded in short-term predictions: you can use a scheme like this to do something useful if you have a mixture of good and bad predictors, by testing these predictors against reality. There is no analogous way of testing e.g. good and bad paperclip concepts against reality, to see which one actually represents paperclips. Nate agreed that this is an advantage for grounding goals in prediction. In particular, he agreed that specifying goals in terms of human predictions will likely be the best idea for the first powerful AGIs, although he’s less pessimistic than me about other approaches.
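To make the asymmetry concrete, here is a minimal sketch (in Python, with an interface I am making up purely for illustration) of what “testing predictors against reality” can look like: candidate predictors, good and bad, are scored on the observations actually seen so far, and the one with the best track record is kept. There is no analogous scoring procedure for candidate paperclip concepts.

```python
from typing import Callable, Sequence

# A "predictor" maps an observed history to a predicted next observation
# (both represented as floats, to keep the sketch small).
Predictor = Callable[[Sequence[float]], float]

def best_predictor(predictors: Sequence[Predictor],
                   history: Sequence[float]) -> Predictor:
    """Return the candidate with the lowest cumulative squared error on the
    observations seen so far. Good and bad predictors can be mixed freely;
    the bad ones are filtered out by their track record, not by inspecting
    their internals."""
    def cumulative_error(p: Predictor) -> float:
        error = 0.0
        for t in range(1, len(history)):
            error += (p(history[:t]) - history[t]) ** 2
        return error
    return min(predictors, key=cumulative_error)
```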

Nate pointed out some problems with systems based on powerful predictors. If a predictor can predict a system containing consequentialists (e.g. a human in a room), then it is using some kind of consequentialist machinery internally to make these predictions. For example, it might be modelling the human as an approximately rational consequentialist agent. This presents some problems. If the predictor simulates consequentialist agents in enough detail, then these agents might try to break out of the system. Presumably, we would want to know that these consequentialists are safe. It’s possible that the scheme for handling predictors works for preventing these consequentialists from gaining much power, but a “defense in depth” approach would involve understanding these consequentialists better. Additionally, the fact that the predictor uses consequentialist reasoning indicates that you probably need to understand consequentialist reasoning to build the predictor in the first place.

In particular, at least one of the consequentialists in the world model must represent a human for the predictor to make accurate predictions of humans. It’s substantially easier to specify a class of models that contains a good approximation to a human (which might be all you need for human-prediction approaches) than to specify a good approximation to a human, but it still seems difficult either way. It’s possible that a better understanding of consequentialism will lead to better models for human-prediction (although at the moment, this seems like a fairly weak reason to study consequentialism to me).

Logic optimizers

We also talked about the idea of a “logic optimizer”. This is a hypothetical agent that is given a description of the environment it is in (as a computer program) and optimizes this environment according to some easily-defined objective (similar to modal UDT). One target might be a “naturalized AIXI”, which in some sense does this job almost as well as any simple Turing machine. This should be an asymptotic solution that works well in environments larger than the agent, as both the agent and the environment become very large.
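As a toy illustration only (this sketch brute-forces over the agent’s possible outputs and treats the agent as standing outside the environment, which sidesteps exactly the naturalized/embedded difficulty that makes the real problem hard; all names here are mine):

```python
from itertools import product
from typing import Callable, Tuple

# The environment is handed to the agent as a program: it takes the agent's
# output bits, runs, and returns a final state. The objective is some
# easily-defined function of that final state.
State = Tuple[int, ...]
Environment = Callable[[Tuple[int, ...]], State]
Objective = Callable[[State], float]

def toy_logic_optimizer(environment: Environment,
                        objective: Objective,
                        n_output_bits: int) -> Tuple[int, ...]:
    """Brute-force over every possible output, run the environment program
    on each, and return the output whose resulting state scores highest.
    This assumes the agent can re-run the environment from the outside,
    which is precisely what a naturalized account has to do without."""
    candidates = product((0, 1), repeat=n_output_bits)
    return max(candidates, key=lambda out: objective(environment(out)))
```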

I was skeptical that this research path gives us what we want. The things we actually care about can’t be expressed easily in terms of physics or logic. Nate predicted that, if he understood how to build a naturalized AIXI, then this would make some other things less confusing. He would have more ideas for what to do after finding this: perhaps making the system more efficient, or extending it to optimize higher-level aspects of physics/logic.

It seems to me that the place where you would actually use a logic optimizer is not to optimize real-world physics, but to optimize the internal organization of the AI. Since the AI’s internal organization is defined as a computer program, it is fairly easy to specify goals related to the internal organization in a format suitable for a logic optimizer (e.g. specifying the goal of maximizing a given mathematical function). This seems identical to the idea of “platonic goals”. It’s possible that the insights from understanding logic optimizers might generalize to more real-world goals, but I find internal organization to be the most compelling concrete application.

Paul has also written about using consequentialism for the internal organization of an AI system. He argues that, when you’re using consequentialism to e.g. optimize a mathematical function, even very bad theoretical targets for what this means seem fine. I partially agree with this: it seems like there is much more error tolerance for badly optimizing a mathematical function, versus badly optimizing the universe. In particular, if you have a set of function optimizers that contains a good function optimizer, then you can easily combine these function optimizers into a single good function optimizer (just take the argmax over their outputs). The main danger is if all of your best “function optimizers” actually care about the real world, because you didn’t know how to build one that only cares about the internal objective.
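A minimal sketch of that combination step, under the assumption that each candidate optimizer simply proposes an input to the objective function (the interface is illustrative, not anything from Paul’s writeup):

```python
from typing import Callable, List, Sequence

# Each candidate "function optimizer" takes an objective and proposes an
# input; we don't trust any single candidate to be good.
Objective = Callable[[float], float]
FunctionOptimizer = Callable[[Objective], float]

def combine_optimizers(optimizers: Sequence[FunctionOptimizer],
                       objective: Objective) -> float:
    """Run every candidate on the objective and keep whichever proposal the
    objective itself scores best. If at least one candidate is good (and all
    of them only care about this internal objective, not the outside world),
    the combination is at least as good as the best candidate."""
    proposals: List[float] = [opt(objective) for opt in optimizers]
    return max(proposals, key=objective)
```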

Paul is skeptical that a better theoretical formulation of rational agency would actually help to design more effective and understandable internal optimizers (e.g. function optimizers). It seems likely that we’ll be stuck with analyzing the algorithms that end up working, rather than designing algorithms according to theoretical targets.

I talked to Nate about this and he was more optimistic about getting useful internal optimizers if we know how to solve logic optimization problems using a hypercomputer (in an asymptotic way that works when the agent is smaller than the environment). He was skeptical about ways of “solving” the problem without being able to accomplish this seemingly easier goal.

I’m not sure what to think about how useful theory is. The most obvious parallel is to look at formalisms like Solomonoff induction and AIXI, and see if those have helped to make current machine learning systems more principled. I don’t have a great idea of what most important AI researchers think of AIXI, but I think it’s helped me to understand what some machine learning systems are actually doing. Some of the people who worked with these theoretical formalisms (Juergen Schmidhuber, Shane Legg, perhaps others?) went on to make advances in deep learning, which seems like an example of using a principled theory to understand a less-principled algorithm better. It’s important to disentangle “understanding AIXI helped these people make deep learning advances” from “more competent researchers are more drawn to AIXI”, but I would still guess that studying AIXI helped them. Another problem with this analogy is that, if naturalized AIXI is the right paradigm in a way that AIXI isn’t, then it is more likely to yield practical algorithms than AIXI is.

Roughly, if naturalized AIXI is a comparable theoretical advance to Solomonoff induction/AIXI (which seems likely), then I am somewhat optimistic about it making future AI systems more principled.

Conclusion and research priorities

My concrete takeaways are:

  1. Specifying real-world goals in a way that doesn’t reduce to short-term human prediction doesn’t seem promising for now. New insights might make this problem look easier, but this doesn’t seem very likely to me.

  2. To the extent that we expect powerful systems to need to use consequentialist reasoning to organize their internals, and to the extent that we can make theoretical progress on the problem, it seems worth working on a “naturalized AIXI”. It looks like a long shot, but it seems reasonable to at least gather information about how easy it is for us to make progress on it by trying to solve it.

In the near future, I think I’ll split my time between (a) work related to act-based systems (roughly following Paul’s recommended research agenda), and (b) work related to logic optimizers, with emphasis on using these for the internal organization of the AI (rather than goals related to the real world). Possibly, some work will be relevant to both of these projects. I’ll probably change my research priorities if any of a few things happens:

  1. goals related to the external world start seeming less doomed to me

  2. the act-based approach starts seeming more doomed to me

  3. the “naturalized AIXI” approach starts seeming more or less useful/tractable

  4. I find useful things to do that don’t seem relevant to either of these two projects