Selection vs Control


This is something which has bothered me for a while, but I’m writing it specifically in response to the recent post on mesa-optimizers.

I feel strongly that the notion of ‘optimization process’ or ‘optimizer’ which people use—partly derived from Eliezer’s notion in the sequences—should be split into two clusters. I call these two clusters ‘selection’ vs ‘control’. I don’t have precise formal statements of the distinction I’m pointing at; I’ll give several examples.

Before going into it, here are several reasons why this sort of thing may be important:

  • It could help refine the discussion of mesa-optimization. The article restricted its discussion to the type of optimization I’ll call ‘selection’, explicitly ruling out ‘control’. This choice isn’t obviously right. (More on this later.)

  • Refining ‘agency-like’ concepts like this seems important for embedded agency—what we eventually want is a story about how agents can be in the world. I think almost any discussion of the relationship between agency and optimization which isn’t aware of the distinction I’m drawing here (at least as a hypothesis) will be confused.

  • Generally, I feel like I see people making mistakes by not distinguishing between the two. I judge an algorithm differently if it is intended as one or the other.

(See also Stuart Armstrong’s summary of other problems with the notion of optimization power Eliezer proposed—those are unrelated to my discussion here, and strike me more as technical issues which call for refined formulae, rather than conceptual problems which call for revised ontology.)

The Basic Idea

Eliezer quantified optimization power by asking how small a target an optimization process hits, out of a space of possibilities. The type of ‘space of possibilities’ is what I want to poke at here.

Selection

First, consider a typical optimization algorithm, such as simulated annealing. The algorithm constructs an element of the search space (such as a specific combination of weights for a neural network), gets feedback on how good that element is, and then tries again. Over many iterations of this process, it finds better and better elements. Eventually, it outputs a single choice.

This is the prototypical ‘selection process’—it can directly instantiate any element of the search space (although typically we consider cases where the process doesn’t have time to instantiate all of them), it gets direct feedback on the quality of each element (although evaluation may be costly, so that the selection process must economize these evaluations), the quality of an element of the search space does not depend on previous choices, and only the final output matters.
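
To make this concrete, here is a minimal sketch of a simulated-annealing-style selection process. (The callables evaluate, random_candidate, and neighbor are hypothetical stand-ins for a particular problem, not any specific library API.)

```python
import math
import random

def simulated_annealing(evaluate, random_candidate, neighbor,
                        steps=10_000, temp=1.0, cooling=0.999):
    """A selection process: instantiate candidates freely, get direct feedback
    on each one, and in the end output only the single best element found."""
    current = random_candidate()
    current_score = evaluate(current)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = neighbor(current)
        score = evaluate(candidate)  # direct (if costly) feedback on this element
        # accept improvements always, and worse candidates with some probability
        if score > current_score or random.random() < math.exp((score - current_score) / temp):
            current, current_score = candidate, score
        if score > best_score:
            best, best_score = candidate, score
        temp *= cooling  # cool down: accept fewer downhill moves over time
    return best  # only the final output matters

# Toy usage: maximize a bumpy one-dimensional function.
result = simulated_annealing(
    evaluate=lambda x: -(x - 3) ** 2 + math.sin(5 * x),
    random_candidate=lambda: random.uniform(-10, 10),
    neighbor=lambda x: x + random.gauss(0, 0.5),
)
```

Note that trying bad candidates along the way costs nothing toward the final result, which is part of what makes this a selection process.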

The term ‘selection process’ refers to the fact that this type of optimization selects between a number of explicitly given possibilities. The most basic example of this phenomenon is a ‘filter’ which rejects some elements and accepts others—like selection bias in statistics. This has a limited ability to optimize, however, because it allows only one iteration. Natural selection is an example of much more powerful optimization occurring through iteration of selection effects.

Control

Now, consider a targeting system on a rocket—let’s say, a heat-seeking missile. The missile has sensors and actuators. It gets feedback from its sensors, and must somehow use this information to decide how to use its actuators. This is my prototypical control process. (The term ‘control process’ is supposed to invoke control theory.) Unlike a selection process, a controller can only instantiate one element of the space of possibilities. It gets to traverse exactly one path. The ‘small target’ which it hits is therefore ‘small’ with respect to a space of counterfactual possibilities, with all the technical problems of evaluating counterfactuals. We only get full feedback on one outcome (although we usually consider cases where the partial feedback we get along the way gives us a lot of information about how to navigate toward better outcomes). Every decision we make along the way matters, both in terms of influencing total utility, and in terms of influencing what possibilities we have access to in subsequent decisions.
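
As a minimal sketch of the contrast (a toy setup assumed purely for illustration, not real missile guidance), consider a one-dimensional “heat-seeker” steered by proportional feedback:

```python
# A toy control process: the "missile" traverses exactly one trajectory.
# There is no enumeration of alternatives and no retry; each action has
# real consequences and shapes which states are reachable afterward.
position, target, gain = 0.0, 10.0, 0.5
for t in range(50):
    error = target - position   # sensor reading (perfect here, noisy in reality)
    position += gain * error    # actuator: steer partway toward the target
    # no backtracking: whatever state this step produced is the state we live with
```

Judging how well this did means comparing the single path it actually took against counterfactual paths it might have taken, which is exactly the subjective element discussed next.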

So: in evaluating the optimization power of a selection process, we have a fairly objective situation on our hands: the space of possibilities is explicitly given; the utility function is explicitly given; we can compare the true output of the system to a randomly chosen element. In evaluating the optimization power of a control process, we have a very subjective situation on our hands: the controller only truly takes one path, so any judgement about a space of possibilities requires us to define counterfactuals; it is less clear how to define an un-optimized baseline; utility need not be explicitly represented in the controller, so it may have to be inferred (or we think of it as a parameter, so that we can measure optimization power with respect to different utility functions, but there’s no ‘correct’ one to measure).

I do think both of these concepts are meaningful. I don’t want to restrict ‘optimization’ to refer to only one or the other, as the mesa-optimization essay does. However, I think the two concepts are of a very different type.

Bottlecaps & Thermostats

The mesa-optimizer write-up made the decision to focus on what I call selection processes, excluding control processes:

We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. [...] For example, a bottle cap causes water to be held inside the bottle, but it is not optimizing for that outcome since it is not running any sort of optimization algorithm.(1) Rather, bottle caps have been optimized to keep water in place.

It makes sense to say that we aren’t worried about bottlecaps when we think about the inner alignment problem. However, this also excludes much more powerful ‘optimizers’—something more like a plant.

When does a powerful control process become an ‘agent’?

  • Bottlecaps: No meaningful actuators or sensors. Essentially inanimate. Does a particular job, possibly very well, but in a very predictable manner.

  • Thermostats: Implements a negative feedback loop via a sensor, an actuator, and a policy of “correcting” things when sense-data indicates they are “off”. Actual thermostats explicitly represent the target temperature, but one can imagine things in this cluster which wouldn’t—in general, the connection between what is sensed and how things are ‘corrected’ can be quite complex (involving many different sensors and actuators), so that no one place in the system explicitly represents the ‘target’.

  • Plants: Plants are like very complex thermostats. They have no apparent ‘target’ explicitly represented, but can clearly be thought of as relatively agentic, achieving complicated goals in complicated environments.

  • Guided Missiles: These are also mostly in the ‘thermostat’ category, but guided missiles can use simple world-models (to track the location of the target). However, any ‘planning’ is likely based on explicit formulae rather than any search. (I’m not sure about actual guided missiles.) If so, a guided missile would still not be a selection process, and would therefore lack a “goal” in the mesa-optimizer sense, despite having a world-model and explicitly reasoning about how to achieve an objective represented within that world-model.

  • Chess Programs: A chess-playing program has to play each game well, and every move is significant to this goal. So, it is a control process. However, AI chess algorithms are based on explicit search. Many, many moves are considered, and each move is evaluated independently. This is a common pattern. The best way we know how to implement very powerful controllers is to use search inside (implementing a control process using a selection process; a minimal sketch follows this list). At that point, a controller seems clearly ‘agent-like’, and falls within the definition of optimizer used in the mesa-optimization post. However, it seems to me that things become ‘agent-like’ somewhere before this stage.
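
Here is a minimal sketch of that “search inside a controller” pattern: a depth-limited minimax, which selects among many hypothetical moves internally but commits to only one real move per turn. (The callables moves, apply_move, and heuristic are assumed game-specific helpers.)

```python
def minimax(state, depth, maximizing, moves, apply_move, heuristic):
    """Selection inside a controller: evaluate many hypothetical moves,
    then commit to a single real move in the actual game."""
    legal = moves(state)
    if depth == 0 or not legal:
        return heuristic(state), None  # approximate evaluation at the frontier
    best_value = float("-inf") if maximizing else float("inf")
    best_move = None
    for m in legal:
        value, _ = minimax(apply_move(state, m), depth - 1,
                           not maximizing, moves, apply_move, heuristic)
        if (maximizing and value > best_value) or (not maximizing and value < best_value):
            best_value, best_move = value, m
    return best_value, best_move
```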

(See also: adaptation-executers, not fitness maximizers.)

I don’t want to frame it as if there’s “one true distinction” which we should be making, which I’m claiming the mesa-optimization write-up got wrong. Rather, we should pay attention to the different distinctions we might make, studying the phenomena separately and considering the alignment/safety implications of each.

This is closely related to the discussion of upstream daemons vs downstream daemons. A downstream-daemon seems more likely to be an optimizer in the sense of the mesa-optimization write-up; it is explicitly planning, which may involve search. These are more likely to raise concerns through explicitly reasoned-out treacherous turns. An upstream-daemon could use explicit planning, but it could also be only a bottlecap/thermostat/plant. It might powerfully optimize for something in the controller sense without internally using selection. This might produce severe misalignment, but not through explicitly planned treacherous turns. (Caveat: we don’t understand mesa-optimizers; an understanding sufficient to make statements such as these with confidence would be a significant step forward.)

It seems possible that one could invent a measure of “control power” which would rate highly-optimized-but-inanimate objects like bottlecaps very low, while giving a high score to thermostat-like objects which set up complicated negative feedback loops (even if they didn’t use any search).

Processes Within Processes

I already mentioned the idea that the best way we know how to implement powerful control processes is through powerful selection (search) inside of the controller.

To elaborate a bit on that: a controller with a search inside would typically have some kind of model of the environment, which it uses by searching for good actions/plans/policies for achieving its goals. So, measuring the optimization power as a controller, we look at how successful it is at achieving its goals in the real environment. Measuring the optimization power as a selector, we look at how good it is at choosing high-value options within its world-model. The search can only do as well as its model can tell it; however, in some sense, the agent is ultimately judged by the true consequences of its actions.
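
A minimal sketch of that split, with the map/territory boundary made explicit (model_step, real_step, score, and candidate_plans are hypothetical placeholders):

```python
def plan_in_model(model_step, score, candidate_plans, initial_state):
    """Selection side: search over plans inside the agent's *model* of the world."""
    def simulated_value(plan):
        state = initial_state
        for action in plan:
            state = model_step(state, action)  # the map, not the territory
        return score(state)
    return max(candidate_plans, key=simulated_value)

def execute_in_world(real_step, plan, real_state):
    """Control side: run the chosen plan once in the territory; success depends
    on how well the model tracked reality."""
    for action in plan:
        real_state = real_step(real_state, action)  # real, irreversible consequences
    return real_state
```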

IE, in this case, the selection vs control distinction is a map/territory distinction. I think this is part of why I get so annoyed at things which mix up selection and control: it looks like a map/territory error to me.

However, this is not the only way selection and control commonly relate to each other.

Effective controllers are very often designed through a search process. This might again be search taking place within a model (for example, training a neural network to control a robot, but getting its gradients from a physics simulation so that you can generate a large number of training samples relatively cheaply), or in the real world (evolution by natural selection, “evaluating” genetic code by seeing what survives).
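
A minimal sketch of a selection process designing a controller (simulate_return is an assumed stand-in for, say, cheap rollouts in a physics simulator):

```python
import random

def search_for_controller(simulate_return, dim=4, iterations=500, noise=0.1):
    """Selection designing a controller: candidate parameter vectors are tried
    out in simulation, and only the best-performing controller is kept."""
    best = [0.0] * dim
    best_return = simulate_return(best)
    for _ in range(iterations):
        candidate = [w + random.gauss(0, noise) for w in best]
        candidate_return = simulate_return(candidate)  # cheap simulated evaluation
        if candidate_return > best_return:
            best, best_return = candidate, candidate_return
    return best  # this controller is what later faces the real environment
```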

Further complicating things, a powerful search algorithm generally has some “smarts” to it, ie, it is good at choosing what option to evaluate next based on the current state of things. This “smarts” is controller-style smarts: every choice matters (because every evaluation costs processing power), there’s no back-tracking, and you have to hit a narrow target in one shot. (Whatever the target of the underlying search problem, the target of the search-controller is: find that target, quickly.) And, of course, it is possible that such a search-controller will even use a model of the fitness landscape, and plan its next choice via its own search!

(I’m not making this up as a weird hypothetical; actual algorithms such as estimation-of-distribution algorithms will make models of the fitness landscape. For obvious reasons, searching for good points in such models is usually avoided; however, in cases where evaluation of points is expensive enough, it may be worth it to explicitly plan out test-points which will reveal the most information about the fitness landscape, so that the best point can be selected later.)
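
For a flavor of what such algorithms do, here is a minimal sketch in the spirit of the simplest estimation-of-distribution methods (essentially the cross-entropy method with independent Gaussians; evaluate is an assumed objective function):

```python
import random
import statistics

def eda_maximize(evaluate, dim=5, population=100, elite_frac=0.2, generations=30):
    """Maintain a simple model of where good points live, sample from it,
    and refit the model to the best samples each generation."""
    means = [0.0] * dim
    stds = [1.0] * dim
    for _ in range(generations):
        samples = [[random.gauss(m, s) for m, s in zip(means, stds)]
                   for _ in range(population)]
        samples.sort(key=evaluate, reverse=True)
        elite = samples[: int(population * elite_frac)]
        for i in range(dim):
            column = [point[i] for point in elite]
            means[i] = statistics.mean(column)
            stds[i] = statistics.pstdev(column) + 1e-3  # keep a little exploration
    return means
```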

Blurring the Lines: What’s the Critical Distinction?

I mentioned earlier that this dichotomy seems more like a conceptual cluster than a fully formal distinction. I mentioned a number of big differences which stick out at me. Let’s consider some of these in more detail.

Perfect Feedback

The classical sort of search algorithm I described as my central example of a selection process includes the ability to get a perfect evaluation of any option. The difficulty arises only from the very large number of options available. Control processes, on the other hand, appear to have very bad feedback, since you can’t know the full outcome until it is too late to do anything about it. Can we use this as our definition?

I would agree that a search process in which the cost of evaluation goes to infinity becomes purely a control process: you can’t perform any filtering of possibilities based on evaluation, so you have to output one possibility and try to make it a good one (with no guarantees). Maybe you get some information about the objective function (like its source code), and you have to try to use that to choose an option. That’s your sensors and actuators. They have to be very clever to achieve very good outcomes. The cheaper it is to evaluate the objective function on examples, the less “control” you need (the more you can just do brute-force search). In the opposite extreme, evaluating options is so cheap that you can check all of them, and output the maximum directly.
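
A toy sketch of that spectrum, parameterized by how many evaluations we can afford (objective and candidates are hypothetical):

```python
def optimize_with_budget(objective, candidates, budget):
    """With a large budget, this is pure selection: check everything and output
    the maximum. With budget 0, it degenerates to a one-shot 'control' situation:
    commit to an option without ever consulting the objective."""
    if budget == 0:
        return candidates[0]  # one shot, no feedback; any cleverness must come from elsewhere
    evaluated = candidates[:budget]  # in between: filter as much as we can afford
    return max(evaluated, key=objective)
```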

While this is somewhat appealing, it doesn’t capture every case. Search algorithms today (such as stochastic gradient descent) often have imperfect feedback. Game-tree search deals with an objective function which is much too costly to evaluate directly (the quality of a move), but can be optimized for nonetheless by recursively searching for good moves in subgames down the game tree (mixed with approximate evaluations such as rollouts or heuristic board evaluations). I still think of both of these as solidly on the “selection process” side of things.

On the control process side, it is possible to have perfect feedback without doing any search. Thermostats realistically have noisy information about the temperature of a room, but you can imagine a case where they get perfect information. It isn’t any less a controller, or more a selection process, for that fact.

Choices Don’t Change Later Choices

Another feature I mentioned was that in selection processes, all options are available to try at any time, and what you look at now does not change how good any option will be later. On the other hand, in a control process, previous choices can totally change how good particular later choices would be (as in reinforcement learning), or change what options are even available (as in game playing).

First, let me set two complications aside.

  • Weird decision theory cases: it is theoretically possible to screw with a search by giving it an objective function which depends on its choices during search. This doesn’t seem that interesting for our purposes here. (And that’s coming from me...)

  • Local search limits the “options” to small modifications of the option just considered. I don’t think this is blurring the lines between search and control; rather, it is more like using a controller within a smart search to try to increase efficiency, as I discussed at the end of the processes-within-processes section. All the options are still “available” at all times; the search algorithm just happens to be one which limits itself to considering a smaller list.

I do think some cases blur the lines here, though. My primary example is the multi-armed bandit problem. This is a special case of the RL problem in which the history doesn’t matter; every option is equally good every time, except for some random noise. Yet, to me, it is still a control problem. Why? Because every decision matters. The feedback you get about how good a particular choice was isn’t just thought of as information; you “actually get” the good/bad outcome each time. That’s the essential character of the multi-armed bandit problem: you have to trade off between experimentally trying options you’re uncertain about vs sticking with the options which seem best so far, because every selection carries weight.
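
A minimal epsilon-greedy sketch of the bandit setting, where every pull’s reward is “banked” rather than merely observed (pull is an assumed callable returning a noisy reward for an arm):

```python
import random

def epsilon_greedy_bandit(pull, n_arms, steps=1000, epsilon=0.1):
    """Every decision matters: each pull actually collects its reward, so we must
    trade exploring uncertain arms against exploiting the best-looking one."""
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = pull(arm)  # this outcome counts toward the total, not just toward learning
        total_reward += reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return total_reward, estimates
```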

This leads me to the next proposed definition.

Offline vs Online

Selection processes are like offline algorithms, whereas control processes are like online algorithms.

With offline algorithms, you only really care about the end results. You are OK running gradient descent for millions of iterations before it starts doing anything cool, so long as it eventually does something cool.

With online algorithms, you care about each outcome individually. You would probably not want to be gradient-descent-training a neural network in live user-servicing code on a website, because live code has to be acceptably good from the start. Even if you can initialize the neural network to something acceptably good, you’d hesitate to run stochastic gradient descent on it live, because stochastic gradient descent can sometimes dramatically decrease performance for a while before improving performance again.

Furthermore, online algorithms have to deal with non-stationarity. This seems suitably like a control issue.

So, selection processes are “offline optimization”, whereas control processes are “online optimization”: optimizing things “as they progress” rather than statically. (Note that the notion of “online optimization” implied by this line of thinking is slightly different from the common definition of online optimization, though related.)

The offline vs online distinction also has a lot to do with the sorts of mistakes I think people are making when they confuse selection processes and control processes. Reinforcement learning, as a subfield of AI, was obviously motivated from a highly online perspective. However, it is very often used as an offline algorithm today, to produce effective agents, rather than as an effective agent. So, there’s been some mismatch between the motivations which shaped the paradigm and actual use. This perspective made it less surprising when black-box optimization beat reinforcement learning on some problems (see also).

This seems like the best definition so far. However, I personally feel like it is still missing something important. Selection vs control feels to me like a type distinction, closer to map-vs-territory.

To give an explicit counterexample: evolution by natural selection is obviously a selection process according to the distinction as I make it, but it seems much more like an online algorithm than an offline one, if we try to judge it as such.

Internal Features vs Context

Returning to the definition in mesa-optimizers (emphasis mine):

Whether a system is an optimizer is a property of its internal structure—what algorithm it is physically implementing—and not a property of its input-output behavior. Importantly, the fact that a system’s behavior results in some objective being maximized does not make the system an optimizer.

The notion of a selection process says a lot about what is actually happening inside a selection process: there is a space of options, which can be enumerated; it is trying them; there is some kind of evaluation; etc.

The notion of a control process, on the other hand, is more externally defined. It doesn’t matter what’s going on inside of the controller. All that matters is how effective it is at what it does.

A selection process—such as a neural network learning algorithm—can be regarded “from outside”, asking questions about how the one output of the algorithm does in the true environment. In fact, this kind of thinking is what we do when we think about generalization error.

Similarly, we can analyze a control process “from inside”, trying to find the pieces which correspond to beliefs, goals, plans, and so on (or postulate what they would look like if they existed—as must be done in the case of controllers which truly lack such moving parts). This is the decision-theoretic view.

In this view, selection vs control doesn’t really cluster different types of object, but rather, different types of analysis. To a large extent, we can cluster objects by what kind of analysis we would more often want to do. However, certain cases (such as a game-playing AI) are best viewed through both lenses (as a controller, in the context of doing well in a real game against a human, and as a selection process, when thinking about the game-tree search).

Overall, I think I’m probably still somewhat confused about the whole selection vs control issue, particularly as it pertains to the question of how decision theory can apply to things in the world.