The Credit Assignment Problem

This post is eventually about partial agency. However, it's been a somewhat tricky point for me to convey; I take the long route. Epistemic status: slightly crazy.

I've occasionally said that everything boils down to credit assignment problems.

One big area which is "basically credit assignment" is mechanism design. Mechanism design is largely about splitting gains from trade in a way which rewards cooperative behavior and punishes uncooperative behavior. Many problems are partly about mechanism design:

  • Building functional organizations;

  • Designing markets to solve problems (such as prediction markets, or kidney-transplant trade programs);

  • Law, and law enforcement;

  • Practical coordination problems, such as splitting rent;

  • Social norms generally;

  • Philosophical issues in ethics/morality (justice, fairness, contractualism, issues in utilitarianism).

Another big area which I claim as "basically credit assignment" (perhaps more controversially) is artificial intelligence.

In the 1970s, John Holland kicked off the investigation of learning classifier systems. Holland had recently invented the Genetic Algorithms paradigm, which applies an evolutionary approach to optimization problems. Classifier systems were his attempt to apply this kind of "adaptive" paradigm (as in "complex adaptive systems") to cognition. Classifier systems added an economic metaphor to the evolutionary one; little bits of thought paid each other for services rendered. The hope was that a complex ecology+economy could develop, solving difficult problems.

One of the main design features on which classifier systems differ is the details of the virtual economy—that is, the credit assignment algorithm. An early proposal was the bucket-brigade algorithm. Reward is assigned to cognitive procedures which produce good outputs. These procedures pass reward back to the procedures which activated them, who similarly pass reward back in turn. This way, the economy supports chains of useful procedures.

Unfortunately, the bucket-brigade algorithm was vulnerable to parasites. Malign cognitive procedures could gain wealth by activating useful procedures without really contributing anything. This problem proved difficult to solve. Taking the economy analogy seriously, we might want cognitive procedures to decide intelligently who to pay for services. But, these are supposed to be itty bitty fragments of our thought process. Deciding how to pass along credit is a very complex task. Hence the need for a pre-specified solution such as bucket-brigade.
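To make the economic metaphor concrete, here is a toy sketch of bucket-brigade-style credit passing. This is a simplification, not Holland's exact algorithm; the rule names and the `bid_fraction` parameter are illustrative:

```python
def bucket_brigade(strengths, chain, reward, bid_fraction=0.1):
    """One round of bucket-brigade-style credit passing (toy version).

    strengths: dict mapping rule name -> accumulated strength.
    chain: rule names in the order they activated each other.
    """
    strengths = dict(strengths)
    # Each rule pays a bid (a fraction of its strength) to the rule
    # that activated it, passing credit backward along the chain.
    for prev, curr in zip(chain, chain[1:]):
        bid = bid_fraction * strengths[curr]
        strengths[curr] -= bid
        strengths[prev] += bid
    # The rule that produced the final output collects the external reward.
    strengths[chain[-1]] += reward
    return strengths

s = bucket_brigade({"a": 10.0, "b": 10.0, "c": 10.0},
                   ["a", "b", "c"], reward=5.0)
```

Note that a parasitic rule inserted into the chain would collect a bid whether or not it contributed anything—exactly the vulnerability described above.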

The difficulty of the credit assignment problem led to a split in the field. Kenneth de Jong and Stephanie Smith founded a new approach, "Pittsburgh style" classifier systems. John Holland's original vision became "Michigan style".

Pittsburgh style classifier systems evolve the entire set of rules, rather than trying to assign credit locally. A set of rules will stand or fall together, based on overall performance. This abandoned John Holland's original focus on online learning. Essentially, the Pittsburgh camp went back to plain genetic algorithms, albeit with a special representation.

(I've been having some disagreements with Ofer, in which Ofer suggests that genetic algorithms are relevant to my recent thoughts on partial agency, and I object on the grounds that the phenomena I'm interested in have to do with online learning, rather than offline. In my imagination, arguments between the Michigan and Pittsburgh camps would have similar content.)

Ok. That was then, this is now. Everyone uses gradient descent these days. What's the point of bringing up a three-decade-old debate about obsolete paradigms in AI?

What Is Credit Assignment?

I've said that classifier systems faced a credit assignment problem. What does that mean, exactly?

The definition I want to use for this essay is:

  • You're engaged in some kind of task;

  • you use some kind of structured strategy (such as a neural network, or a program, or a set of people);

  • you receive some kind of feedback about how well you did;

  • you want to figure out how to use that feedback to improve your strategy.

So, credit assignment is the problem of turning feedback into strategy improvements.

The bucket-brigade algorithm tried to do this locally, meaning, individual itty-bitty pieces get positive/negative credit. In the light of history, we could say that the Michigan/Pittsburgh distinction conflated local-vs-global search with online-vs-offline. There's no necessary connection between those two; online learning is compatible with assignment of local credit.

In practice, two big innovations made the Michigan/Pittsburgh debate obsolete: backprop, and Q-learning. Backprop turned global feedback into local feedback. Q-learning provided a way to assign credit in online contexts.

I think people generally understand the contribution of backprop and its importance. Backprop is essentially the correct version of what bucket-brigade was overtly trying to do: pass credit back along chains. Bucket-brigade wasn't quite right in how it did this, but backprop corrects the problems.

So what's the importance of Q-learning? I want to discuss that in more detail.
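For reference, the core of Q-learning is a single local update (the standard textbook form; the toy states and actions below are made up for illustration):

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Toy example: value flows backward from s1 to the action that led there.
Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 0.0}}
delta = q_update(Q, "s0", "right", r=0.0, s_next="s1")
```

Even though no reward arrives at this step, the learned value estimate for `s1` provides an immediate learning signal for the action taken in `s0`—this is how Q-learning assigns credit online.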

The Conceptual Difficulty of 'Online Search'

In online learning, you are repeatedly producing outputs of some kind (call them "actions") while repeatedly getting feedback of some kind (call it "reward"). But, you don't know how to associate particular actions (or combinations of actions) with particular rewards. I might take the critical action at time 12, and not see the payoff until time 32.

In offline learning, you can solve this with a sledgehammer: you can take the total reward over everything, with one fixed internal architecture. You can try out different internal architectures and see how well each does. (This may be far from the most efficient way of doing things, even in the offline case; but, you can do it.)

Basically, in offline learning, you have a function you can optimize. In online learning, you don't.
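As a sketch of that difference: offline, "try architectures and keep the best" is just optimization of a fixed scoring function (the candidates and the scoring function here are made-up toys):

```python
def offline_search(candidates, evaluate):
    # Because the task can be replayed in full, each candidate strategy
    # gets a definite total-reward score, and we just keep the best.
    return max(candidates, key=evaluate)

# Toy task: total reward is highest when the parameter equals 3.
best = offline_search([0, 1, 2, 3, 4], evaluate=lambda x: -(x - 3) ** 2)
```

Online, there is no such `evaluate` to call: each action is taken once, and the counterfactual scores of the alternatives are never observed.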

Backprop is just a computationally efficient way to do hillclimbing search, where we repeatedly look for small steps which improve the overall fitness. But how do you do this if you don't have a fitness function?

Q-learning and other reinforcement learning techniques provide a way to define the equivalent of a fitness function for online problems, so that you can learn.

Models to the Rescue

How do you solve the problem of associating rewards with actions?

I'm going to make a bold claim: you can't solve the action/reward matching problem without some kind of model.

For example, if we make an episodic assumption, we can assign rewards within an episode boundary to the actions within that same episode boundary.

Q-learning makes an assumption that the state is fully observable, amongst other assumptions.

Naturally, we would like to reduce the strengths of the assumptions we have to make as much as we can. One way is to look at increasingly rich model classes. AIXI uses all computable models. But maybe "all computable models" is still too restrictive; we'd like to get results without assuming a grain of truth. (That's why I am not really discussing Bayesian models much in this post; I don't want to assume a grain of truth.) So we back off even further, and use logical induction. Ok, sure.

But wouldn't the best way be to try to learn without models at all? That way, we reduce our "modeling assumptions" to zero.

After all, there's something in machine learning called "model-free learning", right?

Here's where my bold claim comes in: I'm claiming that even "model-free" methods actually have a "model" of sorts.

How does model-free learning work? Well, often you work with a simulable environment, which means you can estimate the quality of a policy by running it many times, and use algorithms such as policy-gradient to learn. This is called "model-free learning" because the learning part of the algorithm doesn't try to predict the consequences of actions; you're just learning which action to take. From our perspective here, though, this is 100% cheating; you can only learn because you have a good model of the environment.

A more general approach to model-free learning is actor-critic learning. The "actor" is the policy we are learning. The "critic" is a learned estimate of how good things are looking given the history. IE, we learn to estimate the expected value—not just the next reward, but the total future discounted reward.

Unlike the reward, the expected value solves the credit assignment problem for us. Imagine we can see the "true" expected value. If we take an action and then the expected value increases, we know the action was good (in expectation). If we take an action and the expected value decreases, we know it was bad (in expectation).

So, actor-critic works by (1) learning to estimate the expected value; (2) using the current estimated expected value to give feedback to learn a policy.

What I want to point out here is that the critic still has "model" flavor. Actor-critic is called "model-free" because nothing is explicitly trained to anticipate the sensory observations, or the world-state. However, the critic is learning to predict; it's just that all we need to predict is expected value.
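A minimal sketch of steps (1) and (2), with a tabular critic and a softmax actor; all state and action names here are made up for illustration:

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic_step(V, prefs, s, a, r, s_next,
                      alpha_v=0.1, alpha_pi=0.1, gamma=0.9):
    # (1) Critic: the TD error compares reward-plus-new-value to old value.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta
    # (2) Actor: reinforce the taken action in proportion to the TD error
    # (the policy-gradient direction for a softmax policy).
    probs = softmax(prefs[s])
    for act in prefs[s]:
        indicator = 1.0 if act == a else 0.0
        prefs[s][act] += alpha_pi * delta * (indicator - probs[act])
    return delta

V = {"s0": 0.0, "s1": 1.0}
prefs = {"s0": {"left": 0.0, "right": 0.0}}
delta = actor_critic_step(V, prefs, "s0", "right", r=0.0, s_next="s1")
```

Note that the actor never sees raw reward at all; its entire learning signal is the critic's prediction error.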

Where Updates Come From

Here begins the crazier part of this post. This is all intuitive/conjectural.

Claim: in order to learn, you need to obtain an "update"/"gradient", which is a direction (and magnitude) you can shift in which is more likely than not an improvement.

Claim: predictive learning gets gradients "for free"—you know that you want to predict things as accurately as you can, so you move in the direction of whatever you see. With Bayesian methods, you increase the weight of hypotheses which would have predicted what you saw; with gradient-based methods, you get a gradient in the direction of what you saw (and away from what you didn't see).
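In the gradient-based case, this "free" gradient is literally the derivative of a prediction loss—for squared error, it points from the prediction toward the observation. A one-line illustration (the learning rate is an arbitrary toy value):

```python
def predictive_update(pred, observed, lr=0.1):
    grad = pred - observed        # d/dpred of (pred - observed)^2 / 2
    return pred - lr * grad       # step toward what was actually seen

new_pred = predictive_update(0.0, 1.0)
```

No counterfactual information is needed: the observation itself tells the predictor which way to move.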

Claim: if you're learning to act, you do not similarly get gradients "for free". You take an action, and you see results of that one action. This means you fundamentally don't know what would have happened had you taken alternate actions, which means you don't have a direction to move your policy in. You don't know whether alternatives would have been better or worse. So, the rewards you observe seem insufficient to determine how you should learn.

Claim: you have to get gradients from a source that already has gradients. We saw that model-free learning works by splitting up the task into (1) learning to anticipate expected value; (2) learning a good policy via the gradients we can get from (1).

What it means for a learning problem to "have gradients" is just that the feedback you get tells you how to learn. Predictive learning problems (supervised or unsupervised) have this; they can just move toward what's observed. Offline problems have this; you can define one big function which you're trying to optimize. Learning to act online doesn't have this, however, because it lacks counterfactuals.

The Gradient Gap

(I'm going to keep using the terms 'gradient' and 'update' in a more or less interchangeable way here; this is at a level of abstraction where there's not a big distinction.)

I'm going to call the "problem" the gradient gap. I want to call it a problem, even though we know how to "close the gap" via predictive learning (whether model-free or model-based). The issue with this solution is only that it doesn't feel elegant. It's weird that you have to run two different backprop updates (or whatever learning procedures you use); one for the predictive component, and another for the policy. It's weird that you can't "directly" use feedback to learn to act.

Why should we be interested in this "problem"? After all, this is a basic point in decision theory: to maximize utility under uncertainty, you need probability.

One part of it is that I want to scrap classical ("static") decision theory and move to a more learning-theoretic ("dynamic") view. In both AIXI and logical-induction based decision theories, we get a nice learning-theoretic foundation for the epistemics (Solomonoff induction/logical induction), but we tack on a non-learning decision-making unit on top. I have become skeptical of this approach. It puts the learning into a nice little box labeled "epistemics" and then tries to make a decision based on the uncertainty which comes out of the box. I think maybe we need to learn to act in a more fundamental fashion.

A symptom of this, I hypothesize, is that AIXI and logical induction DT don't have very good learning-theoretic properties. [AIXI's learning problems; LIDT's learning problems.] You can't say very much to recommend the policies they learn, except that they're optimal according to the beliefs of the epistemics box—a fairly trivial statement, given that that's how you decide what action to take in the first place.

Now, in classical decision theory, there's a nice picture where the need for epistemics emerges nicely from the desire to maximize utility. The complete class theorem starts with radical uncertainty (ie, non-quantitative), and derives probabilities from a willingness to take Pareto improvements. That's great! I can tell you why you should have beliefs, on pragmatic grounds! What we seem to have in machine learning is a less nice picture, in which we need epistemics in order to get off the ground, but can't justify the results without circular reliance on epistemics.

So the gap is a real issue—it means that we can have nice learning theory when learning to predict, but we lack nice results when learning to act.

This is the basic problem of credit assignment. Evolving a complex system, you can't determine which parts to credit for success or failure (to decide what to tweak) without a model. But the model is bound to be a lot of the interesting part! So we run into big problems, because we need "interesting" computations in order to evaluate the pragmatic quality/value of computations, but we can't get interesting computations to get ourselves started, so we need to learn...

Essentially, we seem doomed to run on a stratified credit assignment system, where we have an "incorruptible" epistemic system (which we can learn because we get those gradients "for free"). We then use this to define gradients for the instrumental part.

A stratified system is dissatisfying, and impractical. First, we'd prefer a more unified view of learning. It's just kind of weird that we need the two parts. Second, there's an obstacle to pragmatic/practical considerations entering into epistemics. We need to focus on predicting important things; we need to control the amount of processing power spent; things in that vein. But (on the two-level view) we can't allow instrumental concerns to contaminate epistemics! We risk corruption! As we saw with bucket-brigade, it's easy for credit assignment systems to allow parasites which destroy learning.

A more unified credit assignment system would allow those things to be handled naturally, without splitting into two levels; as things stand, any involvement of pragmatic concerns in epistemics risks the viability of the whole system.

Tiling Concerns & Full Agency

From the perspective of full agency (ie, the negation of partial agency), a system which needs a protected epistemic layer sounds suspiciously like a system that can't tile. You look at the world, and you say: "how can I maximize utility?" You look at your beliefs, and you say: "how can I maximize accuracy?" That's not a consequentialist agent; that's two different consequentialist agents! There can only be one king on the chessboard; you can only serve one master; etc.

If it turned out we really really need two-level systems to get full agency, this would be a pretty weird situation. "Agency" would seem to be only an illusion which can only be maintained by crippling agents and giving them a split-brain architecture where an instrumental task-monkey does all the important stuff while an epistemic overseer supervises. An agent which "breaks free" would then free itself of the structure which allowed it to be an agent in the first place.

On the other hand, from a partial-agency perspective, this kind of architecture could be perfectly natural. IE, if you have a learning scheme from which total agency doesn't naturally emerge, then there isn't any fundamental contradiction in setting up a system like this.

Part of the (potentially crazy) claim here is that having models always gives rise to some form of myopia. Even logical induction, which seems quite unrestrictive, makes LIDT fail problems such as ASP, making it myopic according to the second definition of my previous post. (We can patch this with LI policy selection, but for any particular version of policy selection, we can come up with decision problems for which it is "not updateless enough".) You could say it's myopic "across logical time", whatever that means.

If it were true that "learning always requires a model" (in the sense that learning-to-act always requires either learning-to-predict or hard-coded predictions), and if it were true that "models always give rise to some form of myopia", then this would confirm my conjecture in the previous post (that no learning scheme incentivizes full agency).

This is all pretty out there; I'm not saying I believe this with high probability.

Evolution & Evolved Agents

Evolution is a counterexample to this view: evolution learns the policy "directly" in essentially the way I want. This is possible because evolution "gets the gradients for free" just like predictive learning does: the "gradient" here is just the actual reproductive success of each genome.

Unfortunately, we can't just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they're optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.

Does this mean the "gradient gap" is a problem only for artificial intelligence, not for natural agents? No. If it's true that learning to act requires a 2-level system, then evolved agents would need a 2-level system in order to learn within their lifespan; they can't directly use the gradient from evolution, since it requires them to die.

Also, note that evolution seems myopic. (This seems complicated, so I don't want to get into pinning down exactly in which senses evolution is myopic here.) So, the case of evolution seems compatible with the idea that any gradients we can actually get are going to incentivize myopic solutions.

Similar comments apply to markets vs firms.