A Crash Course in the Neuroscience of Human Motivation

[PDF of this ar­ti­cle up­dated Aug. 23, 2011]


When­ever I write a new ar­ti­cle for Less Wrong, I’m pul­led in two op­po­site di­rec­tions.

One force pulls me to­ward writ­ing short, ex­cit­ing posts with lots of brain candy and just one main point. Eliezer has done that kind of thing very well many times: see Mak­ing Beliefs Pay Rent, Hind­sight De­val­ues Science, Prob­a­bil­ity is in the Mind, Ta­boo Your Words, Mind Pro­jec­tion Fal­lacy, Guess­ing the Teacher’s Pass­word, Hold Off on Propos­ing Solu­tions, Ap­plause Lights, Dis­solv­ing the Ques­tion, and many more.

Another force pulls me to­ward writ­ing long, fac­tu­ally dense posts that fill in as many of the pieces of a par­tic­u­lar ar­gu­ment in one fell swoop as pos­si­ble. This is largely be­cause I want to write about the cut­ting edge of hu­man knowl­edge but I keep re­al­iz­ing that the in­fer­en­tial gap is larger than I had an­ti­ci­pated, and I want to fill in that in­fer­en­tial gap quickly so I can get to the cut­ting edge.

For ex­am­ple, I had to draw on dozens of Eliezer’s posts just to say I was head­ing to­ward my metaethics se­quence. I’ve also pub­lished 21 new posts (many of them quite long and heav­ily re­searched) writ­ten speci­fi­cally be­cause I need to re­fer to them in my metaethics se­quence.1 I tried to make these posts in­ter­est­ing and use­ful on their own, but my pri­mary mo­ti­va­tion for writ­ing them was that I need them for my metaethics se­quence.

And now I’ve writ­ten only four posts2 in my metaethics se­quence and already the in­fer­en­tial gap to my next post in that se­quence is huge again. :(

So I’d like to try an ex­per­i­ment. I won’t do it of­ten, but I want to try it at least once. In­stead of writ­ing 20 more short posts be­tween now and the next post in my metaethics se­quence, I’ll at­tempt to fill in a big chunk of the in­fer­en­tial gap to my next metaethics post in one fell swoop by writ­ing a long tu­to­rial post (a la Eliezer’s tu­to­ri­als on Bayes’ The­o­rem and tech­ni­cal ex­pla­na­tion).3

So if you’re not up for a 20-page tu­to­rial on hu­man mo­ti­va­tion, this post isn’t for you, but I hope you’re glad I both­ered to write it for the sake of oth­ers. If you are in the mood for a 20-page tu­to­rial on hu­man mo­ti­va­tion, please pro­ceed.

Who knows what I want to do? Who knows what anyone wants to do? How can you be sure about something like that? Isn’t it all a question of brain chemistry, signals going back and forth, electrical energy in the cortex? How do you know whether something is really what you want to do or just some kind of nerve impulse in the brain? Some minor little activity takes place somewhere in this unimportant place in one of the brain hemispheres and suddenly I want to go to Montana or I don’t want to go to Montana.

- Don DeLillo, White Noise


How do we value things, and choose between options? Philosophers, economists, and psychologists have long tried to answer these questions. But human behavior continues to defy our most subtle models of it, and the algorithms producing our behavior remain hidden in a black box.

But now, neu­ro­scien­tists are di­rectly mea­sur­ing the neu­rons whose firing rates en­code value and pro­duce our choices. We know a lot more about the neu­ro­science of hu­man mo­ti­va­tion than you might think. Now we can peer di­rectly into the black box of hu­man mo­ti­va­tion, and be­gin (dimly) to read our own source code.

The neu­ro­science of hu­man mo­ti­va­tion has im­pli­ca­tions for philos­o­phy of mind and ac­tion, for sci­en­tific self-help, and for metaethics and Friendly AI. (We don’t re­ally know what we want, and look­ing di­rectly at the al­gorithms that pro­duce hu­man want­ing might help in solv­ing this mys­tery.)

So, I wrote a crash course in the neu­ro­science of hu­man mo­ti­va­tion.

The purpose of this document is not to argue for any of the conclusions presented within it. That would require not a long blog post but a couple of 500-page books — say, Foundations of Neuroeconomic Analysis and Handbook of Reward and Decision Making (my two greatest sources for this post).4

In­stead, I merely want to sum­ma­rize the cur­rent main­stream sci­en­tific pic­ture on the neu­ro­science of hu­man mo­ti­va­tion, ex­plain some of the con­cepts it uses, and tell a few sto­ries about how our cur­rent pic­ture of hu­man mo­ti­va­tion de­vel­oped.

As you read this, I hope that many ques­tions and ob­jec­tions will come to mind, be­cause it’s not the full story. That’s why I went to the trou­ble of link­ing to PDFs of al­most all my sources (see Refer­ences): so you can check the full data and the full ar­gu­ments your­self if you like.

This doc­u­ment is long. You may pre­fer to read it in sec­tions.


  1. Folk Psychology

  2. Neo­clas­si­cal Economics

  3. Be­hav­iorism and Re­in­force­ment Learning

  4. Re­in­force­ment Learn­ing and De­ci­sion Theory

  5. The Turn to the Brain

  6. Heb­bian Learning

  7. Ex­pected Utility in Neurons

  8. Real-Time Up­dates to Ex­pected Utility

  9. Argmax and Reser­va­tion Price

  10. Ran­dom Utility

  11. Discounting

  12. Rel­a­tive and Ab­solute Utility

  13. Normalization

  14. Are Ac­tions Choices?

  15. The Pri­mate Choice Mechanism: A Brief Review

  16. Marginal Utility and Refer­ence Dependence

  17. Valu­a­tion in the Brain

  18. Sum­mary and Re­search Directions

Folk Psychology

There are these things called ‘hu­mans’ on planet Earth. They un­dergo metabolism and cell growth. They pro­duce waste. They main­tain home­osta­sis. They re­pro­duce. They move. They com­mu­ni­cate. Some­times they have pillow fights.

Some of these hu­man pro­cesses are ‘au­to­matic’, like cell growth and breath­ing. Other pro­cesses are ‘in­ten­tional’ or ‘willed’, like mov­ing and com­mu­ni­cat­ing and hav­ing pillow fights. We call these lat­ter pro­cesses in­ten­tional ac­tions, or sim­ply ac­tions. Some­times we’re not sure where to draw the line be­tween au­to­matic pro­cesses and ac­tions, but this should be­come clearer as we learn more. In the mean­time, we ask...

How can we ex­plain hu­man ac­tions?

One pop­u­lar ex­pla­na­tion is ‘folk psy­chol­ogy.’ Folk psy­chol­ogy posits that we hu­mans have be­liefs and de­sires, and that we are mo­ti­vated to do what we be­lieve will fulfill our de­sires.

I de­sire to eat a cookie. I be­lieve I can fulfill that de­sire if I walk to the kitchen and put one of the cook­ies there into my mouth. So I am mo­ti­vated to walk to the kitchen and put a cookie in my mouth.

Of course there are com­pli­ca­tions. For ex­am­ple I have mul­ti­ple de­sires. Sup­pose I de­sire to eat a cookie and be­lieve there are cook­ies in the kitchen. But I also de­sire to re­main sit­ting com­fortably in the liv­ing room. Can I satisfy both de­sires? I also be­lieve that if I nicely ask my friend in the kitchen to bring me a cookie, she will. So I ask her to bring me a cookie and I be­gin to eat it, with­out hav­ing to leave the comfy liv­ing room sofa. We still ex­plain my be­hav­ior with con­structs like ‘be­liefs’ and ‘de­sires’, but we con­sider more than one of each to do so.

Most of us use folk psy­chol­ogy ev­ery day to suc­cess­fully pre­dict hu­man be­hav­ior. I be­lieve that my friend de­sires to do nice things for me on oc­ca­sion if they’re not too much trou­ble, and I be­lieve that my friend, once I tell her I want a cookie, will be­lieve she can be nice to me with­out much trou­ble if she brings me a cookie from the kitchen. So, I pre­dict that my friend will bring me a cookie when I ask her. So I ask her, and be­hold! My pre­dic­tion was cor­rect. I am hap­pily eat­ing a cookie on the sofa.

But folk psy­chol­ogy (FP) faces some prob­lems.5 Con­sider its con­text in his­tory:

The pre­sumed do­main of FP used to be much larger than it is now. In prim­i­tive cul­tures, the be­hav­ior of most of the el­e­ments of na­ture were un­der­stood in in­ten­tional terms. The wind could know anger, the moon jeal­ousy, the river gen­eros­ity… Th­ese were not metaphors… the an­i­mistic ap­proach to na­ture has dom­i­nated our his­tory, and it is only in the last two or three thou­sand years that we have re­stricted FP’s literal ap­pli­ca­tion to the do­main of the higher an­i­mals.

[Even still,] the FP of the Greeks is essentially the FP we use today… This is a very long period of stagnation and infertility for any theory to display, especially when faced with such an enormous backlog of anomalies and mysteries in its own explanatory domain… To use Imre Lakatos’ terms, FP is a stagnant or degenerating research program, and has been for millennia.

Con­sider also its prospects for in­ter-the­o­retic re­duc­tion:

If we ap­proach homo sapi­ens from the per­spec­tive of nat­u­ral his­tory and the phys­i­cal sci­ences, we can tell a co­her­ent story of its con­sti­tu­tion, de­vel­op­ment, and be­hav­ioral ca­pac­i­ties which en­com­passes par­ti­cle physics, atomic and molec­u­lar the­ory, or­ganic chem­istry, evolu­tion­ary the­ory, biol­ogy, phys­iol­ogy, and ma­te­ri­al­is­tic neu­ro­science. The story, though still rad­i­cally in­com­plete, is already ex­tremely pow­er­ful, out­perform­ing FP at many points even in its own do­main. And it is de­liber­ately… co­her­ent with the rest of our de­vel­op­ing world pic­ture. In short, the great­est the­o­ret­i­cal syn­the­sis in [his­tory] is cur­rently in our hands…

But FP is no part of this grow­ing syn­the­sis. Its in­ten­tional cat­e­gories stand mag­nifi­cently alone, with­out visi­ble prospect of re­duc­tion to that larger cor­pus. A suc­cess­ful re­duc­tion can­not be ruled out, in my view, but FP’s ex­plana­tory im­po­tence and long stag­na­tion in­spire lit­tle faith that its cat­e­gories will find them­selves neatly re­flected in the frame­work of neu­ro­science. On the con­trary, one is re­minded of how alchemy must have looked as el­e­men­tal chem­istry was tak­ing form, how Aris­tote­lean cos­mol­ogy must have looked as clas­si­cal me­chan­ics was be­ing ar­tic­u­lated, or how the vi­tal­ist con­cep­tion of life must have looked as or­ganic chem­istry marched for­ward.

Fi­nally, con­sider the prob­lem of habit. I sit at my com­puter and want to type my name, ‘Luke.’ How­ever, I have just used a spe­cial pro­gram to switch the func­tion of the keys la­beled L and P so that they will in­put the other char­ac­ter in­stead (so that I can play a prank on my friend, who will be us­ing my com­puter shortly). I be­lieve that typ­ing the key la­beled L will in­put P in­stead, but nev­er­the­less when I type my name my fingers fall into their fa­mil­iar habit and I end up typ­ing my name as ‘Puke.’ My act of typ­ing was in­ten­tional, and yet I didn’t do what I be­lieved would fulfill my de­sire to type my name.

Folk psy­chol­ogy faces both suc­cesses and failures in ex­plain­ing hu­man ac­tion. Hope­fully we can do bet­ter.

Neo­clas­si­cal Economics

Folk psy­chol­ogy was up­dated and quan­tified by neo­clas­si­cal eco­nomics. To sum­ma­rize:

One [as­sump­tion of] neo­clas­si­cal eco­nomics is “ra­tio­nal­ity,” in which in­di­vi­d­u­als are said to choose al­ter­na­tives that max­i­mize ex­pected util­ities. In par­tic­u­lar, the neo­clas­si­cal view is that in­di­vi­d­u­als rank all pos­si­ble al­ter­na­tives ac­cord­ing to how much satis­fac­tion they will bring and then choose the al­ter­na­tive that [they ex­pect] will bring the most satis­fac­tion or util­ity...6

Let’s review this notion of maximizing expected utility. Suppose I can choose one of two boxes sitting before me, red and blue. There is a 10% chance the red box contains a million dollars, and a 90% chance it contains nothing. As for the blue box, I am certain it contains $10,000. The ‘expected value’ of choosing the red box is (0.1 × $1,000,000) + (0.9 × $0), which is equal to $100,000. The expected value of choosing the blue box is (1 × $10,000), or $10,000. An agent that chose whatever had the highest expected value would choose the red box, which has 10 times the expected value of the blue box ($100,000 vs. $10,000).

But hu­mans don’t value things only ac­cord­ing to their dol­lar value. A mil­lion dol­lars might have 10 times the ob­jec­tive value of $100,000, but it might have less than 10 times the sub­jec­tive value of $100,000 be­cause af­ter $100,000 you only care a lit­tle how much more wealthy you are.

Or, you might be risk averse. You might pre­fer a sure thing to some­thing that is un­cer­tain. So a 10% chance of a mil­lion dol­lars might be worth less — in sub­jec­tive value — than a 100% chance of $10,000. If you are risk averse you might choose the blue box be­cause it has higher ex­pected sub­jec­tive value even though it has lower ex­pected ob­jec­tive value.
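The arithmetic above can be sketched in a few lines of Python. The concave utility function here (log(1+x)) is an illustrative assumption, not anything from the text; it stands in for “caring less about each additional dollar,” and is enough to make a risk-averse agent prefer the certain blue box despite its lower expected dollar value.

```python
import math

def expected_value(lottery):
    """Expected dollar value of a lottery given as (probability, payoff) pairs."""
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, u):
    """Expected subjective value under a utility function u."""
    return sum(p * u(x) for p, x in lottery)

red_box = [(0.1, 1_000_000), (0.9, 0)]   # 10% chance of a million dollars
blue_box = [(1.0, 10_000)]               # $10,000 for certain

u = math.log1p  # an illustrative concave utility: extra dollars matter less and less

print(expected_value(red_box))   # 100000.0
print(expected_value(blue_box))  # 10000.0
print(expected_utility(red_box, u) < expected_utility(blue_box, u))  # True
```

With a sufficiently concave u, the ranking by expected utility reverses the ranking by expected dollar value, which is exactly the risk-averse choice of the blue box.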

We call ob­jec­tive value sim­ply ‘value’. We call sub­jec­tive value ‘util­ity.’

Neoclassical economics quantifies folk psychology by measuring the strength of belief with probability and by measuring the strength of desire with utility. It then says that humans act so as to maximize expected utility, a measure that combines the utility of a particular thing with your subjective probability of getting it.7

This neoclassical model of human behavior has faced many challenges, and is regularly revised in the face of new evidence.8 For example, Loewenstein (1987) found that if students were asked to place a value on the opportunity to kiss a celebrity of their choice 1-5 days in the future, they placed the highest value on a kiss in 3 days. This didn’t fit any existing neoclassical models of utility, but it was explained when Caplin & Leahy (2001) incorporated “anticipatory feeling” into the neoclassical model: the students got some utility from anticipating the kiss with the celebrity (but also, as usual, discounted the utility of a reward the further away it was in the future), and this is why they didn’t want the kiss right away.

Keep in mind that economists don’t ar­gue that we ac­tu­ally com­pute the ex­pected util­ity of each op­tion be­fore us and then choose the best one, but that we always act “as if” we were do­ing that.9

But sometimes we don’t even act “as if” we are obeying the axioms of neoclassical economics. For example, the independence axiom of expected utility theory says that if you prefer an apple over an orange, then you must prefer Gamble A (72% chance you get an apple, otherwise you get a cat) over Gamble B (72% chance you get an orange, otherwise you get a cat). But Allais (1953) found that subjects do violate this basic assumption under some conditions.
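The post doesn’t spell out Allais’s gambles, so here is a sketch using the classic version of the paradox (outcomes of $0, $1M, and $5M; these specific lotteries are the standard textbook ones, not taken from this post). The point: for any assignment of utilities to the three outcomes, the expected-utility gap between the first pair of gambles equals the gap between the second pair, so an expected-utility maximizer must rank both pairs the same way. Many real subjects don’t.

```python
def eu(lottery, u):
    """Expected utility of a lottery given as (probability, outcome) pairs."""
    return sum(p * u[x] for p, x in lottery)

# The classic Allais gambles (outcomes in millions of dollars).
g1a = [(1.00, 1)]                        # $1M for certain
g1b = [(0.10, 5), (0.89, 1), (0.01, 0)]
g2a = [(0.11, 1), (0.89, 0)]
g2b = [(0.10, 5), (0.90, 0)]

# Whatever utilities we assign to the three outcomes, the expected-utility
# gap between g1a and g1b equals the gap between g2a and g2b, so a consistent
# expected-utility maximizer must rank both pairs the same way. Many real
# subjects prefer g1a over g1b *and* g2b over g2a, violating independence.
u = {0: 0.0, 1: 0.9, 5: 1.0}  # one arbitrary candidate utility assignment
gap1 = eu(g1a, u) - eu(g1b, u)
gap2 = eu(g2a, u) - eu(g2b, u)
print(abs(gap1 - gap2) < 1e-12)  # True
```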

Such vi­o­la­tions of the ba­sic ax­ioms of neo­clas­si­cal eco­nomics led to the de­vel­op­ment of be­hav­ioral eco­nomics and the­o­ries like Kah­ne­man and Tver­sky’s (1979) prospect the­ory,10 which tran­scends some as­sump­tions of the neo­clas­si­cal model. But these new the­o­ries don’t fit the data perfectly, ei­ther.11

The mod­els of hu­man mo­ti­va­tion we’ve sur­veyed so far are con­cep­tu­ally re­lated to de­ci­sion the­ory (be­liefs and de­sires, or prob­a­bil­ities and util­ities), so I’ll call them ‘de­ci­sion-the­o­retic mod­els’ of hu­man mo­ti­va­tion. We’ll dis­cuss de­ci­sion-the­o­retic mod­els again when we fi­nally get to the topic of neu­ro­science, but for now I want to dis­cuss a differ­ent ap­proach to mo­ti­va­tion.

Be­hav­iorism and Re­in­force­ment Learning

While neo­clas­si­cal economists for­mu­lated ex­pected util­ity the­ory, be­hav­iorist psy­chol­o­gists de­vel­oped a differ­ent set of ex­pla­na­tions for hu­man ac­tion. Though be­hav­iorists were wrong when they said that sci­ence can’t talk about men­tal ac­tivity or men­tal states, you can char­i­ta­bly think of be­hav­iorists as play­ing a game of Ra­tion­al­ist’s Ta­boo with con­structs of folk psy­chol­ogy like “want” or “fear” in or­der to get at phe­nom­ena more ap­pro­pri­ate for quan­tifi­ca­tion in tech­ni­cal ex­pla­na­tion. Also, the be­hav­iorist ap­proach led to ‘re­in­force­ment learn­ing’, an im­por­tant con­cept in the neu­ro­science of hu­man mo­ti­va­tion.

Be­fore I ex­plain re­in­force­ment learn­ing, let’s re­call op­er­ant con­di­tion­ing:

Stick a pi­geon in a box with a lever and some as­so­ci­ated ma­chin­ery (a “Sk­in­ner box”). The pi­geon wan­ders around, does var­i­ous things, and even­tu­ally hits the lever. Deli­cious sugar wa­ter squirts out. The pi­geon con­tinues wan­der­ing about and even­tu­ally hits the lever again. Another squirt of deli­cious sugar wa­ter. Even­tu­ally it per­co­lates into its tiny pi­geon brain that maybe push­ing this lever makes sugar wa­ter squirt out. It starts push­ing the lever more and more, each push con­tin­u­ing to con­vince it that yes, this is a good idea.

Con­sider a sec­ond, less lucky pi­geon. It, too, wan­ders about in a box and even­tu­ally finds a lever. It pushes the lever and gets an elec­tric shock. Eh, maybe it was a fluke. It pushes the lever again and gets an­other elec­tric shock. It starts think­ing “Maybe I should stop press­ing that lever.” The pi­geon con­tinues wan­der­ing about the box do­ing any­thing and ev­ery­thing other than push­ing the shock lever.

The ba­sic con­cept of op­er­ant con­di­tion­ing is that an an­i­mal will re­peat be­hav­iors that give it re­ward, but avoid be­hav­iors that give it pun­ish­ment.

Be­hav­iorism died in the wake of cog­ni­tive psy­chol­ogy, but its ap­proach to mo­ti­va­tion turned out to be very use­ful in the field of ar­tifi­cial in­tel­li­gence, where it is called re­in­force­ment learn­ing:

Re­in­force­ment learn­ing is learn­ing what to do — how to map situ­a­tions to ac­tions — so as to max­i­mize a nu­mer­i­cal re­ward sig­nal. The learner is not told which ac­tions to take, as in most forms of ma­chine learn­ing, but in­stead must dis­cover which ac­tions yield the most re­ward by try­ing them. In the most in­ter­est­ing and challeng­ing cases, ac­tions may af­fect not only the im­me­di­ate re­ward, but also the next situ­a­tion and, through that, all sub­se­quent re­wards. Th­ese two char­ac­ter­is­tics — trial-and-er­ror search and de­layed re­ward — are the two most im­por­tant dis­t­in­guish­ing fea­tures of re­in­force­ment learn­ing.

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best.12

In ad­di­tion to the agent and its en­vi­ron­ment, there are four ma­jor com­po­nents of a re­in­force­ment learn­ing sys­tem:

...a policy, a re­ward func­tion, a value func­tion, and, op­tion­ally, a model of the en­vi­ron­ment.

A policy defines the learn­ing agent’s way of be­hav­ing at a given time. Roughly speak­ing, a policy is a map­ping from per­ceived states of the en­vi­ron­ment to ac­tions to be taken when in those states...

A re­ward func­tion defines the goal in a re­in­force­ment learn­ing prob­lem. Roughly speak­ing, it maps per­ceived states (or state-ac­tion pairs) of the en­vi­ron­ment to a sin­gle num­ber, a re­ward, in­di­cat­ing the in­trin­sic de­sir­a­bil­ity of the state. A re­in­force­ment-learn­ing agent’s sole ob­jec­tive is to max­i­mize the to­tal re­ward it re­ceives in the long run. …[A re­ward func­tion may] be used as a ba­sis for chang­ing the policy. For ex­am­ple, if an ac­tion se­lected by the policy is fol­lowed by low re­ward, then the policy may be changed to se­lect some other ac­tion in that situ­a­tion in the fu­ture...

Whereas a re­ward func­tion in­di­cates what is good in an im­me­di­ate sense, a value func­tion speci­fies what is good in the long run. Roughly speak­ing, the value of a state is the to­tal amount of re­ward an agent can ex­pect to ac­cu­mu­late over the fu­ture start­ing from that state. Whereas re­wards de­ter­mine the im­me­di­ate, in­trin­sic de­sir­a­bil­ity of en­vi­ron­men­tal states, val­ues in­di­cate the long-term de­sir­a­bil­ity of states af­ter tak­ing into ac­count the states that are likely to fol­low, and the re­wards available in those states. For ex­am­ple, a state might always yield a low im­me­di­ate re­ward, but still have a high value be­cause it is reg­u­larly fol­lowed by other states that yield high re­wards. Or the re­verse could be true...

Re­wards are in a sense pri­mary, whereas val­ues, as pre­dic­tions of re­wards, are sec­ondary. Without re­wards there could be no val­ues, and the only pur­pose of es­ti­mat­ing val­ues is to achieve more re­ward. Nev­er­the­less, it is val­ues with which we are most con­cerned when mak­ing and eval­u­at­ing de­ci­sions. Ac­tion choices are made on the ba­sis of value judg­ments. We seek ac­tions that bring about states of high­est value, not high­est re­ward, be­cause these ac­tions ob­tain for us the great­est amount of re­ward over the long run...

...The fourth and fi­nal el­e­ment of some re­in­force­ment learn­ing sys­tems is a model of the en­vi­ron­ment. This is some­thing that mimics the be­hav­ior of the en­vi­ron­ment. For ex­am­ple, given a state and ac­tion, the model might pre­dict the re­sul­tant next state and next re­ward. Models are used for plan­ning, by which we mean any way of de­cid­ing on a course of ac­tion by con­sid­er­ing pos­si­ble fu­ture situ­a­tions be­fore they are ac­tu­ally ex­pe­rienced.

Want an ex­am­ple? Here is how a re­in­force­ment learn­ing agent would learn to play Tic-Tac-Toe:

First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state’s current value, and the whole table is the learned value function. State A has higher value than state B, or is considered ‘better’ than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row… the correct probability is 0, as we cannot win from them. We set the initial values of all the other states, the nonterminals, to 0.5, representing an informed guess that we have a 50% chance of winning.

Now we play many games against the op­po­nent. To se­lect our moves we ex­am­ine the states that would re­sult from each of our pos­si­ble moves (one for each blank space on the board) and look up their cur­rent val­ues in the table. Most of the time we move greed­ily, se­lect­ing the move that leads to the state with great­est value, that is, with the high­est es­ti­mated prob­a­bil­ity of win­ning. Oc­ca­sion­ally, how­ever, we se­lect ran­domly from one of the other moves in­stead; these are called ex­plo­ra­tory moves be­cause they cause us to ex­pe­rience states that we might oth­er­wise never see.

A se­quence of Tic-Tac-Toe moves might look like this:13

Solid lines are the moves our re­in­force­ment learn­ing agent made, and dot­ted lines are moves it con­sid­ered but did not make. The sec­ond move was an ex­plo­ra­tory move: it was taken even though an­other sibling move, that lead­ing to e*, was ranked higher.

While playing, the agent changes the values assigned to the states it finds itself in. To improve its estimates concerning the probability of winning from various states, it ‘backs up’ the value of the state after each greedy move to the state before the move (as suggested by the arrows). What this means is that the value of the earlier state is adjusted to be closer to the value of the later state.

If we let s de­note the state be­fore the greedy move, and s’ the state af­ter, then the up­date to the es­ti­mated value of s, de­noted V(s), can be writ­ten:

V(s) ← V(s) + α[V(s’) - V(s)]

where α is a small pos­i­tive frac­tion called the step-size pa­ram­e­ter, which in­fluences the rate of learn­ing. The up­date rule is an ex­am­ple of a tem­po­ral differ­ence learn­ing method, so called be­cause its changes are based on a differ­ence… be­tween [value] es­ti­mates at two differ­ent times.

...if the step-size parameter is reduced properly over time, this method converges, for any [unchanging] opponent, to the true probabilities of winning from each state given optimal play by the agent.

And that’s how a sim­ple ver­sion of tem­po­ral differ­ence (TD) re­in­force­ment learn­ing works.
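To make the update rule concrete, here is a minimal sketch that applies TD backups along a toy chain of abstract states ending in a win. The state names and the single replayed episode are invented for illustration; the update itself is exactly V(s) ← V(s) + α[V(s′) − V(s)].

```python
alpha = 0.1        # step-size parameter
V = {'win': 1.0}   # state -> estimated probability of winning; terminal win = 1

def value(state):
    return V.setdefault(state, 0.5)  # unseen states start at an uninformed 0.5

def td_update(s, s_next):
    """Back up: move V(s) a fraction alpha toward V(s')."""
    V[s] = value(s) + alpha * (value(s_next) - value(s))

# Replay one winning trajectory of abstract states (invented for illustration).
episode = ['start', 'mid', 'almost_won', 'win']
for _ in range(200):
    for s, s_next in zip(episode, episode[1:]):
        td_update(s, s_next)

# The win's value has propagated backward along the chain: every state on
# the path is now estimated at nearly probability 1 of winning.
print(all(V[s] > 0.95 for s in episode))  # True
```

Notice that no state is ever told its true value; each state only chases the estimate of its successor, and the certain value of the terminal state gradually propagates backward through the table.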

Re­in­force­ment Learn­ing and De­ci­sion Theory

You may have no­ticed a key ad­van­tage of re­in­force­ment learn­ing: an agent us­ing it can be ‘dumber’ than a de­ci­sion-the­o­retic agent. It can just start with guesses (“What the hell; let’s try 50%!”) for the value of var­i­ous states, and then it learns their true val­ues by run­ning through many, many tri­als.

But what if you don’t have many tri­als to run through, and you need to make an im­por­tant de­ci­sion right now?

Then you have to be smart. You need to have a good model of the world and use de­ci­sion the­ory to choose the ac­tion with the high­est ex­pected util­ity.

This is precisely the kind of situation in which rationality, in the sense of being good at building correct models of the world, is especially useful:

For some tasks, the world pro­vides rich, in­ex­pen­sive em­piri­cal feed­back. In these tasks you hardly need rea­son­ing. Just try the task many ways… and take care to no­tice what is and isn’t giv­ing you re­sults.

Thus, if you want to learn to sculpt, [study­ing ra­tio­nal­ity] is a bad way to go about it. Bet­ter to find some clay and a hands-on sculpt­ing course. The situ­a­tion is similar for small talk, cook­ing, sel­l­ing, pro­gram­ming, and many other use­ful skills.

Un­for­tu­nately, most of us also have goals for which we can ob­tain no such ready suc­cess/​failure data. For ex­am­ple, if you want to know whether cry­on­ics is a good buy, you can’t just try buy­ing it and not-buy­ing it and see which works bet­ter. If you miss your first bet, you’re out for good.

Re­in­force­ment learn­ing can be a good strat­egy if you have time to learn from many tri­als. If you’ve only got one shot at a prob­lem, you’d bet­ter build up a re­ally ac­cu­rate model of the world first and then try to max­i­mize ex­pected util­ity.

Now, back to our story.

It turns out that re­in­force­ment learn­ing seems to un­der­lie many of our men­tal pro­cesses. (More on this later.)

The les­son Yvain drew from this dis­cov­ery was:

Re­in­force­ment learn­ing is evolu­tion writ small; be­hav­iors prop­a­gate or die out based on their con­se­quences to re­in­force­ment in a mind, just as mu­ta­tions prop­a­gate or die out based on their con­se­quences to re­pro­duc­tion in an or­ganism. In the be­hav­iorist model, our mind is not an agent, but a flour­ish­ing ecosys­tem of be­hav­iors both phys­i­cal and men­tal, all scrab­bling for supremacy and mu­tat­ing into more effec­tive ver­sions of them­selves.

Just as evolv­ing or­ganisms are adap­ta­tion-ex­ecu­tors and not fit­ness-max­i­miz­ers, so minds are be­hav­ior-ex­ecu­tors and not util­ity-max­i­miz­ers.

But things are a bit more com­pli­cated than that, as we’ll now see.

The Turn to the Brain

I hes­i­tate to say that men will ever have the means of mea­sur­ing di­rectly the feel­ings of the hu­man heart. It is from the quan­ti­ta­tive effects of the feel­ings that we must es­ti­mate their com­par­a­tive amounts.

- William Jevons (1871)

It turns out that Jevons was wrong. Modern neu­ro­science al­lows us to peer into the black box of the hu­man value sys­tem and mea­sure di­rectly “the feel­ings of the hu­man heart.”14

We’ll begin with the experiments of Wolfram Schultz. Schultz recorded the activity of single dopamine neurons in monkeys who sat in front of a water spout. At irregular intervals, a speaker played a tone and a drop of water was released from the spout.15 The monkeys’ dopamine neurons normally fired at the baseline rate, but responded with a burst of activity when water was delivered. Over time, though, the neurons responded less and less to the water and more and more to the tone.

But if Schultz de­liv­ered wa­ter with­out first giv­ing the tone, then the dopamine neu­rons re­sponded with a burst of ac­tivity again. And if he played the tone and didn’t provide wa­ter, the neu­rons re­duced their firing rates be­low the baseline. The neu­rons weren’t re­spond­ing to the wa­ter it­self but to a differ­ence be­tween ex­pected re­ward and ac­tual re­ward — a re­ward pre­dic­tion er­ror (RPE).

Two other re­searchers, Read Mon­tague and Peter Dayan, no­ticed that these pat­terns of neu­ronal ac­tivity were ex­actly pre­dicted by TD re­in­force­ment learn­ing the­ory from com­puter sci­ence.16 In par­tic­u­lar, the RPE ob­served in neu­rons ap­peared to play the same role in mon­key learn­ing as the differ­ence be­tween value es­ti­mates at two differ­ent times did in TD re­in­force­ment learn­ing the­ory.
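A heavily simplified simulation shows the pattern Schultz observed. This sketch collapses the full TD model down to a single learned prediction attached to the tone (a Rescorla-Wagner-style simplification; the learning rate and reward magnitude are arbitrary):

```python
alpha = 0.2    # learning rate (arbitrary)
V_tone = 0.0   # learned prediction of reward following the tone

def trial(water_delivered):
    """One tone-then-(maybe)-water trial; returns the RPE at reward time."""
    global V_tone
    actual = 1.0 if water_delivered else 0.0
    rpe = actual - V_tone     # reward prediction error: actual minus expected
    V_tone += alpha * rpe     # learning is driven by the error itself
    return rpe

errors = [trial(True) for _ in range(30)]
print(round(errors[0], 2))     # 1.0  -> unexpected water: big positive RPE
print(round(errors[-1], 2))    # 0.0  -> fully predicted water: little response
print(round(trial(False), 2))  # -1.0 -> tone but no water: dip below baseline
```

The three printed numbers mirror the three experimental conditions: a burst for surprise reward, silence for predicted reward, and a dip below baseline for omitted reward.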

Since then, re­searchers have done many more sin­gle-neu­ron record­ing stud­ies to test par­tic­u­lar ver­sions of TD re­in­force­ment learn­ing and re­vise the the­ory un­til it pre­dicts more and more be­hav­ior while also pre­dict­ing novel ex­per­i­men­tal dis­cov­er­ies.

Caplin & Dean17 pro­vided an­other way to test the hy­poth­e­sis that dopamine neu­rons en­coded RPE in a TD-class model. They showed that all ex­ist­ing RPE-mod­els could be re­duced to three ax­io­matic state­ments. If a sys­tem vi­o­lated one of these ax­ioms, it could not be an RPE sys­tem. Later, Caplin et al. (2010) tested the ax­ioms on ac­tual brain ac­tivity to see if they held up. They did. This is an­other rea­son why so many sci­en­tists work­ing in this field be­lieve the cur­rent ‘dopamine hy­poth­e­sis’ — that dopamine neu­rons en­code RPE in a TD-class re­in­force­ment learn­ing sys­tem in the brain.

TD-class re­in­force­ment learn­ing works in com­put­ers by up­dat­ing num­bers that rep­re­sent the val­ues of states. How does re­in­force­ment learn­ing work when us­ing nerve cells?

Heb­bian Learning

By Heb­bian learn­ing, of course. “Cells that fire to­gether, wire to­gether.”

Imag­ine a neu­ral path­way (in one of Pavlov’s dogs) that con­nects the neu­ral cir­cuits that sense the ring­ing of a bell to the neu­ral cir­cuits for sal­i­va­tion. This is a weak con­nec­tion at first, which is why the bell doesn’t ini­tially elicit sal­i­va­tion.

Also imag­ine a third neu­ron that con­nects the sal­i­va­tion cir­cuit to a cir­cuit that de­tects food. This is a strong con­nec­tion, and that’s why food does elicit sal­i­va­tion right away:18

Don­ald Hebb pro­posed:

When an axon of cell A is near enough to ex­cite cell B and re­peat­edly or per­sis­tently take part in firing it, a growth pro­cess of metabolic change takes place in one or both cells such that A’s effi­cacy, as one of the cells firing B, is in­creased.19

In short, when­ever two con­nected cells are ac­tive at the same time, the synapses con­nect­ing them are strength­ened.

Con­sider Pavlov’s ex­per­i­ment. At first, the Bell cell will fire when­ever bells ring, but prob­a­bly not when the sal­i­va­tion cells hap­pen to be ac­tive. So, the con­nec­tion be­tween the Bell cell and the Sal­i­va­tion cell re­mains weak. But then, Pavlov in­ter­venes and causes the Bell cell and the Sal­i­va­tion cell to fire at the same time by ring­ing the bell and pre­sent­ing food at the same time (the Food de­tec­tor cell already has a strong con­nec­tion to the Sal­i­va­tion cell). When­ever the Bell cell and the Sal­i­va­tion cell hap­pen to fire at the same time, the synapse be­tween them is strength­ened. Once the con­nec­tion is strong enough, the Bell cell can cause the Sal­i­va­tion cell to fire on its own, just like the Food de­tec­tor cell can.
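Here is the same story as a toy simulation. The weights, threshold, and learning rate are invented for illustration; the point is just that repeatedly pairing the bell with food strengthens the bell-to-salivation synapse until the bell can trigger salivation on its own.

```python
eta = 0.2        # how much a coincidentally active synapse is strengthened
threshold = 1.0  # total input needed to make the salivation cell fire

# Synaptic weights onto the salivation cell: food starts strong, bell weak.
w = {'food': 1.0, 'bell': 0.1}

def salivation_fires(active_cells):
    return sum(w[c] for c in active_cells) >= threshold

def hebbian_step(active_cells):
    """Cells that fire together, wire together: if the salivation cell fires,
    strengthen the synapse from each input cell active at the same time."""
    if salivation_fires(active_cells):
        for c in active_cells:
            w[c] += eta

print(salivation_fires({'bell'}))  # False: the bell alone can't cause salivation

for _ in range(10):                # Pavlov's pairing trials: bell + food together
    hebbian_step({'bell', 'food'})

print(salivation_fires({'bell'}))  # True: the bell now triggers salivation itself
```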

It was a fine theory, but the mechanism wasn’t observed until Bliss & Lomo (1973) caught it at work in the rabbit hippocampus. Today, we know how some forms of Hebb’s mechanism work at the molecular level.20

Later, Wickens (1993) proposed a similar mechanism called the three-factor rule, according to which some synapses are strengthened whenever presynaptic and postsynaptic activity occur in the presence of dopamine. These same synapses may be weakened when activity occurs in the absence of dopamine. Subsequent studies confirmed this hypothesis.21
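Here is a toy sketch of the three-factor rule in Python. The update sizes and the all-or-nothing ‘dopamine’ flag are my own simplifications, not biological values:

```python
# Sketch of Wickens' three-factor rule: a synapse changes only when presynaptic
# and postsynaptic cells are both active, and the direction of change depends on
# whether dopamine is present. All parameters are invented for illustration.

def three_factor_update(weight, pre_active, post_active, dopamine,
                        ltp=0.1, ltd=0.05):
    if pre_active and post_active:
        if dopamine:
            weight += ltp  # strengthen: joint activity in the presence of dopamine
        else:
            weight -= ltd  # weaken: joint activity in the absence of dopamine
    return weight

w = 0.2  # an initially weak bell -> salivation synapse
# Pairing bell and food: both cells fire, and the unexpected food triggers dopamine.
for _ in range(8):
    w = three_factor_update(w, pre_active=True, post_active=True, dopamine=True)
print(round(w, 1))  # 1.0: the bell alone can now drive salivation
```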

Sup­pose a mon­key re­ceives an un­ex­pected re­ward and en­codes a large pos­i­tive RPE. Glim­cher ex­plains:

The TD model tells us that un­der these con­di­tions we want to in­cre­ment the value at­tributed to all ac­tions or sen­sa­tions that have just oc­curred. Un­der these con­di­tions, we know that the dopamine neu­rons re­lease dopamine through­out the fron­to­cor­ti­cal-basal gan­glia loops, and do so in a highly ho­moge­nous man­ner. That means we can think of any neu­ron equipped with dopamine re­cep­tors as ‘primed’ for synap­tic strength­en­ing. When this hap­pens, any seg­ment of the fron­to­cor­ti­cal-basal gan­glia loop that is already ac­tive will have its synapses strength­ened.22

We will re­turn to the dopamine sys­tem later, but for now let us back up and pur­sue the neo­clas­si­cal eco­nomic path into the brain.

Ex­pected Utility in Neurons

Ever since Fried­man (1953), economists have in­sisted that hu­mans only be­have as if they are util­ity max­i­miz­ers, not that they ac­tu­ally com­pute ex­pected util­ity and try to max­i­mize it.

It was a sur­prise, then, when neu­ro­scien­tists stum­bled upon the neu­rons that were en­cod­ing ex­pected util­ity in their firing rates.

Tanji & Evarts (1976) did their experiments with rhesus monkeys because they are among our closest relatives besides the apes, and this kind of work is usually forbidden on apes for ethical reasons (it requires implanting a recording electrode in the brain).

The mon­keys were trained to know that a col­ored light on the screen meant they would soon be offered a re­ward (a drop of wa­ter) ei­ther for push­ing or pul­ling, but not for both. This was the ‘ready’ cue. A sec­ond later, re­searchers gave a ‘di­rec­tion’ cue that told the mon­keys which ac­tion — push or pull — was go­ing to be re­warded. The third cue was the ‘go’ sig­nal: if the mon­key made the pre­vi­ously in­di­cated move­ment, it was re­warded.

This is what they saw:

At the ‘ready’ cue, the neurons associated with a pushing motion became weakly active (firing above the baseline rate), and so did the neurons associated with a pulling motion. When the ‘direction’ cue was given, the neurons associated with the to-be-rewarded motion doubled their firing rate, and the neurons associated with the opposite motion fell back to the baseline rate. Then at the ‘go’ cue, the neurons associated with the to-be-rewarded movement increased their firing again rapidly, up past the threshold required to produce movement, and the movement was produced shortly thereafter.

One tempt­ing ex­pla­na­tion of the data is that af­ter the ‘ready’ cue, the mon­key’s brain ‘de­cides’ there’s a 50% chance that pul­ling will get the re­ward, and a 50% chance that push­ing will get the re­ward. That’s why we see the neu­ron firing rates as­so­ci­ated with those two ac­tions each jump to slightly less than 50% of the move­ment thresh­old when the ‘ready’ cue is given. But then, when the ‘di­rec­tion’ cue is given, those ex­pec­ta­tions shift to 100%/​0% or 0%/​100%, de­pend­ing on which ac­tion is about to be re­warded ac­cord­ing to the ‘di­rec­tion’ cue. That’s why ac­tivity in the cir­cuit as­so­ci­ated with the to-be-re­warded ac­tion dou­bles and the other one drops to baseline. And then the ‘go’ cue is de­liv­ered and firing rates blast past the move­ment thresh­old, and move­ment is pro­duced.
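If that explanation is right, we can caricature the firing-rate data with a toy model in which each motor circuit fires at baseline plus a gain times the probability that its action will be rewarded. The baseline and gain numbers below are invented:

```python
# Toy reading of the Tanji & Evarts data: suppose each motor circuit's firing
# rate encodes the expected utility (here, probability of reward) of its action.
# Baseline and gain are made-up numbers for illustration only.

def firing_rate(p_reward, baseline=5.0, gain=60.0):
    return baseline + gain * p_reward

# 'ready' cue: the reward is equally likely to go to push or pull
ready = (firing_rate(0.5), firing_rate(0.5))
# 'direction' cue: push will be rewarded, pull will not
direction = (firing_rate(1.0), firing_rate(0.0))

print(ready)      # (35.0, 35.0): both circuits weakly active above baseline
print(direction)  # (65.0, 5.0): the cued circuit jumps; the other falls to baseline
```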

Let’s jump ahead to Basso & Wurtz (1997), who did a similar ex­per­i­ment ex­cept that they used vol­un­tary eye move­ments (called ‘sac­cades’) in­stead of vol­un­tary arm move­ments. And this time, they pre­sented each mon­key with one, two, four, or eight pos­si­ble tar­gets, in­stead of just two tar­gets (push and pull) like Tanji & Evarts did.

What they found was that as more po­ten­tial tar­gets were pre­sented, the mag­ni­tude of the prepara­tory ac­tivity as­so­ci­ated with each tar­get sys­tem­at­i­cally de­creased. And again, once the ‘di­rec­tion’ and ‘go’ cues were pre­sented, the ac­tivity as­so­ci­ated with those other po­ten­tial tar­gets dropped rapidly and ac­tivity burst rapidly in neu­rons as­so­ci­ated with the to-be-re­warded move­ment. It was as though the mon­keys’ brains were dis­tribut­ing their prob­a­bil­ity mass evenly across the po­ten­tially re­warded ac­tions, and then once they knew which ac­tion should in fact be re­warded, they moved all their prob­a­bil­ity mass to that ac­tion and performed the ac­tion and got the re­ward.

Real-Time Ex­pected Utility Updates

Other researchers showed monkeys a black screen with flickering white dots on it. In each frame of the video, the computer moved each dot in a random direction. The independent variable was a measure called ‘coherence.’ In a 100% leftward coherence condition, all dots moved to the left. In a 60% rightward condition, 60% of the dots moved rightward while the rest moved randomly. And so on.

In a typical experiment, the researchers would identify a neuron in a monkey’s brain that increased its firing rate in response to rightward coherence of the dots, and decreased its firing rate in response to leftward coherence of the dots. Then they would present the monkey with a sequence (in random order) of every possible leftward and rightward coherence condition.

A leftward coherence (of any magnitude) meant the monkey would be rewarded for leftward eye movement, and a rightward coherence meant the monkey would be rewarded for rightward eye movement. But the monkey had to wait two seconds before being rewarded.

In this ex­per­i­ment, the prob­a­bil­ities always started at 50% but then up­dated con­tin­u­ously. A 100% right­ward co­her­ence con­di­tion al­lowed the mon­key to very quickly know which vol­un­tary eye move­ment would be re­warded, but in a 5% right­ward co­her­ence con­di­tion the ex­pected util­ity of the right­ward tar­get grew more slowly.

The re­sults? The greater the co­her­ence of right­ward mo­tion of the dots, the faster the neu­rons as­so­ci­ated with right­ward eye move­ment in­creased their firing rate. (A higher co­her­ence meant the mon­key was able to up­date its prob­a­bil­ities more quickly.)
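A crude way to see why higher coherence means faster updating is to model the decision variable as accumulating a drift (proportional to coherence) plus noise on every frame. All of the numbers here are illustrative:

```python
import random

# Sketch of evidence accumulation under the random-dot task: each frame nudges a
# decision variable toward 'right' in proportion to coherence, plus noise.
# Threshold and noise level are invented; real LIP firing is far messier.

def frames_to_threshold(coherence, threshold=50.0, noise=1.0, seed=0):
    rng = random.Random(seed)
    evidence, frames = 0.0, 0
    while evidence < threshold:
        evidence += coherence + rng.gauss(0.0, noise)  # drift plus noise per frame
        frames += 1
    return frames

# Higher coherence means a faster climb to the movement threshold.
print(frames_to_threshold(1.0) > frames_to_threshold(5.0))  # True
```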

Argmax and Reser­va­tion Price

Many stud­ies show that the brain con­trols move­ment by way of a ‘win­ner take all’ mechanism that is iso­mor­phic to the argmax op­er­a­tion from eco­nomics.23 That is, there are many pos­si­bil­ities com­pet­ing for your fi­nal choice, but just be­fore your choice the sin­gle strongest sig­nal re­mains af­ter all the oth­ers are in­hibited.

This choice mechanism was in­ves­ti­gated in more de­tail by Michael Shadlen and oth­ers.24 Shadlen gave mon­keys the same eye move­ment task as above, ex­cept that the mon­keys could make their choice at any time in­stead of wait­ing for two sec­onds. He found that:

  1. When the di­rec­tion of the dots is un­am­bigu­ous, mon­keys make their choices quickly.

  2. As the di­rec­tion of the dots be­comes more am­bigu­ous, they take longer to make their choices.

  3. Through­out the ex­per­i­ment, the firing rates of neu­rons in the LIP (part of the ‘fi­nal com­mon path’ for gen­er­at­ing eye move­ment) grew to­ward a spe­cific thresh­old level.

The thresh­old level acts as a kind of crite­rion of choice. Once the crite­rion is met, ac­tion is taken. Or in eco­nomic terms, the mon­keys seemed to set a reser­va­tion price on mak­ing cer­tain move­ments.25
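The two modes can be sketched like this (the threshold and firing rates are invented for illustration):

```python
# Two ways to read the same choice circuit: argmax picks the strongest signal
# once a choice is forced; a reservation-price mechanism acts as soon as any
# signal crosses a fixed criterion. Numbers are invented for illustration.

def argmax_choice(firing_rates):
    """Winner-take-all: the single strongest signal survives inhibition."""
    return max(firing_rates, key=firing_rates.get)

def reservation_choice(firing_rates, threshold=60.0):
    """Act only when some option's signal exceeds the criterion; else keep waiting."""
    above = {action: r for action, r in firing_rates.items() if r >= threshold}
    return argmax_choice(above) if above else None

rates = {"left": 42.0, "right": 55.0}
print(argmax_choice(rates))       # 'right': forced choice picks the stronger signal
print(reservation_choice(rates))  # None: nothing has crossed the criterion yet
rates["right"] = 64.0
print(reservation_choice(rates))  # 'right': the criterion is met, so act
```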

Ran­dom Utility

When de­cid­ing be­tween goods of differ­ent ex­pected util­ities, hu­mans ex­hibit a stochas­tic trans­fer func­tion:

Consider a human subject choosing between two objects of highly different expected utilities, such as a first lottery with a 50% chance of winning $5 and a second lottery with a 25% chance of winning $5. We observe highly deterministic behavior under these conditions: basically all subjects always choose the 50% chance of winning $5. But what happens when we increment the value of the 25% lottery? As the amount one stands to win from that lottery is incremented, individual subjects eventually switch their preference. Exactly when they make that switch depends on their idiosyncratic degree of risk aversion. What is most interesting about this behavior for these purposes, though, is that actual human subjects, when presented with this kind of choice repeatedly, are never completely deterministic. As the value of the 25% lottery increases, they begin to show probabilistic behavior — selecting the 25% lottery sometimes, but not always.26

Our behavior has an element of randomness in it. Daniel McFadden won a Nobel Prize in economics for capturing such behavior using a random utility model.27 The way he did it was to suppose that when a chooser asks himself what a thing is worth, he doesn’t get a fixed answer but a variable one. That is, there is actual variation in his preferences. Thus, his expected utility for a particular lottery is drawn from a distribution of possible utilities, usually one with a Gaussian variance.28

This be­hav­ior makes sense when we think about the hu­man choice mechanism at the neu­ronal level, be­cause neu­ron firing rates are stochas­tic.29 When a neu­ro­biol­o­gist says “The neu­ron was firing at 200 Hz,” what she means is that the mean firing rate of the neu­ron over a long time and sta­ble con­di­tions would have been close to 200 Hz. So the neu­rons that en­code util­ity (wher­ever they are) will ex­hibit stochas­tic­ity, and thereby in­tro­duce some ran­dom­ness into our choices. In this way, neu­ro­biolog­i­cal data con­strains our eco­nomic mod­els of hu­man be­hav­ior. An eco­nomic model with­out some ran­dom­ness in it will have difficulty cap­tur­ing hu­man choices for as long as hu­mans run on neu­rons.30
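A random utility model of this kind is easy to sketch. The mean utilities and noise level below are invented; the point is that choices become stochastic as the options’ means get close:

```python
import random

# Sketch of a McFadden-style random utility model: each option's utility is
# drawn from a Gaussian around its mean, and the chooser picks the larger draw.
# Means and noise level are invented for illustration.

def choose(mean_a, mean_b, noise=0.5, rng=random):
    u_a = rng.gauss(mean_a, noise)  # a noisy draw, like a stochastic firing rate
    u_b = rng.gauss(mean_b, noise)
    return "A" if u_a > u_b else "B"

rng = random.Random(1)
# Far-apart means: nearly deterministic. Close means: probabilistic choice.
far = sum(choose(2.5, 1.25, rng=rng) == "A" for _ in range(1000)) / 1000
near = sum(choose(1.30, 1.25, rng=rng) == "A" for _ in range(1000)) / 1000
print(far > 0.9)         # True: almost always picks the clearly better option
print(0.2 < near < 0.8)  # True: near indifference, choice looks stochastic
```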


Louie & Glimcher (2010) examined temporal discounting in the brain. The two monkeys in this study were repeatedly asked to choose between a small, immediately available reward and a larger reward available after a short delay. For example, on one day they were asked to choose between 0.13 milliliters of juice right now, or else 0.2 milliliters of juice available after a delay of 2, 4, 8, or 12 seconds. A monkey might be willing to wait 2, 4, or 8 seconds for the larger reward, but not 12 seconds.

After many, many mea­sure­ments of this kind, Louie and Glim­cher were able to de­scribe the dis­count­ing func­tion be­ing used by each mon­key. (One of them was more im­pa­tient than the other.)

Moreover, the neurons in the relevant section of the brain fired at rates that reflected each monkey’s discounting function. If 0.2 milliliters of juice was offered with no delay, the neurons were highly active. If the same reward was offered at a delay of 2 seconds, they were slightly less active. If the same reward was offered after 4 seconds, the neurons were less active still. And so on. As it turned out, the discounting function that captured their choices was identical to the discounting function that captured the firing rates of these neurons.

This shouldn’t be a sur­prise at this point, but just to con­firm: Yes, we can ob­serve dis­count­ing in the firing rates of neu­rons in­volved in the choice-mak­ing pro­cess.
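To see what fitting a discounting function looks like, here is a sketch using a hyperbolic form (a standard choice in this literature; the source doesn’t specify the exact form Louie & Glimcher used, and the k values below are invented to stand in for a patient and an impatient monkey):

```python
# Sketch of hyperbolic discounting: the subjective value of a reward falls as
# its delay grows, and larger k means more impatience. k values are invented.

def discounted_value(amount, delay, k):
    """Hyperbolic discounting of `amount` (ml of juice) delayed by `delay` seconds."""
    return amount / (1.0 + k * delay)

k_patient, k_impatient = 0.03, 0.25
# The choice from the study: 0.13 ml now vs. 0.2 ml after a 12-second delay.
print(discounted_value(0.2, 12, k_patient) > 0.13)    # True: this monkey waits
print(discounted_value(0.2, 12, k_impatient) > 0.13)  # False: this one doesn't
```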

Rel­a­tive and Ab­solute Utility

Dor­ris & Glim­cher (2004) ob­served mon­keys and their choice mechanism neu­rons while the mon­keys en­gaged in re­peated plays of the in­spec­tion game. The study is too in­volved for me to ex­plain here, but the re­sults sug­gested that choice mechanism neu­rons en­code rel­a­tive ex­pected util­ities (rel­a­tive to other ac­tions un­der con­sid­er­a­tion) rather than ab­solute ex­pected util­ities.

Tobler et al. (2005) sug­gested that the brain only en­codes rel­a­tive ex­pected util­ities. But there is rea­son to sus­pect this can’t be right. If we stored only rel­a­tive ex­pected util­ities, then we would rou­tinely vi­o­late the ax­iom of tran­si­tivity (if you pre­fer A to B and B to C, you can’t also pre­fer C to A). To see why this is the case, con­sider Glim­cher’s ex­am­ple (he says ‘ex­pected value’ in­stead of ‘util­ity’):

...con­sider a sub­ject trained to choose be­tween ob­jects A and B, where A is $1,000,000 worth of goods and B is $100,000 worth of goods… A sys­tem that rep­re­sented only the rel­a­tive ex­pected sub­jec­tive value of A and B would rep­re­sent SV(A) > SV(B). Next, con­sider train­ing the same sub­ject to choose be­tween C and D, where C is $1,000 worth of goods and D is $100 worth of goods. Such a sys­tem would rep­re­sent SV(C) > SV(D). What hap­pens when we ask a chooser to se­lect be­tween B and C? For a chooser who rep­re­sents only rel­a­tive ex­pected sub­jec­tive value, the choice should be C: she should pick $1,000 worth of goods over $100,000 worth of goods be­cause it has a higher learned rel­a­tive ex­pected sub­jec­tive value. In or­der for our chooser to… con­struct tran­si­tive prefer­ences across choice sets (and to obey the con­ti­nu­ity ax­iom)… it is re­quired that some­where in the brain she rep­re­sent the ab­solute sub­jec­tive val­ues of her choices.31

And we mostly do seem to obey the ax­iom of tran­si­tivity.

So if the choice mechanism neu­rons do rep­re­sent rel­a­tive util­ities, then some other neu­rons el­se­where must en­code a more ab­solute form of util­ity. Other im­pli­ca­tions of this are ex­plored in the next sec­tion.
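Glimcher’s example is easy to run in code. The ‘relative’ chooser below remembers only how each option ranked within its own training set, and so picks the $1,000 bundle over the $100,000 bundle:

```python
# Glimcher's example in code: a chooser who stored only within-set relative
# values picks wrongly when options from different training sets are mixed.

absolute = {"A": 1_000_000, "B": 100_000, "C": 1_000, "D": 100}  # dollar values

# Relative encoding: each option only remembers how it ranked in its own set.
relative = {}
for choice_set in (("A", "B"), ("C", "D")):
    ranked = sorted(choice_set, key=absolute.get, reverse=True)
    for rank, option in enumerate(ranked):
        relative[option] = 1.0 - rank  # winner = 1.0, loser = 0.0

def pick(values, x, y):
    return x if values[x] > values[y] else y

print(pick(absolute, "B", "C"))  # 'B': $100,000 beats $1,000, as it should
print(pick(relative, "B", "C"))  # 'C': the relative-only encoder gets it wrong
```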


David Heeger showed32 that the firing rates of ‘fea­ture de­tec­tor’ neu­rons in the vi­sual cor­tex cap­tured a re­sponse to a fea­ture in the vi­sual field di­vided by the sum of the ac­ti­va­tion rates of nearby neu­rons sen­si­tive to the same image. Thus, these neu­rons en­code not only whether they ‘see’ the fea­ture they are built to de­tect, but also how unique it is in the vi­sual field.

The effect of this is that neu­rons re­act­ing to the edge of a vi­sual ob­ject fire more ac­tively than oth­ers do. Be­hold! Edge de­tec­tion!

It’s also an effi­cient way to en­code in­for­ma­tion about the world. Con­sider a world where or­ange dots are ubiquitous. For an an­i­mal in that world, it would be waste­ful to fire ac­tion po­ten­tials to rep­re­sent or­ange dots. Bet­ter to rep­re­sent the ab­sence of or­ange dots, or the tran­si­tion from or­ange dots to some­thing else. An op­ti­mally effi­cient en­cod­ing method would be sen­si­tive not to the ‘alpha­bet’ of all pos­si­ble in­puts, but to a smaller alpha­bet of the in­puts that ac­tu­ally ap­pear in the world. This in­sight was math­e­mat­i­cally for­mal­ized by Schwartz & Si­mon­celli (2001).

The efficiency of this normalization technique may explain why we’ve discovered it at work in so many different places in the brain.33 And given that we’ve found it almost everywhere we’ve looked for it, it wouldn’t be a surprise to see it show up in our choice-making circuits. Indeed, Schwartz & Simoncelli’s normalization equation may be what our brains use to encode expected utilities that are relative to the other choices under consideration.
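Here is a minimal sketch of divisive normalization in the style Heeger described. The saturation constant sigma is invented; note how a lone active unit (an ‘edge’) stands out while a uniform field is suppressed:

```python
# Sketch of divisive normalization: each unit's output is its input divided by
# a constant plus the summed activity of its neighborhood. sigma is a made-up
# saturation constant for illustration.

def normalize(inputs, sigma=1.0):
    total = sum(inputs)
    return [x / (sigma + total) for x in inputs]

# A uniform field: every unit's normalized response is small and equal.
print([round(r, 3) for r in normalize([10, 10, 10, 10])])  # [0.244, 0.244, 0.244, 0.244]
# An 'edge': the lone active unit stands out against quiet neighbors.
print([round(r, 3) for r in normalize([10, 0, 0, 0])])     # [0.909, 0.0, 0.0, 0.0]
```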

One im­pli­ca­tion of their equa­tion is that a chooser’s er­rors be­come more fre­quent as the size of the choice set grows. Thus, be­hav­ioral er­rors on small choice sets should be rarer than might be pre­dicted by most ran­dom util­ity mod­els, but er­ror rates will in­crease rapidly with choice set size (and be­yond a cer­tain choice set size, choices will ap­pear ran­dom).

Preliminary evidence that choice set size affects error rates has arrived from behavioral economics. For example, consider Iyengar & Lepper’s (2000) study of supermarket shoppers. They set up a table showing either 6 or 24 flavors of jam, allowing shoppers to sample as many as they wanted. Customers who saw 24 flavors had a 3% chance of buying a jar, while those who saw only 6 flavors had a 30% chance!

In an­other ex­per­i­ment, Iyen­gar & Lep­per let sub­jects choose one of ei­ther 6 or 30 differ­ent choco­lates. Those who chose from among only 6 op­tions were more satis­fied with their se­lec­tion than those who had been pre­sented with 30 differ­ent choco­lates.

These data fit our expectation that as the choice set grows, the frequency of errors in our behavior rises and the likelihood that an option will rise above the threshold for purchase drops. When Louie & Glimcher (2010) investigated this phenomenon in monkey choice mechanism neurons, they found it at work there, too. But the process of choice-set editing is still poorly understood, and some recent studies have failed to replicate Iyengar & Lepper’s results (Scheibehenne et al. 2010).

Per­haps the most sur­pris­ing im­pli­ca­tion of these find­ings is that be­cause of neu­ronal stochas­tic­ity, and be­cause er­rors in­crease as the choice set grows, we should ex­pect stochas­tic vi­o­la­tions of the in­de­pen­dence ax­iom, and that when choosers face very large choice sets they will es­sen­tially ig­nore the in­de­pen­dence ax­iom.

This is a pre­dic­tion about hu­man be­hav­ior not made by ear­lier mod­els from neo­clas­si­cal eco­nomics, but it is sug­gested by look­ing at the neu­rons in­volved in hu­man choice-mak­ing.

Are Ac­tions Choices?

But all these data come from ex­per­i­ments where the choices are ac­tions, and from our knowl­edge of the brain’s “fi­nal com­mon path” for pro­duc­ing ac­tions. How do ac­tions map on to choices about lovers and smart­phones?

Stud­ies by Greg Horow­itz have pro­vided some rele­vant data, be­cause mon­keys had to choose op­tions iden­ti­fied by color rather than by ac­tion.34 For ex­am­ple in one trial, a ‘red’ op­tion might offer one re­ward and a ‘green’ op­tion might offer a differ­ent re­ward. On each trial, the red and green op­tions would ap­pear at ran­dom places on the com­puter screen, and the mon­key could choose a re­ward with a vol­un­tary eye move­ment. The key here is that re­wards were cho­sen by color and not by a (par­tic­u­lar) ac­tion.

Horow­itz found that the choice mechanism neu­rons showed the same pat­tern of ac­ti­va­tion un­der these con­di­tions as was the case un­der ac­tion-based choice tasks.

So, it looks like the val­u­a­tion cir­cuits can store the value of a col­ored tar­get, and these val­u­a­tions can be mapped to the choice mechanism. But we don’t know much about how this works, yet.

The Pri­mate Choice Mechanism: A Brief Review

Thus far, we have mostly dis­cussed the pri­mate brain’s choice mechanism. To re­view:

  1. The choice cir­cuit re­sides in the fi­nal com­mon path­way for ac­tion.

  2. It takes as its input a signal that encodes stochastic expected utility, a concept aligned with the random utility term in economic models proposed by McFadden (2005) and Gul & Pesendorfer (2006).

  3. This in­put sig­nal is rep­re­sented by a nor­mal­ized firing rate (with Pois­son var­i­ance, like all neu­rons).

  4. As the choice set size grows, so does the er­ror rate.

  5. Fi­nal choice is im­ple­mented by an argmax func­tion or a reser­va­tion price mechanism. (A sin­gle cir­cuit can achieve both modes.35)

But how are prob­a­bil­ity and util­ity calcu­lated such that they can be fed into the ex­pected util­ity rep­re­sen­ta­tions of the choice mechanism? I won’t dis­cuss how the brain forms prob­a­bil­is­tic be­liefs in this ar­ti­cle,36 so let us turn to the study of how util­ity is calcu­lated in the brain: the ques­tion of val­u­a­tion.

Marginal Utility and Refer­ence Dependence

Con­sider the fol­low­ing story:

Imag­ine an an­i­mal ex­plor­ing a novel en­vi­ron­ment from a nest on a day when both (1) its blood con­cen­tra­tion is dilute (and thus its need for wa­ter is low) and (2) its blood sugar level is low (and thus its need for food is high). The an­i­mal trav­els west one kilo­me­ter from the nest and emerges from the un­der­growth into an open clear­ing at the shores of a large lake. Not very thirsty, the an­i­mal bends down to sam­ple the wa­ter and finds it… un­palat­able… the next day the same an­i­mal leaves its nest in the same metabolic state and trav­els one kilo­me­ter to the east, where it dis­cov­ers a grove of trees that yield a dry but nu­tri­tious fruit, a grove of dried apri­cot trees. It sam­ples the fruit and finds it sweet and highly palat­able.

What has the an­i­mal ac­tu­ally learned about the value of go­ing west and the value of go­ing east? It has had a weakly nega­tive ex­pe­rience, in the psy­cholog­i­cal sense, when go­ing west and a very pos­i­tive ex­pe­rience when go­ing east. Do these sub­jec­tive prop­er­ties of its ex­pe­rience in­fluence what it has learned? Do the stored rep­re­sen­ta­tions de­rived from these ex­pe­riences en­code the ac­tual ob­jec­tive val­ues of go­ing west and east, or do they en­code the sub­jec­tive ex­pe­riences? That is a crit­i­cal ques­tion about what the an­i­mal has learned, be­cause it de­ter­mines what it does when it wakes up thirsty. When it wakes up thirsty it should, in a nor­ma­tive sense, go west to­wards the… lake, de­spite the fact that its pre­vi­ous visit west was a nega­tive ex­pe­rience.37

Economists have known about this problem for a long time, and solved it with an idea called marginal utility.

In neo­clas­si­cal eco­nomics, we view the an­i­mal as hav­ing two kinds of ‘wealth’: a sugar wealth and a wa­ter wealth (the to­tal store of sugar and wa­ter in the an­i­mal’s body at a given time). A piece of fruit or a sip of wa­ter is an in­cre­ment in the an­i­mal’s to­tal sugar or wa­ter wealth. The util­ity of a piece of fruit or a sip of wa­ter, then, de­pends on its cur­rent lev­els of sugar and wa­ter wealth.

On day one, the an­i­mal’s need for wa­ter is low and its need for sugar is high. On that day, the marginal util­ity of a piece of fruit is greater than the marginal util­ity of a sip of wa­ter. But sup­pose dur­ing the next week the an­i­mal has a high blood sugar level. At that time, the marginal util­ity of a piece of fruit is low. Thus, the marginal util­ity of a con­sum­able re­source de­pends on wealth. The wealthier the chooser, the lower the marginal util­ity pro­vided by a fixed amount of gain (‘diminish­ing marginal util­ity’).

In neoclassical economics, the animal faced with the option of going east or west in the morning would first estimate how much the water and the fruit would change its objective wealth level, and then it would estimate how much those objective changes in wealth would change its utility. That is, it would use objective values to compute its marginal (subjective) utilities. If it only had access to the subjective experiences in our story, it couldn’t compute new marginal utilities when it finds itself unexpectedly thirsty.
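The standard way to get diminishing marginal utility is a concave utility-of-wealth function; log utility is a textbook choice, used here purely for illustration:

```python
import math

# Sketch of diminishing marginal utility: with a concave utility of wealth
# (log is a standard textbook choice), the same sip of water is worth more to
# a water-poor animal than to a water-rich one. Units are invented.

def marginal_utility(wealth, gain=1.0):
    """The utility added by one fixed-size gain, given current wealth."""
    return math.log(wealth + gain) - math.log(wealth)

# The thirstier (water-poorer) the animal, the more a sip of water is worth.
print(marginal_utility(2.0) > marginal_utility(20.0))  # True
```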

The prob­lem with this solu­tion is that the brain does not ap­pear to en­code the ob­jec­tive val­ues of stim­uli, and hu­mans be­hav­iorally don’t seem to re­spect the ob­jec­tive val­ues of op­tions ei­ther, as dis­cussed here.

In re­sponse to the be­hav­ioral ev­i­dence, Kah­ne­man & Tver­sky (1979) de­vel­oped a refer­ence de­pen­dent util­ity func­tion to de­scribe hu­man be­hav­ior: prospect the­ory. Their sug­ges­tion was, ba­si­cally:

Rather than com­put­ing marginal util­ities against [ob­jec­tive] wealth as in [stan­dard neo­clas­si­cal eco­nomic mod­els], util­ities (not marginal util­ities) could be com­puted di­rectly as de­vi­a­tions from a baseline level of wealth, and then choices could be based on di­rect com­par­i­sons of these util­ities rather than on com­par­i­sons of marginal util­ities. Their idea was to be­gin with some­thing like the chooser’s sta­tus quo, how much wealth he thinks he has. Each gam­ble is then rep­re­sented as the chance of win­ning or los­ing util­ities rel­a­tive to that sta­tus-quo-like refer­ence point.38

This fits with the neu­ro­biolog­i­cal fact that we en­code sig­nals from ex­ter­nal stim­uli rel­a­tive to refer­ence points, and don’t have ac­cess to the ob­jec­tive val­ues of stim­uli.
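For concreteness, here is a sketch of a Kahneman & Tversky-style value function, computed over gains and losses relative to a reference point. The exponent and loss-aversion parameter are the commonly cited estimates from Tversky & Kahneman (1992), used here only for illustration:

```python
# Sketch of a prospect-theory value function: outcomes are coded as gains or
# losses relative to a reference point, with losses weighted more heavily.
# alpha=0.88 and lam=2.25 are Tversky & Kahneman's commonly cited estimates.

def prospect_value(outcome, reference, alpha=0.88, lam=2.25):
    x = outcome - reference  # everything is relative to the reference point
    if x >= 0:
        return x ** alpha            # concave over gains
    return -lam * ((-x) ** alpha)    # convex and steeper over losses

# The same $50,000 salary feels different depending on what you expected:
print(prospect_value(50_000, reference=0) > 0)       # True: a gain from nothing
print(prospect_value(50_000, reference=60_000) < 0)  # True: a loss vs. expectations
```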

The ad­van­tage of the neo­clas­si­cal eco­nomic model is that it keeps a chooser’s choices con­sis­tent. The ad­van­tage of the refer­ence-de­pen­dent ap­proach is that it bet­ter fits hu­man be­hav­ior and hu­man neu­ro­biol­ogy.

Most neo­clas­si­cal economists seem to ig­nore the prob­lems for their the­o­ries that are pre­sented by refer­ence de­pen­dence in hu­man be­hav­ior and hu­man neu­ro­biol­ogy, but two neo­clas­si­cal economists at Berkeley, Matthew Rabin and Bo­tond Koszegi, have be­gun to take refer­ence de­pen­dence se­ri­ously. As they put it:

...while an unexpected monetary windfall in the lab may be assessed as a gain, a salary of $50,000 to an employee who expected $60,000 will not be assessed as a large gain relative to status-quo wealth, but rather as a loss relative to expectations of wealth. And in nondurable consumption — where there is no object with which the person can be endowed — a status-quo-based theory cannot capture the role of reference dependence at all: it would predict, for instance, that a person who misses a concert she expected to attend would feel no differently than somebody who never expected to see the concert.39

Their refer­ence-de­pen­dent model makes par­tic­u­lar pre­dic­tions:

[Our the­ory] shows that a con­sumer’s will­ing­ness to pay a given price for shoes de­pends on the prob­a­bil­ity with which she ex­pected to buy them and the price she ex­pected to pay. On the one hand, an in­crease in the like­li­hood of buy­ing in­creases a con­sumer’s sense of loss of shoes if she does not buy, cre­at­ing an “at­tach­ment effect” that in­creases her will­ing­ness to pay. Hence, the greater the like­li­hood she thought prices would be low enough to in­duce pur­chase, the greater is her will­ing­ness to buy at higher prices. On the other hand, hold­ing the prob­a­bil­ity of get­ting the shoes fixed, a de­crease in the price a con­sumer ex­pected to pay makes pay­ing a higher price feel like more of a loss, cre­at­ing a “com­par­i­son effect” that low­ers her will­ing­ness to pay the high price. Hence, the lower the prices she ex­pected among those prices that in­duce pur­chase, the lower is her will­ing­ness to buy at higher prices.

Thus, the cost of ac­cept­ing the hu­man fact of refer­ence-de­pen­dence is that we have to ad­mit that hu­mans are ir­ra­tional (in the sense of ‘ra­tio­nal­ity’ defined by the ax­ioms of re­vealed prefer­ence):

The fact that a con­sumer will pay more for shoes she ex­pected to buy than for shoes she did not ex­pect to buy, or that an an­i­mal would pre­fer in­fe­rior fruit it ex­pected to eat over su­pe­rior fruit it did not ex­pect to eat, is ex­actly the kind of ir­ra­tional be­hav­ior that we might hope the pres­sures of evolu­tion would pre­clude. What ob­ser­va­tions tell us, how­ever, is that these be­hav­iors do oc­cur. The neu­ro­science of sen­sory en­cod­ing tells us that these be­hav­iors are an in­escapable product of the fun­da­men­tal struc­ture of our brains.40

But re­ally, shouldn’t it have been ob­vi­ous all along that hu­mans are ir­ra­tional? Per­haps it is, to ev­ery­one but neo­clas­si­cal economists and Aris­tote­leans. (Okay, enough teas­ing...)

One thing to keep in mind is that the brain en­codes in­for­ma­tion about the ex­ter­nal world in a refer­ence-de­pen­dent way be­cause that method makes a more effi­cient use of neu­rons. So evolu­tion traded away some ra­tio­nal­ity for greater effi­ciency in the en­cod­ing mechanism.

Valu­a­tion in the Brain

Back to dopamine. Earlier, we learned that the brain learns the values of its actions with a dopaminergic reward system that uses something like temporal difference (TD) reinforcement learning. This reward system updates the stored values for actions by generating a reward prediction error (RPE) from the difference between expected reward and experienced reward, and propagating this learning signal throughout relevant structures of the brain using the neurotransmitter dopamine. In particular, some synapses are strengthened whenever presynaptic and postsynaptic activity occur in the presence of dopamine, as proposed by Wickens (1993).

But we haven’t yet dis­cussed how util­ities for ac­tions are gen­er­ated in the first place, or how they are stored (in­de­pen­dent of the ex­pected util­ities rep­re­sented dur­ing the choice pro­cess). It feels like I gen­er­ally want ice cream a lit­tle bit and hot sex a lot more. Where is that in­for­ma­tion stored?

Dozens41 of fMRI stud­ies show that two brain re­gions in par­tic­u­lar are cor­re­lated with sub­jec­tive value: the ven­tral stri­a­tum and the me­dial pre­frontal cor­tex. Other stud­ies sug­gest that at least five more brain re­gions prob­a­bly also con­tribute to the val­u­a­tion pro­cess: the or­bitofrontal cor­tex, the dor­so­lat­eral pre­frontal cor­tex, the amyg­dala, the in­sula, and the an­te­rior cin­gu­late cor­tex.

There are many theories about how the human brain generates and stores utilities, but these theories are far more speculative and immature than everything else I’ve presented in this tutorial, so I won’t discuss them here. Instead, let us conclude with a summary of what neuroscientists know about the human brain’s motivational system, and what some of the greatest open questions are.

Sum­mary and Re­search Directions

Here’s what we’ve learned:

  • Utilities are real num­bers rang­ing from 0 to 1,000 that take ac­tion po­ten­tials per sec­ond as their nat­u­ral units. (By ‘util­ity’ here I don’t mean what’s usu­ally meant by the term, I just mean ‘util­ity’ for the pur­pose of pre­dict­ing choice by mea­sur­ing the firing rates of cer­tain pop­u­la­tions of neu­rons in the fi­nal com­mon path of the choice cir­cuit in the hu­man brain.)

  • Mean util­ities are mean firing rates of spe­cific pop­u­la­tions of neu­rons in the fi­nal com­mon path of hu­man choice cir­cuits.

  • Mean util­ities pre­dict choice stochas­ti­cally, similar to ran­dom util­ity mod­els from eco­nomics.

  • Utilities are en­coded car­di­nally in firing rates rel­a­tive to neu­ronal baseline firing rates. (This is op­posed to post-Pareto, or­di­nal no­tions of util­ity.)

  • The choice cir­cuit takes as its in­put a firing rate that en­codes rel­a­tive (nor­mal­ized) stochas­tic ex­pected util­ity.

  • As the choice set size grows, so does the er­ror rate.

  • Fi­nal choice is im­ple­mented by an argmax func­tion or a reser­va­tion price mechanism.

Paul Glim­cher lists42 the great­est open ques­tions in the field as:

  1. Where is util­ity stored and how does it get to the choice mechanism?

  2. How does the brain de­cide when it’s time to choose?

  3. What is the neu­ral mechanism that al­lows us to sub­sti­tute be­tween two goods at a cer­tain point?

  4. How are prob­a­bil­is­tic be­liefs rep­re­sented in the brain?

  5. Utility func­tions are state-de­pen­dent, so how do state and util­ity func­tion in­ter­act?

Later, we’ll ex­plore the im­pli­ca­tions of our find­ings for metaethics. As of Au­gust 2011, if you’ve read this then you prob­a­bly know more about how hu­man val­ues ac­tu­ally work than al­most ev­ery pro­fes­sional metaethi­cist on Earth. The gen­eral les­son here is that you can of­ten out-pace most philoso­phers sim­ply by read­ing what to­day’s lead­ing sci­en­tists have to say about a given topic in­stead of read­ing what philoso­phers say about it.


1 They are: Less Wrong Rationality and Mainstream Philosophy, Philosophy: A Diseased Discipline, On Being Okay with the Truth, The Neuroscience of Pleasure, The Neuroscience of Desire, How You Make Judgments: The Elephant and its Rider, Being Wrong About Your Own Subjective Experience, Intuition and Unconscious Learning, Inferring Our Desires, Wrong About Our Own Desires, Do Humans Want Things?, Not for the Sake of Pleasure Alone, Not for the Sake of Selfishness Alone, Your Evolved Intuitions, When Intuitions Are Useful, Cornell Realism, Railton’s Moral Reductionism (Part 1), Railton’s Moral Reductionism (Part 2), Jackson’s Moral Functionalism, Moral Reductionism and Moore’s Open Question Argument, and Are Deontological Moral Judgments Rationalizations?

2 Heading Toward: No-Nonsense Metaethics, What is Metaethics?, Conceptual Analysis and Moral Theory, and Pluralistic Moral Reductionism.

3 I tried something similar before, with Cognitive Science in One Lesson.

4 Glimcher (2010) offers the best coverage of the topic in a single book. Tobler & Kobayashi (2009) offer the best coverage in a single article.

5 The quotes in this section are from Churchland (1981).

6 Allen & Ng (2004).

7 This perspective goes back at least as far as Arnauld (1662), who wrote:

To judge what one must do to obtain a good or avoid an evil, it is necessary to consider not only the good and the evil in itself, but also the probability that it happens or does not happen: and to view geometrically the proportion that all these things have together.

8 In addition to Caplin & Leahy (2001), see Kreps & Porteus’ (1978, 1979) incorporation of the “utility of knowing”, Loomes & Sugden’s (1982) incorporation of “regret”, Gul & Pesendorfer’s (2001) incorporation of “the cost of self-control”, and Koszegi & Rabin’s (2007, 2009) incorporation of the “reference point”.

9 Friedman (1953).

10 See a review in Fox & Poldrack (2009).

11 For one difficulty with prospect theory, see Laury & Holt (2008).

12 Sutton & Barto (2008), p. 3. All quotes in this section are from the early pages of this book.

13 From Sutton & Barto (2008).

14 Much of the rest of this post is basically a summary and paraphrase of Glimcher (2010).

15 Mirenowicz & Schultz (1994).

16 Schultz et al. (1997).

17 Caplin & Dean (2007).

18 From Glimcher (2010).

19 Hebb (1949).

20 Malenka & Bear (2004).

21 Reynolds & Wickens (2002).

22 Glimcher (2010), p. 341.

23 Edelman & Keller (1996); Van Gisbergen et al. (1987).

24 Gold & Shadlen (2007); Roitman & Shadlen (2002).

25 Simon (1957).

26 Glimcher (2010), p. 215.

27 McFadden (2000). The behavior of gradually transitioning between two choices is described by Selten (1975).

28 For an arguably improved random utility model, see Gul & Pesendorfer (2006).

29 Dean (1983); Werner & Mountcastle (1963).

30 Unless some other feature of the brain turns out to ‘smooth out’ the stochasticity of neurons involved in valuation and choice-making.

31 Glimcher (2010).

32 Heeger (1992, 1993); Carandini & Heeger (1994); Simoncelli & Heeger (1998).

33 Carandini & Heeger (1994); Britten & Heuer (1999); Zoccolan et al. (2005); Louie & Glimcher (2010).

34 Horwitz & Newsome (2001a, 2001b, 2004).

35 Liu & Wang (2008).

36 But see Deneve (2009).

37 Glimcher (2010), p. 281.

38 Glimcher (2010), p. 283.

39 This quote and the next quote are from Koszegi & Rabin (2006).

40 Glimcher (2010), p. 292.

41 I won’t list them all here. For an overview, see Glimcher (2010), ch. 14.

42 Glimcher (2010), ch. 17. I’ve paraphrased his open questions. I also excluded his 6th question: What Is the Neural Organ for Representing Money?


Allais (1953). Le comportement de l’homme rationnel devant le risque: critique des postulats et axiomes de l’école américaine. Econometrica, 21: 503-546.

Allen & Ng (2004). Economic behavior. In Spielberger (ed.), Encyclopedia of Applied Psychology, Vol. 1 (pp. 661-666). Academic Press.

Arnauld (1662). Port-Royal Logic.

Basso & Wurtz (1997). Modulation of neuronal activity in superior colliculus by changes in target probability. Journal of Neuroscience, 18: 7519-7534.

Britten & Heuer (1999). Spatial summation in the receptive fields of MT neurons. Journal of Neuroscience, 19: 5074-5084.

Caplin & Dean (2007). Axiomatic neuroeconomics.

Caplin, Dean, Glimcher, & Rutledge (2010). Measuring beliefs and rewards: a neuroeconomic approach. Quarterly Journal of Economics, 125: 3.

Caplin & Leahy (2001). Psychological expected utility theory and anticipatory feelings. Quarterly Journal of Economics, 116: 55-79.

Carandini & Heeger (1994). Summation and division by neurons in primate visual cortex. Science, 264: 1333-1336.

Churchland (1981). Eliminative materialism and the propositional attitudes. The Journal of Philosophy, 78: 67-90.

Dean (1983). Adaptation-induced alteration of the relation between response amplitude and contrast in cat striate cortical neurons. Vision Research, 23: 249-256.

Deneve (2009). Bayesian decision making in two-alternative forced choices. In Dreher & Tremblay (eds.), Handbook of Reward and Decision Making (pp. 441-458). Academic Press.

Dorris & Glimcher (2004). Activity in posterior parietal cortex is correlated with the subjective desirability of an action. Neuron, 44: 365-378.

Edelman & Keller (1996). Activity of visuomotor burst neurons in the superior colliculus accompanying express saccades. Journal of Neurophysiology, 76: 908-926.

Fox & Poldrack (2009). Prospect theory and the brain. In Glimcher, Camerer, Fehr, & Poldrack (eds.), Neuroeconomics: Decision Making and the Brain (pp. 145-173). Academic Press.

Friedman (1953). Essays in Positive Economics. University of Chicago Press.

Glimcher (2010). Foundations of Neuroeconomic Analysis. Oxford University Press.

Gold & Shadlen (2007). The neural basis of decision making. Annual Review of Neuroscience, 30: 535-574.

Gul & Pesendorfer (2001). Temptation and self-control. Econometrica, 69: 1403-1435.

Gul & Pesendorfer (2006). Random expected utility. Econometrica, 74: 121-146.

Hebb (1949). The Organization of Behavior. Wiley & Sons.

Heeger (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9: 181-197.

Heeger (1993). Modeling simple-cell direction selectivity with normalized, half-squared linear operators. Journal of Neurophysiology, 70: 1885-1898.

Horwitz & Newsome (2001a). Target selection for saccadic eye movements: direction selective visual responses in the superior colliculus induced by behavioral training. Journal of Neurophysiology, 86: 2527-2542.

Horwitz & Newsome (2001b). Target selection for saccadic eye movements: prelude activity in the superior colliculus during a direction discrimination task. Journal of Neurophysiology, 86: 2543-2558.

Iyengar & Lepper (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 79: 995-1006.

Jevons (1871). The Theory of Political Economy. Macmillan and Co.

Kahneman & Tversky (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47: 263-291.

Koszegi & Rabin (2006). A model of reference-dependent preferences. Quarterly Journal of Economics, 121: 1133-1165.

Koszegi & Rabin (2007). Reference-dependent risk attitudes. American Economic Review, 97: 1047-1073.

Koszegi & Rabin (2009). Reference-dependent consumption plans. American Economic Review, 99: 909-936.

Kreps & Porteus (1978). Temporal resolution of uncertainty and dynamic choice theory. Econometrica, 46: 185-200.

Kreps & Porteus (1979). Dynamic choice theory and dynamic programming. Econometrica, 47: 91-100.

Laury & Holt (2008). Payoff scale effects and risk preference under real and hypothetical conditions. In Plott & Smith (eds.), Handbook of Experimental Economic Results, Vol. 1 (pp. 1047-1053). Elsevier Press.

Liu & Wang (2008). A common cortical circuit mechanism for perceptual categorical discrimination and veridical judgment. PLOS Computational Biology, 4: 1-14.

Loewenstein (1987). Anticipation and the valuation of delayed consumption. Economic Journal, 97: 666-684.

Loomes & Sugden (1982). Regret theory: An alternative theory of rational choice under uncertainty. Economic Journal, 92: 805-824.

Louie & Glimcher (2010). Separating value from choice: delay discounting activity in the lateral intraparietal area. Journal of Neuroscience, 30: 5498-5507.

Malenka & Bear (2004). LTP and LTD: an embarrassment of riches. Neuron, 44: 5-21.

Mirenowicz & Schultz (1994). Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72: 1024-1027.

Reynolds & Wickens (2002). Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks, 15: 507-521.

Roitman & Shadlen (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22: 9475-9489.

Scheibehenne, Greifeneder, & Todd (2010). Can there ever be too many options? A meta-analytic review of choice overload. Journal of Consumer Research, 37: 409-425.

Schultz, Dayan, & Montague (1997). A neural substrate of prediction and reward. Science, 275: 1593-1599.

Schwartz & Simoncelli (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4: 819-825.

Selten (1975). Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 4: 25-55.

Simoncelli & Heeger (1998). A model of neuronal responses in visual area MT. Vision Research, 38: 743-761.

Sutton & Barto (2008). Reinforcement Learning: An Introduction. MIT Press.

Tanji & Evarts (1976). Anticipatory activity of motor cortex neurons in relation to direction of an intended movement. Journal of Neurophysiology, 39: 1062-1068.

Tobler & Kobayashi (2009). Electrophysiological correlates of reward processing in dopamine neurons. In Dreher & Tremblay (eds.), Handbook of Reward and Decision Making (pp. 29-50). Academic Press.

Van Gisbergen, Opstal, & Tax (1987). Collicular ensemble coding of saccades based on vector summation. Neuroscience, 21: 651.

Werner & Mountcastle (1963). The variability of central neural activity in a sensory system, and its implications for central reflection of sensory events. Journal of Neurophysiology, 26: 958-977.

Wickens (1993). A Theory of the Striatum. Pergamon Press.

Zoccolan, Cox, & DiCarlo (2005). Multiple object response normalization in monkey inferotemporal cortex. Journal of Neuroscience, 25: 8150-8164.