Protein Reinforcement and DNA Consequentialism

Followup to: Evolutionary Psychology

It takes hundreds of generations for a simple beneficial mutation to promote itself to universality in a gene pool. Thousands of generations, or even millions, to create complex interdependent machinery.

That’s some slow learning there. Let’s say you’re building a squirrel, and you want the squirrel to know locations for finding nuts. Individual nut trees don’t last for the thousands of years required for natural selection. You’re going to have to learn using proteins. You’re going to have to build a brain.

Protein computers and sensors can learn by looking, much faster than DNA can learn by mutation and selection. And yet (until very recently) the protein learning machines only learned in narrow, specific domains. Squirrel brains learn to find nut trees, but not to build gliders—as flying squirrel DNA is slowly learning to do. The protein computers learned faster than DNA, but much less generally.

How the heck does a double-stranded molecule that fits inside a cell nucleus come to embody truths that baffle a whole damn squirrel brain?

Consider the high-falutin’ abstract thinking that modern evolutionary theorists do in order to understand how adaptations increase inclusive genetic fitness. Reciprocal altruism, evolutionarily stable strategies, deterrence, costly signaling, sexual selection—how many humans explicitly represent this knowledge? Yet DNA can learn it without a protein computer.

There’s a long chain of causality whereby a male squirrel, eating a nut today, produces more offspring months later: Chewing and swallowing food, to digesting food, to burning some calories today and turning others into fat, to burning the fat through the winter, to surviving the winter, to mating with a female, to the sperm fertilizing an egg inside the female, to the female giving birth to an offspring that shares 50% of the squirrel’s genes.

With the sole exception of humans, no protein brain can imagine chains of causality that long, that abstract, and crossing that many domains. With one exception, no protein brain is even capable of drawing the consequential link from chewing and swallowing to inclusive reproductive fitness.

Yet natural selection exploits links between local actions and distant reproductive benefits. In wide generality, across domains, and through levels of abstraction that confuse some humans. Because—of course—the basic evolutionary idiom works through the actual real-world consequences, avoiding the difficulty of having a brain imagine them.

Naturally, this also misses the efficiency of having a brain imagine consequences. It takes millions of years and billions of dead bodies to build complex machines this way. And if you want to memorize the location of a nut tree, you’re out of luck.

Gradually DNA acquired the ability to build protein computers, brains, that could learn small modular facets of reality like the location of nut trees. To call these brains “limited” implies that a speed limit was tacked onto a general learning device, which isn’t what happened. It’s just that the incremental successes of particular mutations tended to build out into domain-specific nut-tree-mapping programs. (If you know how to program, you can verify for yourself that it’s easier to build a nut-tree-mapper than an Artificial General Intelligence.)
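
To make the comparison concrete, here is a minimal sketch, in Python, of the sort of nut-tree-mapper that parenthetical has in mind; the names and structure are my own illustration, not anything from the original. The whole “learner” is a list of remembered locations plus a nearest-neighbor lookup, with nothing in it that could generalize to any other problem.

```python
from math import dist

class NutTreeMapper:
    """Domain-specific learner: remembers nut-tree locations and nothing else."""

    def __init__(self):
        self.known_trees = []  # list of (x, y) coordinates seen so far

    def observe_tree(self, location):
        """Called whenever the squirrel stumbles across a nut tree."""
        self.known_trees.append(location)

    def nearest_tree(self, current_position):
        """Recall the closest remembered tree; no general reasoning involved."""
        if not self.known_trees:
            return None
        return min(self.known_trees, key=lambda tree: dist(tree, current_position))

# A lookup table and a nearest-neighbor query -- nothing remotely like
# a general intelligence.
mapper = NutTreeMapper()
mapper.observe_tree((3.0, 4.0))
mapper.observe_tree((10.0, 2.0))
print(mapper.nearest_tree((0.0, 0.0)))  # (3.0, 4.0)
```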

One idiom that brain-building DNA seems to have hit on, over and over, is reinforcement learning—repeating policies similar to policies previously rewarded. If a food contains lots of calories and doesn’t make you sick, then eat more foods that have similar tastes. This doesn’t require a brain that visualizes the whole chain of digestive causality.

Reinforcement learning isn’t trivial: You’ve got to chop up taste space into neighborhoods of similarity, and stick a sensor in the stomach to detect calories or indigestion, and do some kind of long-term potentiation that strengthens the eating impulse. But it seems much easier for evolution to hit on reinforcement learning than on a brain that accurately visualizes the digestive system, let alone a brain that accurately visualizes the reproductive consequences N months later.
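
As an illustration only, not a claim about actual neural wiring, here is a toy Python version of that idiom: tastes are vectors, the stomach sensor reports calories or sickness after the fact, and the eating impulse for a new taste is simply generalized from rewards earned by similar past tastes. All names here are invented for the example.

```python
import math

def taste_similarity(taste_a, taste_b):
    """Crude "neighborhoods of similar tastes": nearby taste vectors count more."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(taste_a, taste_b)))

class TasteReinforcer:
    """Reinforcement-learning cartoon: strengthen the eating impulse for tastes
    similar to tastes that previously paid off in calories. The 'brain' never
    models digestion; it only waits for the stomach sensor's verdict."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.memory = []  # (taste_vector, learned_value) pairs

    def eating_impulse(self, taste):
        """How strongly past rewards generalize to this taste."""
        return sum(value * taste_similarity(taste, past)
                   for past, value in self.memory)

    def digest(self, taste, calories, sick):
        """Stomach sensor reports back, well after the meal; reinforce or suppress."""
        reward = -1.0 if sick else calories
        self.memory.append((taste, self.learning_rate * reward))

# A sweet food pays off in calories; the impulse generalizes to similar tastes.
squirrel = TasteReinforcer()
squirrel.digest(taste=(0.9, 0.1), calories=2.0, sick=False)
print(squirrel.eating_impulse((0.8, 0.2)) > squirrel.eating_impulse((0.1, 0.9)))  # True
```

Note that the learner only ever sees the taste and the delayed reward; the long causal story connecting calories to reproduction appears nowhere in the code, which is exactly the point.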

(This efficiency does come at a price: If the environment changes, making food no longer scarce and famines improbable, the organisms may go on eating food until they explode.)

Similarly, a bird doesn’t have to cognitively model the airflow over its wings. It just has to track which wing-flapping policies cause it to lurch.

Why not learn to like food based on reproductive success, so that you’ll stop liking the taste of candy if it stops leading to reproductive success? Why don’t birds wait and see which wing-flapping policies result in more eggs, not just more stability?

Because it takes too long. Reinforcement learning still requires you to wait for the detected consequences before you learn.

Now, if a protein brain could imagine the consequences, accurately, it wouldn’t need a reinforcement sensor that waited for them to actually happen.

Put a food reward in a transparent box. Put the corresponding key, which looks unique and uniquely corresponds to that box, in another transparent box. Put the key to that box in another box. Do this with five boxes. Mix in another sequence of five boxes that doesn’t lead to a food reward. Then offer a choice of two keys, one which starts the sequence of five boxes leading to food, one which starts the sequence leading nowhere.

Chimpanzees can learn to do this. (Dohl 1970.) So consequentialist reasoning, backward chaining from goal to action, is not strictly limited to Homo sapiens.
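
For comparison, here is what backward chaining looks like once you can simply write it down: an illustrative Python sketch of the planning problem, not a model of the chimpanzee experiment itself, with every name invented for the example. A dozen lines suffice when the causal structure is handed to you explicitly; the hard part, for a protein brain, is representing that structure at all.

```python
def backward_chain(goal_item, box_contents, key_opens, offered_keys):
    """Consequentialist planning in miniature: start from the goal (the food)
    and chain backward until you reach an action available right now
    (picking up one of the offered keys)."""
    # Which box holds a given item?
    holder = {item: box for box, item in box_contents.items()}

    plan = []
    needed = goal_item
    while needed in holder:                  # the thing we need is still locked away
        box = holder[needed]
        key = next((k for k, b in key_opens.items() if b == box), None)
        if key is None:
            return None                      # nothing opens this box
        plan.append(key)
        if key in offered_keys:
            return list(reversed(plan))      # keys listed in the order they get used
        needed = key                         # the key itself becomes the new subgoal
    return None

# Five transparent boxes: key1 opens box1, which holds key2, ... and box5 holds the food.
box_contents = {"box1": "key2", "box2": "key3", "box3": "key4",
                "box4": "key5", "box5": "food"}
key_opens = {f"key{i}": f"box{i}" for i in range(1, 6)}
print(backward_chain("food", box_contents, key_opens, offered_keys={"key1"}))
# ['key1', 'key2', 'key3', 'key4', 'key5']
```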

But as far as I know, no non-primate species can pull that trick. And working with a few transparent boxes is nothing compared to the kind of high-falutin’ cross-domain reasoning you would need to causally link food to inclusive fitness. (Never mind linking reciprocal altruism to inclusive fitness.) Reinforcement learning seems to evolve a lot more easily.

When natural selection builds a digestible-calorie-sensor linked by reinforcement learning to taste, then the DNA itself embodies the implicit belief that calories lead to reproduction. So the long-term, complicated, cross-domain, distant link from calories to reproduction is learned by natural selection—it’s implicit in the reinforcement learning mechanism that uses calories as a reward signal.

Only short-term consequences, which the protein brains can quickly observe and easily learn from, get hooked up to protein learning. The DNA builds a protein computer that seeks calories, rather than, say, chewiness. Then the protein computer learns which tastes are caloric. (Oversimplified, I know. Lots of inductive hints embedded in this machinery.)

But the DNA had better hope that its protein computer never ends up in an environment where calories are bad for it… or where sexual pleasure stops correlating to reproduction… or where there are marketers that intelligently reverse-engineer reward signals…