On the Role of Counterfactuals in Learning

The following is a hypothesis regarding the purpose of counterfactual reasoning (particularly in humans). It builds on Judea Pearl's three-rung Ladder of Causation (see below).

One important takeaway from this hypothesis is that counterfactuals really only make sense in the context of computationally bounded agents.

(Figure taken from The Book of Why [Chapter 1, p. 6].)


Counterfactuals provide initializations for use in MCMC sampling.

Preliminary Definitions

Association (model-free): P(y | x)

Intervention/Hypothetical (model-based): P(y | do(x))

Counterfactual (model-based): P(y_x | x', y')

In the counterfactual, we have already observed an outcome y' (under x') but wish to reason about the probability of observing another outcome y (possibly the same as y') under do(x).
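As a concrete illustration of the third rung, here is a toy structural causal model (the model Y = 2X + U and the observed values are hypothetical choices for this sketch), evaluated with Pearl's standard abduction/action/prediction recipe for counterfactuals:

```python
# Toy structural causal model (SCM): Y = 2*X + U, with exogenous noise U.
# The counterfactual question: having observed (x', y'), what would Y
# have been under do(X = x)?

def abduction(x_obs: float, y_obs: float) -> float:
    """Abduction: infer the exogenous noise U consistent with the observation."""
    return y_obs - 2 * x_obs  # invert Y = 2*X + U

def prediction(x_new: float, u: float) -> float:
    """Action + prediction: set X = x_new (the do-operation) and recompute Y."""
    return 2 * x_new + u

# Observed: x' = 1, y' = 3, so the abduced noise is U = 1.
u = abduction(1.0, 3.0)

# Counterfactual: "had X been 2 instead, Y would have been..."
y_cf = prediction(2.0, u)
print(y_cf)  # 5.0
```

Note that the observed outcome (x', y') pins down the otherwise-unobservable noise term, which is exactly what distinguishes the counterfactual from a plain intervention.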

Note: Below, I use the terms "model" and "causal network" interchangeably. Also, an "experience" is an observation of a causal network in action.


Assumptions:

  1. Real-world systems are highly complex, often with many causal factors influencing system dynamics.

  2. Human minds are computationally bounded (in time, memory, and precision).

  3. Humans do not naturally think in terms of continuous probabilities; they think in terms of discrete outcomes and their relative likelihoods.

Relevant Literature:

Lieder, F., Griffiths, T. L., Huys, Q. J., & Goodman, N. D. (2018). The anchoring bias reflects rational use of cognitive resources. Psychonomic Bulletin & Review, 25(1), 322-349.

Sanborn, A. N., & Chater, N. (2016). Bayesian brains without probabilities. Trends in Cognitive Sciences, 20(12), 883-893.


Claim 1.

From a notational perspective, in going from a hypothetical to a counterfactual, the generalization lies solely in the ability to reason about a concrete scenario starting from an alternative scenario (the counterfactual). In theory, given infinite computational resources, the do-operator can, on its own, reason forward about anything by considering only hypotheticals. Thus, a counterfactual would be an inadmissible object under such circumstances. (Perfect knowledge of the system is not required if one can specify a prior. All that is required is sufficient computational resources.)

Corollary 1.1.

Counterfactuals are only useful when operating with limited computational resources, where "limited" is defined relative to the agent doing the reasoning and the constraints they face (e.g., limited time to make a decision, inability to hold enough items in memory, or any combination of these constraints).

Corollary 1.2.

If model-based hypothetical reasoning (i.e., "simulating") is a sufficient tool to resolve all human decisions, then all of our experiences/observations should go toward building a model that is as accessible and accurate as possible, given our computational limitations.

By Assumption 1, the vast majority of human decision-making theoretically consists in reasoning about a "large" number of causal interactions at once, where "large" here means an amount that is beyond the bounds of the human mind (Assumption 2). Thus, by Claim 1, we are in the regime where counterfactuals are useful. But in what way are they useful?

By Corollary 1.2, we wish to build a useful model based upon our experiences. A useful model is one that is as predictively accurate as possible while still being accessible (i.e., interpretable) by the human mind. Given that: (1) a model is describable as data, (2) long-term memory is the brain's largest store of data, and (3) the maximal predictive accuracy of a model is a non-decreasing function of its description length, a maximally predictive model is one that is stored in our long-term memory. However, human working memory is limited in capacity relative to long-term memory.

Claim 2.

The above are competing factors: a more descriptive (and predictive) model (represented by more data) may fit in long-term memory, but due to limited working memory, it may be inaccessible (at least in a way that leverages its full capabilities). Thus, attentional mechanisms are required to guide our retrieval of subcomponents of the full model to load into working memory.

Again, by Assumptions 1 and 2, our models are approximate (both inaccurate and incomplete). Thus, we wish to improve our models by integrating over all of our experiences. Writing M for a model and e_1, …, e_n for our experiences, this equates to computing the following posterior distribution:

P(M | e_1, …, e_n) ∝ P(e_1, …, e_n | M) P(M)

By Assumption 3, humans cannot compute updates to their priors according to the above formula.

Claim 3.

Humans do something akin to MCMC sampling to approximate the above posterior. MCMC methods (e.g., Gibbs sampling, Metropolis-Hastings) systematically explore the space of models in a local and incremental manner (by conditioning on all but one variable in Gibbs sampling, or by taking local steps in model space in Metropolis-Hastings), and they only require reasoning via likelihood ratios (Assumption 3). This lets us overcome the constraints imposed by our limited working memory and still manage to update models that fit in long-term memory but not entirely in working memory.
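A minimal sketch of the Metropolis-Hastings update described above, using a one-dimensional stand-in for model space (the specific target density here is an illustrative assumption, not part of the hypothesis). The key point it demonstrates: the accept/reject decision only ever needs the likelihood ratio between the current point and a locally proposed one, never the normalizing constant or a global view of the space:

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings over a 1-D 'model space'.

    log_target need only be known up to an additive constant: the
    accept/reject decision below uses only the (log) likelihood RATIO
    between the proposed and current points.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)   # local, incremental move
        log_ratio = log_target(proposal) - log_target(x)
        if math.log(rng.random()) < log_ratio:     # accept with prob min(1, ratio)
            x = proposal
        samples.append(x)
    return samples

# Stand-in posterior: an unnormalized standard normal, log p(x) = -x^2 / 2.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=5000)
mean = sum(samples) / len(samples)
```

The sampler never evaluates the full posterior at once; each step touches only the current state and one neighbor, which is what makes the analogy to a capacity-limited working memory plausible.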

MCMC methods require an initialization (i.e., a sample to start from).

Claim 4.

Counterfactuals provide this initialization. Given that our model is built up entirely of true samples of the world, our aim is to interpolate between these samples. (We don't really have a prior at birth on the ground-truth causal network on which the world operates.) Thus, we can only trust our model with 100% credibility at observed samples. Furthermore, by Assumption 2, we are pressured to minimize the time to convergence of any MCMC method. Hence, the best we can do is to begin the MCMC sampling procedure from a point that we know belongs to the support of the distribution (and likely lies in a region of high density).

From the Metropolis-Hastings Wikipedia article:

Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary.

Counterfactuals allow us to avoid the need for a costly burn-in phase.
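To make the burn-in point concrete, here is a small illustrative experiment (the target density and starting points are assumptions chosen for the sketch): the same random-walk Metropolis-Hastings chain targeting a standard normal is started once from a high-density point, analogous to a counterfactual anchored at an observed sample, and once from deep in the tail. We count how many steps each chain needs before it first enters the high-density region [-3, 3]:

```python
import math
import random

def steps_to_reach_bulk(x0, seed=0, max_steps=10_000):
    """Run random-walk Metropolis-Hastings on an unnormalized standard
    normal target and return the number of steps taken before the chain
    first lands in the high-density region [-3, 3]."""
    rng = random.Random(seed)
    log_target = lambda x: -0.5 * x * x
    x = x0
    for step in range(max_steps):
        if -3.0 <= x <= 3.0:
            return step
        proposal = x + rng.gauss(0.0, 1.0)
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
    return max_steps

# Starting from an observed, high-density point: no burn-in needed.
fast = steps_to_reach_bulk(x0=0.0)
# Starting from a low-density region: many steps wasted on burn-in.
slow = steps_to_reach_bulk(x0=50.0)
print(fast, slow)
```

The chain started in the bulk needs zero burn-in steps, while the chain started in the tail must random-walk its way back before any of its samples are representative, which is exactly the cost a counterfactual initialization is hypothesized to avoid.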