On the Role of Coun­ter­fac­tu­als in Learning

The fol­low­ing is a hy­po­thesis re­gard­ing the pur­pose of coun­ter­fac­tual reas­on­ing (par­tic­u­larly in hu­mans). It builds on Judea Pearl’s three-rung Lad­der of Caus­a­tion (see be­low).

One im­port­ant takeaway from this hy­po­thesis is that coun­ter­fac­tu­als really only make sense in the con­text of com­pu­ta­tion­ally bounded agents.

(Fig­ure taken from The Book of Why [Chapter 1, p. 6].)


Coun­ter­fac­tu­als provide ini­tial­iz­a­tions for use in MCMC sampling.

Pre­lim­in­ary Definitions

Association (model-free): $P(y \mid x)$

Intervention/Hypothetical (model-based): $P(y \mid do(x))$

Counterfactual (model-based): $P(y_x \mid x', y')$

In the counterfactual, we have already observed an outcome $y'$ but wish to reason about the probability of observing another outcome $y$ (possibly the same as $y'$) under $do(x)$.
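
To make the three rungs concrete, here is a minimal sketch on a toy structural causal model. Everything in it is an assumption made for illustration: the binary confounder `z`, the noise terms `nx` and `ny`, their probabilities, and the structural equations are hypothetical and not taken from The Book of Why. The model is small enough to enumerate exactly, so no sampling is involved.

```python
from itertools import product

# Hypothetical toy model: binary confounder z, treatment x, outcome y.
# These are P(variable = 1) for the exogenous variables (illustrative values).
P_Z, P_NX, P_NY = 0.5, 0.1, 0.1

def weight(z, nx, ny):
    """Prior probability of one setting of the exogenous variables."""
    pz = P_Z if z else 1 - P_Z
    pnx = P_NX if nx else 1 - P_NX
    pny = P_NY if ny else 1 - P_NY
    return pz * pnx * pny

def run(z, nx, ny, do_x=None):
    """Evaluate the structural equations, optionally forcing x (the do-operator)."""
    x = (z ^ nx) if do_x is None else do_x
    y = (x | z) ^ ny
    return x, y

# All eight settings of (z, nx, ny).
WORLDS = list(product((0, 1), repeat=3))

def association(y, x):
    """P(y | x): condition the observational distribution on seeing x."""
    num = sum(weight(*u) for u in WORLDS if run(*u) == (x, y))
    den = sum(weight(*u) for u in WORLDS if run(*u)[0] == x)
    return num / den

def intervention(y, x):
    """P(y | do(x)): force x in every world, keep the prior over the noise."""
    return sum(weight(*u) for u in WORLDS if run(*u, do_x=x)[1] == y)

def counterfactual(y, x, x_obs, y_obs):
    """P(y_x | x', y'): keep only worlds consistent with the observation,
    then force x in those worlds and read off y."""
    consistent = [u for u in WORLDS if run(*u) == (x_obs, y_obs)]
    den = sum(weight(*u) for u in consistent)
    num = sum(weight(*u) for u in consistent if run(*u, do_x=x)[1] == y)
    return num / den
```

In this toy model the same question, "how likely is y = 1 when x = 0?", gets three different answers: `association(1, 0)` is about 0.18, `intervention(1, 0)` is 0.5, and `counterfactual(1, 0, 1, 1)` is about 0.9. The counterfactual differs from the intervention only in that the observed outcome first narrows down which worlds are plausible.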

Note: Below, I use the terms “model” and “causal net­work” in­ter­change­ably. Also, an “ex­per­i­ence” is an ob­ser­va­tion of a causal net­work in ac­tion.

Assumptions


  1. Real-world sys­tems are highly com­plex, of­ten with many causal factors in­flu­en­cing sys­tem dy­nam­ics.

  2. Human minds are computationally bounded (in time, memory, and precision).

  3. Hu­mans do not nat­ur­ally think in terms of con­tinu­ous prob­ab­il­it­ies; they think in terms of dis­crete out­comes and their re­l­at­ive like­li­hoods.

Rel­ev­ant Lit­er­at­ure:

Lieder, F., Griffiths, T. L., Huys, Q. J., & Goodman, N. D. (2018). The anchoring bias reflects rational use of cognitive resources. Psychonomic Bulletin & Review, 25(1), 322-349.

Sanborn, A. N., & Chater, N. (2016). Bayesian brains without probabilities. Trends in Cognitive Sciences, 20(12), 883-893.


Claim 1.

From a nota­tional per­spect­ive, in go­ing from a hy­po­thet­ical to a coun­ter­fac­tual, the gen­er­al­iz­a­tion lies solely in the abil­ity to reason about a con­crete scen­ario start­ing from an al­tern­at­ive scen­ario (the coun­ter­fac­tual). In the­ory, given in­fin­ite com­pu­ta­tional re­sources, the do-op­er­ator can, on its own, reason for­ward about any­thing by con­sid­er­ing only hy­po­thet­ic­als. Thus, a coun­ter­fac­tual would be an in­ad­miss­ible ob­ject un­der such cir­cum­stances. (Per­fect know­ledge of the sys­tem is not re­quired if one can spe­cify a prior. All that is re­quired is suf­fi­cient com­pu­ta­tional re­sources.)

Corol­lary 1.1.

Counterfactuals are only useful when operating with limited computational resources, where “limited” is defined relative to the agent doing the reasoning and the constraints they face (e.g., limited time to make a decision, inability to hold enough items in memory, or any combination of these constraints).

Corol­lary 1.2.

If model-based hy­po­thet­ical reas­on­ing (i.e. “sim­u­lat­ing”) is a suf­fi­cient tool to re­solve all hu­man de­cisions, then all of our ex­per­i­ences/​ob­ser­va­tions should go to­ward build­ing a model that is as ac­cess­ible and ac­cur­ate as pos­sible, given our com­pu­ta­tional lim­it­a­tions.

By As­sump­tion 1, the vast ma­jor­ity of hu­man de­cision-mak­ing the­or­et­ic­ally con­sists in reas­on­ing about a “large” num­ber of causal in­ter­ac­tions at once, where “large” here means an amount that is bey­ond the bounds of the hu­man mind (As­sump­tion 2). Thus, by Claim 1, we are in the re­gime where coun­ter­fac­tu­als are use­ful. But in what way are they use­ful?

By Corollary 1.2, we wish to build a useful model based upon our experiences. A useful model is one that is as predictively accurate as possible while still being accessible (i.e. interpretable) to the human mind. Given that (1) a model is describable as data, (2) the most data can be stored in our brains in the form of long-term memory, and (3) the maximal predictive accuracy of a model is a non-decreasing function of its description length, it follows that a maximally predictive model is one that is stored in our long-term memory. However, human working memory is limited in capacity relative to long-term memory.

Claim 2.

The above are com­pet­ing factors: A more de­script­ive (and pre­dict­ive) model (rep­res­en­ted by more data) may fit in long-term memory, but due to a lim­ited work­ing memory, it may be in­ac­cess­ible (at least in a way that lever­ages its full cap­ab­il­it­ies). Thus, at­ten­tional mech­an­isms are re­quired to guide our re­trieval of sub­com­pon­ents of the full model to load into work­ing memory.

Again, by Assumptions 1 and 2, our models are approximate: both inaccurate and incomplete. Thus, we wish to improve our models by integrating over all of our experiences. This equates to computing the following posterior distribution:
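
Written out under assumed notation (the symbols $M$ for a candidate causal network and $e_1, \dots, e_n$ for the experiences are introduced here for illustration), the posterior over models would take the standard Bayesian form:

$$P(M \mid e_1, \dots, e_n) = \frac{P(e_1, \dots, e_n \mid M)\, P(M)}{\sum_{M'} P(e_1, \dots, e_n \mid M')\, P(M')}$$

The denominator sums over every candidate model $M'$, which is the part that becomes intractable when the space of causal networks is large.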

By As­sump­tion 3, hu­mans can­not com­pute up­dates to their pri­ors ac­cord­ing to the above for­mula.

Claim 3.

Humans do something akin to MCMC sampling to approximate the above posterior. MCMC methods (e.g., Gibbs sampling, Metropolis-Hastings) systematically explore the space of models in a local and incremental manner (e.g., by conditioning on all but one variable in Gibbs sampling, or by taking local steps in model space in Metropolis-Hastings), and they only require reasoning via likelihood ratios (Assumption 3). This lets us overcome the constraints imposed by our limited working memory and still manage to update models that fit in long-term memory but not entirely in working memory.
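
As a rough illustration of why this fits a bounded reasoner, here is a minimal Metropolis-Hastings sketch. The setup is an assumption chosen for simplicity, not a model of cognition: the "models" are candidate coin biases on a grid, the prior is uniform, and the function names are hypothetical. Each step holds only the current and the proposed model, and the only quantity computed is the ratio of their likelihoods, so the normalizing sum over all models is never needed.

```python
import random

def likelihood(theta, heads, tails):
    """Likelihood of the observed flips under a coin with bias theta."""
    return theta ** heads * (1 - theta) ** tails

def metropolis_hastings(heads, tails, start_index, steps, seed=0):
    """Sample coin biases from the posterior using local moves on a grid."""
    rng = random.Random(seed)
    grid = [i / 20 for i in range(1, 20)]  # candidate models: 0.05 ... 0.95
    i = start_index
    samples = []
    for _ in range(steps):
        j = i + rng.choice((-1, 1))  # local proposal: a neighbouring model
        if 0 <= j < len(grid):       # off-grid proposals are rejected
            # With a uniform prior, the posterior ratio is the likelihood ratio.
            ratio = likelihood(grid[j], heads, tails) / likelihood(grid[i], heads, tails)
            if rng.random() < ratio:
                i = j
        samples.append(grid[i])
    return samples

samples = metropolis_hastings(heads=7, tails=3, start_index=9, steps=20_000)
posterior_mean = sum(samples) / len(samples)
```

With 7 heads and 3 tails, `posterior_mean` should land near 2/3, the mean of the corresponding Beta(8, 4) posterior, although the exact value depends on the seed and the grid.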

MCMC meth­ods re­quire ini­tial­iz­a­tion (i.e. a sample to start from).

Claim 4.

Counterfactuals provide this initialization. Given that our model is built up entirely of true samples of the world, our aim is to interpolate between these samples. (We don’t really have a prior at birth on the ground-truth causal network on which the world operates.) Thus, we can only trust our model with 100% credibility at observed samples. Furthermore, by Assumption 2, we are pressured to minimize time to convergence of any MCMC method. Hence, the best we can do is to begin the MCMC sampling procedure starting from a point that we know belongs to the support of the distribution (and likely in a region of high density).

From the Wikipedia article on the Metropolis-Hastings algorithm:

Al­though the Markov chain even­tu­ally con­verges to the de­sired dis­tri­bu­tion, the ini­tial samples may fol­low a very dif­fer­ent dis­tri­bu­tion, es­pe­cially if the start­ing point is in a re­gion of low dens­ity. As a res­ult, a burn-in period is typ­ic­ally ne­ces­sary.

Coun­ter­fac­tu­als al­low us to avoid the need for any costly burn-in phase.
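
The effect of the starting point on burn-in can be sketched with a random-walk Metropolis sampler. The target (a standard normal), the step size, and the threshold `|x| < 2` for "typical" are all assumptions picked for illustration, and the function name is hypothetical. The sketch only shows the general burn-in phenomenon; it is not evidence for the hypothesis itself.

```python
import math
import random

def log_density(x):
    """Unnormalized log density of a standard normal target."""
    return -0.5 * x * x

def steps_until_typical(start, max_steps=10_000, seed=0):
    """Count random-walk Metropolis steps before the chain first has |x| < 2."""
    rng = random.Random(seed)
    x = start
    for step in range(max_steps):
        if abs(x) < 2.0:
            return step
        proposal = x + rng.uniform(-1.0, 1.0)  # local move of size at most 1
        log_ratio = log_density(proposal) - log_density(x)
        # Accept uphill moves always, downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
    return max_steps
```

Starting at a point already in the high-density region, `steps_until_typical(0.0)` returns 0. Starting far away, `steps_until_typical(50.0)` cannot return fewer than 48, because each step moves the chain by at most 1. Those wasted steps are the burn-in that a well-chosen initialization avoids.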