On the Role of Counterfactuals in Learning

The following is a hypothesis regarding the purpose of counterfactual reasoning (particularly in humans). It builds on Judea Pearl’s three-rung Ladder of Causation (see below).

One important takeaway from this hypothesis is that counterfactuals really only make sense in the context of computationally bounded agents.

(Figure taken from The Book of Why [Chapter 1, p. 6].)

Summary

Counterfactuals provide initializations for use in MCMC sampling.

Preliminary Definitions

Association (model-free):

$\Pr(Y = y \mid X = x)$

Intervention/Hypothetical (model-based):

$\Pr(Y = y \mid \mathrm{do}(X = x))$

Counterfactual (model-based):

$\Pr(Y = y \mid \mathrm{do}(X = x), Y = y')$

In the counterfactual, we have already observed an outcome $y'$ but wish to reason about the probability of observing another outcome $y$ (possibly the same as $y'$) under $\mathrm{do}(X = x)$.
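To make the three definitions concrete, here is a minimal sketch that evaluates each of them by brute-force enumeration. The toy structural model (a hidden confounder `Z`, the noise probabilities, and the rule for `Y`) is a hypothetical example invented for illustration, not something taken from The Book of Why. The counterfactual is computed with the usual abduction, action, prediction recipe: keep only the noise settings consistent with the observed outcome, then rerun the model under the intervention.

```python
from itertools import product

# Hypothetical toy structural causal model (an assumption for illustration):
#   u_z, u_y are independent hidden noise bits
#   Z := u_z                 (unobserved confounder)
#   X := Z                   (unless X is set by an intervention)
#   Y := Z or (X and u_y)
P_UZ = {0: 0.5, 1: 0.5}
P_UY = {0: 0.6, 1: 0.4}


def run(u_z, u_y, do_x=None):
    """Evaluate the model for one noise setting; do_x overrides X if given."""
    z = u_z
    x = z if do_x is None else do_x
    y = int(z or (x and u_y))
    return x, y


def noise_worlds():
    """Yield every noise setting together with its prior probability."""
    for u_z, u_y in product(P_UZ, P_UY):
        yield (u_z, u_y), P_UZ[u_z] * P_UY[u_y]


def association(y, x):
    """Pr(Y = y | X = x): condition on X as it naturally occurs."""
    num = den = 0.0
    for u, p in noise_worlds():
        fx, fy = run(*u)
        if fx == x:
            den += p
            num += p * (fy == y)
    return num / den


def intervention(y, x):
    """Pr(Y = y | do(X = x)): force X, leave the noise prior untouched."""
    return sum(p for u, p in noise_worlds() if run(*u, do_x=x)[1] == y)


def counterfactual(y, x, y_obs):
    """Pr(Y = y | do(X = x), Y = y_obs): abduction, action, prediction."""
    num = den = 0.0
    for u, p in noise_worlds():
        if run(*u)[1] == y_obs:                     # abduction
            den += p
            num += p * (run(*u, do_x=x)[1] == y)    # action + prediction
    return num / den


print(round(association(1, 1), 3))        # 1.0
print(round(intervention(1, 1), 3))       # 0.7
print(round(counterfactual(1, 1, 0), 3))  # 0.4
```

In this toy model the three quantities differ (1.0, 0.7, and 0.4) because conditioning on the observed outcome changes what we believe about the hidden noise before the intervention is applied.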

Note: Below, I use the terms “model” and “causal network” interchangeably. Also, an “experience” is an observation of a causal network in action.

Assumptions

1. Real-world systems are highly complex, often with many causal factors influencing system dynamics.
2. Human minds are computationally bounded (in time, memory, and precision).
3. Humans do not naturally think in terms of continuous probabilities; they think in terms of discrete outcomes and their relative likelihoods.

Relevant Literature:

Lieder, F., Griffiths, T. L., Huys, Q. J., & Goodman, N. D. (2018). The anchoring bias reflects rational use of cognitive resources. Psychonomic bulletin & review, 25(1), 322-349.

Sanborn, A. N., & Chater, N. (2016). Bayesian brains without probabilities. Trends in cognitive sciences, 20(12), 883-893.

Theory

Claim 1.

From a notational perspective, in going from a hypothetical to a counterfactual, the generalization lies solely in the ability to reason about a concrete scenario starting from an alternative scenario (the counterfactual). In theory, given infinite computational resources, the do-operator can, on its own, reason forward about anything by considering only hypotheticals. Thus, a counterfactual would be an inadmissible object under such circumstances. (Perfect knowledge of the system is not required if one can specify a prior. All that is required is sufficient computational resources.)

Corollary 1.1.

Counterfactuals are only useful when operating with limited computational resources, where “limited” is defined relative to the agent doing the reasoning and the constraints they face (e.g., limited time to make a decision, inability to hold enough items in memory, or any combination of these constraints).

Corollary 1.2.

If model-based hypothetical reasoning (i.e. “simulating”) is a sufficient tool to resolve all human decisions, then all of our experiences/observations should go toward building a model that is as accessible and accurate as possible, given our computational limitations.

By Assumption 1, the vast majority of human decision-making theoretically consists in reasoning about a “large” number of causal interactions at once, where “large” here means an amount that is beyond the bounds of the human mind (Assumption 2). Thus, by Claim 1, we are in the regime where counterfactuals are useful. But in what way are they useful?

By Corollary 1.2, we wish to build a useful model based upon our experiences. A useful model is one that is as predictively accurate as possible while still being accessible (i.e. interpretable) by the human mind. Given that (1) a model is describable as data, (2) long-term memory is where our brains can store the most data, and (3) the maximal predictive accuracy of a model is a non-decreasing function of its description length, a maximally predictive model is one that is stored in our long-term memory. However, human working memory is limited in capacity relative to long-term memory.

Claim 2.

The above are competing factors: A more descriptive (and predictive) model (represented by more data) may fit in long-term memory, but due to a limited working memory, it may be inaccessible (at least in a way that leverages its full capabilities). Thus, attentional mechanisms are required to guide our retrieval of subcomponents of the full model to load into working memory.

Again, by Assumptions 1 and 2, our models are approximate: both inaccurate and incomplete. Thus, we wish to improve our models by integrating over all of our experiences. This amounts to computing the following posterior distribution:

$$\Pr(\text{causal network} \mid \text{experience}) = \frac{\Pr(\text{experience} \mid \text{causal network}) \times \Pr(\text{causal network})}{\Pr(\text{experience})}$$

By Assumption 3, humans cannot compute updates to their priors according to the above formula.

Claim 3.

Humans do something akin to MCMC sampling to approximate the above posterior. Because MCMC methods (e.g., Gibbs sampling, Metropolis-Hastings) systematically explore the space of models in a local and incremental manner (e.g., by conditioning on all but one variable in Gibbs sampling, or by taking local steps in model space in Metropolis-Hastings) AND only require reasoning via likelihood ratios (Assumption 3), we can overcome the constraints imposed by our limited working memory and still manage to update models that fit in long-term memory but not entirely in working memory.
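As a rough illustration of the "local steps plus likelihood ratios" idea, here is a minimal Metropolis-Hastings sketch over a tiny space of candidate causal networks. Everything specific in it is an assumption made for the example: a model is just a tuple of three bits saying which candidate edges are present, and `unnormalized_posterior` is a made-up score that decays with the number of wrong edges. The sampler only ever holds two models at a time and only ever compares them through a ratio, so the normalizing constant $\Pr(\text{experience})$ is never needed.

```python
import math
import random
from collections import Counter

# Hypothetical "best" set of edges; used only to define the toy score below.
TRUE_EDGES = (1, 0, 1)


def unnormalized_posterior(model):
    """Made-up score proportional to Pr(model | experience)."""
    mismatches = sum(a != b for a, b in zip(model, TRUE_EDGES))
    return math.exp(-2.0 * mismatches)


def propose(model, rng):
    """Local move: add or remove a single edge (a symmetric proposal)."""
    i = rng.randrange(len(model))
    flipped = list(model)
    flipped[i] = 1 - flipped[i]
    return tuple(flipped)


def metropolis_hastings(start, n_steps, rng):
    """Return how often each model was visited over n_steps."""
    current = start
    visits = Counter()
    for _ in range(n_steps):
        candidate = propose(current, rng)
        # Only a ratio of scores is needed, never a normalized probability.
        ratio = unnormalized_posterior(candidate) / unnormalized_posterior(current)
        if rng.random() < min(1.0, ratio):
            current = candidate
        visits[current] += 1
    return visits


N_STEPS = 20_000
rng = random.Random(0)
visits = metropolis_hastings(start=(0, 0, 0), n_steps=N_STEPS, rng=rng)
estimate = visits[TRUE_EDGES] / N_STEPS
print(f"estimated posterior mass on TRUE_EDGES: {estimate:.3f}")
```

With only eight models the exact posterior is trivial to enumerate, which makes it easy to check the estimate; the point of the sketch is that the same loop works unchanged when the model space is far too large to enumerate or to hold in memory at once.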

MCMC methods require initialization (i.e. a sample to start from).

Claim 4.

Counterfactuals provide this initialization. Given that our model is built up entirely of true samples of the world, our aim is to interpolate between these samples. (We don’t really have a prior at birth on the ground-truth causal network on which the world operates.) Thus, we can only trust our model with 100% credibility at observed samples. Furthermore, by Assumption 2, we are pressured to minimize time to convergence of any MCMC method. Hence, the best we can do is to begin the MCMC sampling procedure starting from a point that we know belongs in the support of the distribution (and likely in a region of high density).

From the Wikipedia article on the Metropolis-Hastings algorithm:

Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary.

Counterfactuals allow us to avoid the need for any costly burn-in phase.
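The effect of the starting point on burn-in can be seen in a small sketch. The setup is entirely hypothetical: the target is a made-up density over the integers peaked at 0, the "observed sample" is taken to be the point 0, and the "arbitrary" start is the point 50. The function counts how many random-walk Metropolis steps pass before the chain first enters the high-density region. This only illustrates why initialization matters; it is not a test of the hypothesis itself.

```python
import math
import random


def target(x):
    """Made-up unnormalized density over the integers, peaked at 0."""
    return math.exp(-abs(x) / 2.0)


def steps_until_typical(start, rng, radius=2, max_steps=100_000):
    """Count random-walk Metropolis steps taken before the chain first
    enters the high-density region |x| <= radius."""
    x = start
    for step in range(max_steps):
        if abs(x) <= radius:
            return step
        candidate = x + rng.choice((-1, 1))
        if rng.random() < min(1.0, target(candidate) / target(x)):
            x = candidate
    return max_steps


rng = random.Random(0)
from_observed = steps_until_typical(start=0, rng=rng)    # start at an observed sample
from_arbitrary = steps_until_typical(start=50, rng=rng)  # start far out in the tail
print(from_observed, from_arbitrary)
```

Starting at the observed point, the chain is already in the high-density region, so the count is 0. Starting at 50, the chain moves one unit at a time and therefore needs at least 48 steps before its samples can resemble the target, all of which would be discarded as burn-in.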