# Counterfactuals, thick and thin

Summary: There’s a “thin” concept of counterfactual that’s easy to formalize and a “thick” concept that’s harder to formalize.

Suppose you’re trying to guess the outcome of a coinflip. You guess heads, and the coin lands tails. Now you can ask how the coin would have landed if you had guessed tails. The obvious answer is that it would still have landed tails. One way to think about this is that we have two variables, your guess A and the coin C, that are independent in some sense; so we can counterfactually vary A while keeping C constant.

But consider the variable B = A xor C. If we change A to tails and keep B the same, we conclude that if we had guessed tails, the coin would have landed heads!
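To make the two counterfactuals concrete, here’s a minimal sketch (my own encoding, with 0 = heads and 1 = tails) of the two ways of answering the question:

```python
# Two equally valid "thin" counterfactuals for the coin story.
# Encoding (mine): 0 = heads, 1 = tails.
guess, coin = 0, 1        # you guessed heads; the coin landed tails
xor = guess ^ coin        # the derived variable B = guess XOR coin

cf_guess = 1              # counterfactual: suppose you had guessed tails

# Counterfactual 1: hold the coin C constant.
coin_if_hold_coin = coin              # still tails

# Counterfactual 2: hold B = (guess XOR coin) constant.
coin_if_hold_xor = cf_guess ^ xor     # solves cf_guess ^ coin2 == xor

print(coin_if_hold_coin)  # 1: tails
print(coin_if_hold_xor)   # 0: heads!
```

Both computations are internally consistent; nothing in the joint distribution alone says which one is the “real” counterfactual.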

Now this is clearly silly. In real life, we have a causal model of the world that tells us that the first counterfactual is correct. But we don’t have anything like that for logical uncertainty; the best we have is logical induction, which just gives us a joint distribution. Given a joint distribution over A and C, there’s no reason to prefer holding C constant rather than holding A xor C constant. I want a thin concept of counterfactuals that includes both choices. Here are a few definitions, in increasing generality:

1. Given independent discrete random variables A and C, such that C is uniform, a thin counterfactual is a choice of permutation pi(a, a') of the values of C for every pair of values a, a' of A.

2. Given a joint distribution over A and C, a thin counterfactual is a random variable U independent of A, together with an isomorphism of probability spaces between (A, C) and (A, U) that commutes with the projection to A.

3. Given a probability space A and a probability kernel k from A to C, a thin counterfactual is a probability space U and a kernel f from A × U to C such that, for every a, averaging f(a, u) over u drawn from U recovers k(a).
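Definition 1 can be made concrete for the coin example. The sketch below uses my own notation (a family of permutations pi indexed by a pair of guess values); the point is that both “hold the coin constant” and “hold the xor constant” satisfy the definition:

```python
# Definition 1 for the coin example: a thin counterfactual is a family of
# permutations pi[(a, a2)] of the coin's values, one per pair of guesses.
# Encoding (mine): 0 = heads, 1 = tails.
VALUES = (0, 1)

# "Hold the coin C constant": every permutation is the identity.
hold_coin = {(a, a2): (lambda c: c)
             for a in VALUES for a2 in VALUES}

# "Hold B = A xor C constant": permute the coin by a ^ a2.
hold_xor = {(a, a2): (lambda c, a=a, a2=a2: c ^ a ^ a2)
            for a in VALUES for a2 in VALUES}

def counterfactual(pi, a, c, a2):
    """If we observed guess a and coin c, what coin would we have seen
    under the counterfactual guess a2, according to the family pi?"""
    return pi[(a, a2)](c)

# Observed: guessed heads (0), coin tails (1); counterfactual guess: tails (1).
print(counterfactual(hold_coin, 0, 1, 1))  # 1: coin still tails
print(counterfactual(hold_xor, 0, 1, 1))   # 0: coin "would have been" heads
```

Since C is uniform, each permutation preserves C’s distribution, which is what makes either family a legitimate thin counterfactual.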

There are often multiple choices of thin counterfactual. When we say that one of the thin counterfactuals is more natural or better than the others, we are using a thick concept of counterfactuals. Pearl’s concept of counterfactuals is a thick one. No one has yet formalized a thick concept of counterfactuals in the setting of logical uncertainty.
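For contrast, here is a rough sketch of how Pearl’s thick notion singles out one of the thin counterfactuals: a structural causal model fixes exogenous noise terms, and the abduction–action–prediction recipe then holds the coin (not the xor) constant. The toy model below is my own encoding, not anything from the post:

```python
# A toy structural causal model (my framing) for the coin story:
#   A := u_a   (your guess, from exogenous noise u_a)
#   C := u_c   (the coin, from independent exogenous noise u_c)
#   B := A xor C  (a derived variable)
# Pearl's counterfactual: abduction (recover the noise from the
# observation), action (intervene on A), prediction (recompute).

def model(u_a, u_c, do_a=None):
    a = u_a if do_a is None else do_a
    c = u_c                      # the coin does not depend on the guess
    return a, c, a ^ c           # (guess, coin, guess XOR coin)

# Observed world: guessed heads (0), coin tails (1).
u_a, u_c = 0, 1                  # abduction is trivial in this model
a, c, b = model(u_a, u_c)

# Counterfactual world: do(A = tails), same noise.
a2, c2, b2 = model(u_a, u_c, do_a=1)
print(c2)   # 1: the coin still lands tails
print(b2)   # 0: it's B = A xor C that changes, not C
```

Because the model declares C exogenous and B derived, the symmetry between the thin counterfactuals is broken; that extra structure is exactly what’s missing in the logical-uncertainty setting.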

• The question “how would the coin have landed if I had guessed tails?” seems to me like a reasonably well-defined physical question about how accurately you can flip a coin without having the result be affected by random noise such as someone saying “heads” or “tails” (as well as quantum fluctuations). It’s not clear to me what the answer to this question is, though I would guess that the coin’s counterfactual probability of landing heads is somewhere strictly between 0% and 50%.

• Oh, interesting. Would your interpretation be different if the guess occurred well after the coinflip (but before we get to see the coinflip)?

• I agree that this is a well-defined question, though not easily answered without knowing how guessing physically affects flipping the coin, reading the results (humans are notoriously prone to making mistakes like that), and so on. But I suspect that Nisan is asking something else, though I am not quite sure what. The post says

In real life, we have a causal model of the world that tells us that the first counterfactual is correct. But we don’t have anything like that for logical uncertainty; the best we have is logical induction, which just gives us a joint distribution.

I am not sure how physical uncertainty is different from logical uncertainty; maybe there are some standard examples that could help the uninitiated like myself.

• If we have an ordering over logical sentences, such that we can look at two sentences and determine at most one of (A is simpler than B), (B is simpler than A), then it seems natural to privilege the counterfactual that keeps the simpler term constant (and likely that this ordering is such that you never have to choose between counterfactuals at the same level of simplicity).

This doesn’t fully solve the problem: now I have a concept of thickness that’s predicated on an ordering, and the ordering is (in some sense) arbitrary for the reasons noted elsewhere (I could define B = A xor C as the ground term, which makes A = B xor C now a composite term). But it seems (to me) like the important thing is being able to build a model that doesn’t allow cyclical behavior at all. Afterwards, one can check to see whether or not the ordering matters (and if so, try to figure out the criteria that make for a good ordering), or view it as arbitrary in approximately the way that axiom sets are arbitrary.

• Thanks for this post, it’s really helpful. I would really like to understand the maths in this post; is there anywhere which describes it in more detail? In particular, I can’t follow:

• Why are probabilities being permuted?

• What kind of kernel are you referring to?

• The definition involving the permutation is a generalization of the example earlier in the post: for the counterfactual that holds B = A xor C constant, pi(a, a') is the identity when a = a' and swaps heads and tails when a ≠ a'. In general, if you observe A = a and C = c, then the counterfactual statement is that if you had observed A = a', then you would have also observed C = pi(a, a')(c).

I just learned about probability kernels thanks to user Diffractor. I might be using them wrong.

• I don’t know enough math to understand whether you’ve covered this in your examples, but here’s my intuition, in the form of typing without a lot of reflection or editing (okay, disclaimer over):

If we have two variables, A and C, and we’re considering A, C, and (A xor C), it sounds to me like we’ve privileged things arbitrarily in some sense… relabeling them A, B, and C, it’s clear that we could have pivoted to consider any two of them the “base” variables and the third the “xor’d” variable, so there should be no preferred counterfactual. It’s a loopy cause, a causal diagram that’s not a DAG, which doesn’t show up IRL. Like going back in time to kill grandpa.
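The relabeling symmetry described above can be checked directly; this trivial sketch just confirms that each of A, B, C is the xor of the other two, so no pair is distinguished as the “base” variables:

```python
# With B = A xor C, every variable in {A, B, C} is the XOR of the
# other two, so any two of them could serve as the "base" pair.
from itertools import product

def xor_symmetric(a, c):
    b = a ^ c
    # each variable is recoverable from the other two
    return a == b ^ c and c == a ^ b

all_symmetric = all(xor_symmetric(a, c) for a, c in product((0, 1), repeat=2))
print(all_symmetric)  # True
```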

But we often pretend they occur by abstracting time and saying steady-state is a thing (or steady-states, and we’re looking at the map of transitions), and then we get loops and start studying feedback and whatnot. But if you unpacked any of those loops you’d get a very, very repetitive DAG that looks a lot like the initial diagram copied over and over with one-way arrows from copy to copy.

Seems like there are three options to deal with {A, B, C}. They are isomorphic to each other, so in some sense we shouldn’t be able to say which counterfactuals to use. We could:

• do our modeling relative to a specified imposed ordering of all variables, which seems really hard, or

• somehow calculate all possible results and average over permutations, which seems either factorially harder or much easier depending on Math!, or

• assume there is hidden structure, that A, B, and C are abstractions atop a real DAG, and use a not-known-to-me mathematics of loopy causation to define something other than counterfactuals atop the variables, calling counterfactuals over A, B, C a sort of type error.

• I’m not following what you’re saying about loopy causation. How are you constructing this graph?

• That sounds about right to me. I think people have taken stabs at looking for causality-like structure in logic, but they haven’t found anything useful.