Causal inference diary: skiing causes snow

I’ve been playing with causal inference lately, as one does.[1] I was thinking of writing a more formal sequence about how to do causal discovery and model comparison, and I might still do that.[2] Meanwhile, I’m going to start with a sort of informal diary of what I’m learning as I go.

I started a while ago, so this diary entry is coming partway into the process. I’ll go back and fill in the history if I write that formal sequence later. Here’s the little bit of backstory you need for today’s episode to make sense.



I had a data set about diabetes that I'd found in an ML data repository and had already played with a lot over the last few weeks. I extracted a set of variables from it that seemed maybe causally connected and ran a bunch of analyses on them. In the end I was disappointed with that set; they all just seemed very entangled, and I did not find any causal graph, or set of graphs, that seemed clearly better than the rest.

I started again with a different set of variables. This time, I picked ones that I believed were more likely to tell a story about the data—ideally, a story with at least one branch point, where the value of a variable would have downstream effects on other variables. Indeed, this time I did get clusters of causal graphs that fit the data far better than the rest, and in a way that made sense to me. Hurray!

Today’s tasks were to:

  • Refactor my code to work over any data set, with separate configuration, DAG-generation, DAG-scoring, and results-reporting modules;

  • Add some code to plot DAG fit over all possible DAGs, so I could see the clusters and identify what changed;

  • Create a synthetic test data set over a known DAG and run my code over it, to make sure everything was working right.

The code refactor was dull in the best possible way and went just fine. I won’t tell you about it.

Plotting DAG scores

When I started this morning, my code was printing out the total log evidence for the best causal model, and also showing me which edges in the DAG were contributing how much. It was also showing me the top 20 DAGs and their scores, so I had some sense for how the scores were falling off as I looked at worse causal models. The output looked like this:

Oriented graphs checked: 59049
Acyclic DAGs found: 29281

Top 20 DAGs
rank log_evidence delta_from_best num_edges dag
1 −325462.096600 0.000000 9 MED->MS; MED->PROC; MED->TIME; PD->MED; PD->MS; PD->PROC; PD->TIME; TIME->MS; TIME->PROC
2 −325473.205359 −11.108759 9 MED->MS; MED->TIME; PD->MED; PD->MS; PD->PROC; PD->TIME; PROC->MED; PROC->TIME; TIME->MS
3 −325475.142194 −13.045594 9 MED->MS; MED->PROC; PD->MED; PD->MS; PD->PROC; PD->TIME; TIME->MED; TIME->MS; TIME->PROC
4 −325475.185055 −13.088455 9 MED->MS; MED->PROC; MED->TIME; PD->MED; PD->MS; PD->PROC; PD->TIME; PROC->TIME; TIME->MS
5 −325475.710778 −13.614178 9 MS->MED; MS->TIME; PD->MS; PD->PROC; PD->TIME; PROC->MED; PROC->MS; PROC->TIME; TIME->MED
6 −325477.040158 −14.943557 9 MS->MED; MS->PROC; MS->TIME; PD->MS; PD->PROC; PD->TIME; PROC->MED; PROC->TIME; TIME->MED
7 −325479.612751 −17.516151 9 MED->MS; MED->PD; MED->PROC; MED->TIME; PD->MS; PD->PROC; PD->TIME; TIME->MS; TIME->PROC
8 −325483.293488 −21.196888 9 MED->MS; MED->TIME; PD->MED; PD->MS; PD->TIME; PROC->MED; PROC->PD; PROC->TIME; TIME->MS
9 −325485.798907 −23.702306 9 MS->MED; MS->TIME; PD->MS; PD->TIME; PROC->MED; PROC->MS; PROC->PD; PROC->TIME; TIME->MED
10 −325490.956496 −28.859896 9 MED->MS; MED->PROC; PD->MED; PD->MS; PD->PROC; TIME->MED; TIME->MS; TIME->PD; TIME->PROC
11 −325492.701207 −30.604607 9 MED->MS; MED->PD; MED->PROC; MED->TIME; PD->MS; PD->PROC; PD->TIME; PROC->TIME; TIME->MS
12 −325497.198861 −35.102261 9 MS->MED; MS->PD; MS->PROC; MS->TIME; PD->PROC; PD->TIME; PROC->MED; PROC->TIME; TIME->MED
13 −325515.245995 −53.149395 9 MS->MED; PD->MS; PD->PROC; PD->TIME; PROC->MED; PROC->MS; PROC->TIME; TIME->MED; TIME->MS
14 −325515.929506 −53.832906 9 MED->MS; PD->MED; PD->MS; PD->PROC; PD->TIME; PROC->MED; PROC->TIME; TIME->MED; TIME->MS
15 −325516.851315 −54.754714 9 MS->MED; MS->PROC; MS->TIME; PD->MS; PD->PROC; PD->TIME; PROC->MED; TIME->MED; TIME->PROC
16 −325523.301977 −61.205376 9 MS->MED; PD->MS; PD->PROC; PD->TIME; PROC->MED; PROC->MS; TIME->MED; TIME->MS; TIME->PROC
17 −325523.985487 −61.888887 9 MED->MS; PD->MED; PD->MS; PD->PROC; PD->TIME; PROC->MED; TIME->MED; TIME->MS; TIME->PROC
18 −325525.334124 −63.237524 9 MS->MED; PD->MS; PD->TIME; PROC->MED; PROC->MS; PROC->PD; PROC->TIME; TIME->MED; TIME->MS
19 −325526.017635 −63.921035 9 MED->MS; PD->MED; PD->MS; PD->TIME; PROC->MED; PROC->PD; PROC->TIME; TIME->MED; TIME->MS
20 −325526.647310 −64.550710 9 MED->PROC; MED->TIME; MS->MED; MS->TIME; PD->MED; PD->MS; PD->PROC; PD->TIME; TIME->PROC

================================================================================
BEST DAG
================================================================================
MED->MS; MED->PROC; MED->TIME; PD->MED; PD->MS; PD->PROC; PD->TIME; TIME->MS; TIME->PROC

Total log evidence: −325462.096600

Local score contributions by node:
PD | parents=() score=-48378.393146
MS | parents=('MED', 'PD', 'TIME') score=-78008.395650
TIME | parents=('MED', 'PD') score=-67446.975997
MED | parents=('PD',) score=-79906.821619
PROC | parents=('MED', 'PD', 'TIME') score=-51721.510188
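For context on the two counts at the top of that output: with five variables there are 10 unordered pairs, and each pair can be absent, pointed one way, or pointed the other, giving 3^10 = 59049 oriented graphs, of which 29281 turn out to be acyclic. A brute-force enumerator along these lines reproduces those counts (this is a sketch with my own function names, not my actual code):

```python
from itertools import combinations, product

def is_acyclic(nodes, edges):
    """Kahn's algorithm: repeatedly remove nodes with no incoming edges;
    the graph is acyclic iff every node gets removed."""
    indeg = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for a, b in edges:
        indeg[b] += 1
        children[a].append(b)
    frontier = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while frontier:
        n = frontier.pop()
        removed += 1
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:
                frontier.append(c)
    return removed == len(nodes)

def enumerate_dags(nodes):
    """Assign each unordered pair one of three states (no edge, a->b, b->a)
    and keep only the acyclic results."""
    pairs = list(combinations(nodes, 2))
    dags = []
    for orientation in product((0, 1, 2), repeat=len(pairs)):
        edges = []
        for (a, b), o in zip(pairs, orientation):
            if o == 1:
                edges.append((a, b))
            elif o == 2:
                edges.append((b, a))
        if is_acyclic(nodes, edges):
            dags.append(edges)
    return dags
```

For five nodes this checks all 59049 orientations and keeps the 29281 acyclic ones, matching the summary lines above.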

I didn’t love this output. I couldn’t really get a sense of what was going on there. I wanted to better understand the space of all causal models over each set of variables.

I decided to plot the log evidence over all of the models to see what that looked like. Here’s what it looked like for that first set of variables, the ones I suspected were pretty entangled.

[Figure: log evidence plotted over all DAGs for the first variable set, sorted from best to worst.]

I’m new at this so I’m still figuring out how to interpret what I’m looking at, but this sure looked like there was not much of a causal story here. Some DAGs were better than others but they were all in one big lump, smoothly degrading in score as the nodes got shuffled around.
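The plot itself is nothing fancy: just every DAG's log evidence, sorted best first, so clusters of similarly good causal models show up as flat steps. A minimal matplotlib sketch (function and axis names are mine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def plot_dag_scores(scores, show=True):
    """Plot the log evidence of every DAG, best first, and return the
    sorted scores. Clusters appear as plateaus; cluster boundaries as drops."""
    ordered = sorted(scores, reverse=True)
    plt.plot(range(len(ordered)), ordered)
    plt.xlabel("DAG rank (best first)")
    plt.ylabel("log evidence")
    if show:
        plt.show()
    return ordered
```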

Next I tried it on the second set of variables, the ones I expected to tell a much better causal story, and sure enough, the plot got way more interesting.

[Figure: log evidence plotted over all DAGs for the second variable set, showing distinct clusters, with cluster labels annotated in red.]

What we’re seeing here is that the first six thousand or so DAGs all did pretty well, but then that cluster was played out, and the next eight thousand or so DAGs all performed about the same as each other, and so on.

So… what did that first cluster have in common?

I added some analysis and labeling to my graph, shown there in red. DAGs in the best-performing cluster tended to have ATYPE and SOURCE as parents to MS, whereas DAGs in the second best cluster had MS without a parent. Also, in the first cluster, ATYPE tended to have MS and SOURCE as parents, whereas in the second cluster, ATYPE only had SOURCE as a parent.

Obviously this set of characteristics is mutually incompatible. These are DAGs. You can’t have ATYPE with MS as a parent and also MS with ATYPE as a parent. That’s fine; this is just analysis over all six thousand top-scoring DAGs. What it’s really saying is that MS, ATYPE, and SOURCE were all entangled with each other and needed to be at the top of the causal graph together, and when you broke that relationship, you got worse DAGs. Versions of that same story were the largest factor in every step down. There were probably other issues, not involving the entanglement of those three variables, that explained why the steps were so far apart from each other, but the entanglement swamped the other issues and didn’t show up in the cluster labeling.
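The cluster labeling amounts to counting how often each directed edge appears among the DAGs in one score cluster, then comparing those frequencies across clusters. A small sketch of that analysis (hypothetical names, not my original code):

```python
from collections import Counter

def edge_frequencies(ranked_dags, lo, hi):
    """Given DAGs sorted best-first (each a list of (parent, child) edges),
    return the fraction of DAGs in ranks [lo, hi) containing each edge."""
    counts = Counter()
    for edges in ranked_dags[lo:hi]:
        counts.update(edges)
    n = hi - lo
    return {edge: c / n for edge, c in counts.items()}

def cluster_contrast(ranked_dags, cluster_a, cluster_b):
    """Edges whose frequency changes most between two clusters, i.e. the
    structural features that distinguish one plateau from the next."""
    fa = edge_frequencies(ranked_dags, *cluster_a)
    fb = edge_frequencies(ranked_dags, *cluster_b)
    edges = set(fa) | set(fb)
    return sorted(edges, key=lambda e: abs(fa.get(e, 0) - fb.get(e, 0)),
                  reverse=True)
```

Running `cluster_contrast` on the boundary between two plateaus surfaces the edges (like ATYPE/SOURCE parenting MS) that the better cluster has and the worse cluster lacks.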

The next step here would be to throw away one or two of those three entangled variables and see what causal graphs we can get with only the remaining data, but I was happy enough with my DAG testing and scoring code at this point that I let it go and moved on.

Synthetic test data generation

Next up, I made a synthetic data set with a known DAG. The generator for my data set looks like this:

[Figure: the known ground-truth DAG used by the synthetic data generator.]

The first version of the story went like this.

I wake up in the morning and check my MOOD. If I feel social (50% chance), then I check if a FRIEND is available that day (40% chance). If there’s a friend, and if the WEATHER is snowy (10% chance), then we do the ACTIVITY of skiing. If there’s a friend and it’s sunny (60% chance), then we have a picnic. If it’s rainy, or if there’s no friend, or if I feel focused, then I read books. Meanwhile, if it’s snowy, or 40% of the time when it’s rainy, I want to wear a SWEATER. And if I feel social then there’s an 80% chance of MUSIC, but if I feel focused then there’s only a 20% chance.
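If it helps to see that story as code, here’s a sketch of such a generator (my own names and structure, not the actual generator; the probabilities are the ones from the story, with MOOD, WEATHER, and FRIEND as independent roots):

```python
import random

def sample_day(rng):
    """One synthetic observation from the story's ground-truth DAG."""
    # Independent root variables.
    mood = "social" if rng.random() < 0.5 else "focused"
    friend = rng.random() < 0.4
    weather = rng.choices(["snowy", "sunny", "rainy"],
                          weights=[0.1, 0.6, 0.3])[0]

    # ACTIVITY depends on MOOD, FRIEND, and WEATHER.
    if mood == "social" and friend and weather == "snowy":
        activity = "skiing"
    elif mood == "social" and friend and weather == "sunny":
        activity = "picnic"
    else:
        activity = "reading"

    # SWEATER depends only on WEATHER.
    if weather == "snowy":
        sweater = True
    elif weather == "rainy":
        sweater = rng.random() < 0.4
    else:
        sweater = False

    # MUSIC depends only on MOOD.
    music = rng.random() < (0.8 if mood == "social" else 0.2)

    return {"MOOD": mood, "FRIEND": friend, "WEATHER": weather,
            "ACTIVITY": activity, "SWEATER": sweater, "MUSIC": music}

rng = random.Random(0)
data = [sample_day(rng) for _ in range(10_000)]
```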

Then I generated 10k data points from that distribution, and tried to fit DAGs to the observed data to see what would happen.

Plotting DAGs for the synthetic data: skiing causes snow

My first run of this didn’t go great. It turned out that my data was too clean. A lot of cells were zero (you never see focused, sunny, and skiing together) and some had pretty small numbers (even social, friend available, and snowy is a rare combo). Having such a clean, deterministic data set caused a bunch of problems that I am not going to summarize here because it gets too into the weeds.

(Claude, my trusty assistant, seemed genuinely quite distressed that we were modeling the data as demonstrating that skiing caused snow. I thought for a while that it was wrong to be distressed, because some causal models are Markov-equivalent; e.g. there’s no way to tell if SWEATERs cause WEATHER or the other way around, in this data set, and that’s fine. But Claude was right to be a bit upset about it; the ACTIVITY part of the graph should in theory be fully recoverable.)

So I tried adding some noise to the data. I said that 90% of the time, I’d do the intended activity, but 10% of the time I’d pick one of the other two activities at random. Sometimes you just find yourself skiing in the rain even though you feel focused today, y’know?
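A sketch of that noise step, assuming each observation stores its ACTIVITY value (names are mine): with probability 0.1, swap the intended activity for one of the other two, chosen uniformly.

```python
import random

ACTIVITIES = ["skiing", "picnic", "reading"]

def add_activity_noise(row, rng, p_noise=0.1):
    """With probability p_noise, replace the intended ACTIVITY with one of
    the other two activities at random; otherwise return the row unchanged."""
    if rng.random() < p_noise:
        others = [a for a in ACTIVITIES if a != row["ACTIVITY"]]
        row = {**row, "ACTIVITY": rng.choice(others)}
    return row
```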

This helped! Running with noisier data caused an almost-correct DAG to rise to the top—but I was unhappy with how close together the scores were. The fifth-ranked DAG postulated three extra edges between the independent variables of MOOD, WEATHER, and FRIEND, and I didn’t like to see that.

(As previously mentioned, the SWEATER/WEATHER and MOOD/MUSIC arrow flips are unavoidable.)

================================================================================
EXHAUSTIVE DAG SEARCH SUMMARY
================================================================================
Oriented graphs checked: 14348907
Acyclic DAGs found: 3781503

Top 20 DAGs
rank log_evidence delta_from_best num_edges dag
1 −33771.843508 0.000000 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
2 −33771.843511 −0.000003 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->WEATHER; WEATHER->ACTIVITY
3 −33772.001484 −0.157976 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->SWEATER
4 −33772.001487 −0.157979 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; WEATHER->ACTIVITY; WEATHER->SWEATER
5 −33772.273770 −0.430262 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; ACTIVITY->WEATHER; FRIEND->MOOD; MOOD->MUSIC; WEATHER->FRIEND; WEATHER->MOOD; WEATHER->SWEATER
6 −33772.661834 −0.818327 7 ACTIVITY->FRIEND; MOOD->ACTIVITY; MOOD->FRIEND; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY; WEATHER->FRIEND
7 −33772.661838 −0.818330 7 ACTIVITY->FRIEND; MOOD->ACTIVITY; MOOD->FRIEND; MUSIC->MOOD; SWEATER->WEATHER; WEATHER->ACTIVITY; WEATHER->FRIEND
8 −33772.673327 −0.829819 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; ACTIVITY->WEATHER; MOOD->FRIEND; MOOD->MUSIC; WEATHER->FRIEND; WEATHER->MOOD; WEATHER->SWEATER
9 −33772.819810 −0.976302 7 ACTIVITY->FRIEND; MOOD->ACTIVITY; MOOD->FRIEND; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->FRIEND; WEATHER->SWEATER
10 −33772.819813 −0.976305 7 ACTIVITY->FRIEND; MOOD->ACTIVITY; MOOD->FRIEND; MUSIC->MOOD; WEATHER->ACTIVITY; WEATHER->FRIEND; WEATHER->SWEATER
11 −33773.618590 −1.775082 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; ACTIVITY->WEATHER; FRIEND->MOOD; FRIEND->WEATHER; MOOD->MUSIC; WEATHER->MOOD; WEATHER->SWEATER
12 −33773.784473 −1.940965 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; FRIEND->MOOD; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY; WEATHER->FRIEND; WEATHER->MOOD
13 −33773.942449 −2.098941 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; FRIEND->MOOD; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->FRIEND; WEATHER->MOOD; WEATHER->SWEATER
14 −33774.002568 −2.159060 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; ACTIVITY->WEATHER; MOOD->FRIEND; MOOD->MUSIC; MOOD->WEATHER; WEATHER->FRIEND; WEATHER->SWEATER
15 −33774.184029 −2.340522 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; MOOD->FRIEND; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY; WEATHER->FRIEND; WEATHER->MOOD
16 −33774.342005 −2.498497 8 ACTIVITY->FRIEND; ACTIVITY->MOOD; MOOD->FRIEND; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->FRIEND; WEATHER->MOOD; WEATHER->SWEATER
17 −33774.409236 −2.565728 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
18 −33774.567212 −2.723704 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->MUSIC; WEATHER->ACTIVITY; WEATHER->SWEATER
19 −33774.766165 −2.922657 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; MUSIC->SWEATER; SWEATER->WEATHER; WEATHER->ACTIVITY
20 −33774.766168 −2.922660 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; MUSIC->SWEATER; SWEATER->WEATHER; WEATHER->ACTIVITY

At this point, I was annoyed because I wanted to see a larger dropoff from the top four DAGs (all of which were as good as possible) to the next batch. I tried decreasing alpha (a parameter on the prior) by a bunch, effectively encouraging DAGs with fewer edges. This helped a little.
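For reference, the kind of score where alpha plays this role is a Dirichlet-multinomial (BDeu-style) local score: each node's contribution is the marginal likelihood of its data given its parents, with alpha spread as pseudocounts over parent configurations and child states. This is a simplified sketch, not my actual scorer (in particular, it only counts parent configurations observed in the data):

```python
from collections import Counter
from math import lgamma

def local_log_evidence(data, node, parents, alpha=1.0):
    """BDeu-style local score: log marginal likelihood of `node` given
    `parents` under a Dirichlet prior with equivalent sample size `alpha`.
    Smaller alpha means smaller pseudocounts, which in practice penalizes
    extra parents and favors sparser DAGs."""
    states = sorted({row[node] for row in data})
    config_counts = Counter(tuple(row[p] for p in parents) for row in data)
    cell_counts = Counter(
        (tuple(row[p] for p in parents), row[node]) for row in data)
    q = len(config_counts)                 # observed parent configurations
    a_ij = alpha / q                       # pseudocount per configuration
    a_ijk = alpha / (q * len(states))      # pseudocount per cell
    score = 0.0
    for cfg, n_ij in config_counts.items():
        score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for k in states:
            score += lgamma(a_ijk + cell_counts[(cfg, k)]) - lgamma(a_ijk)
    return score
```

The total log evidence for a DAG is then just the sum of these local scores over all nodes, which is why the output above can break the best DAG's score down node by node.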

================================================================================
EXHAUSTIVE DAG SEARCH SUMMARY
================================================================================
Oriented graphs checked: 14348907
Acyclic DAGs found: 3781503

Top 20 DAGs
rank log_evidence delta_from_best num_edges dag
1 −33844.181016 0.000000e+00 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
2 −33844.181016 −3.240712e-08 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->WEATHER; WEATHER->ACTIVITY
3 −33844.463503 −2.824879e-01 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->SWEATER
4 −33844.463503 −2.824879e-01 5 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; WEATHER->ACTIVITY; WEATHER->SWEATER
5 −33850.671384 −6.490368e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
6 −33850.674956 −6.493940e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; MUSIC->SWEATER; SWEATER->WEATHER; WEATHER->ACTIVITY
7 −33850.674956 −6.493940e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; MUSIC->SWEATER; SWEATER->WEATHER; WEATHER->ACTIVITY
8 −33850.726915 −6.545899e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; MUSIC->SWEATER; WEATHER->ACTIVITY; WEATHER->SWEATER
9 −33850.726915 −6.545899e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; MUSIC->SWEATER; WEATHER->ACTIVITY; WEATHER->SWEATER
10 −33850.953872 −6.772856e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->MUSIC; WEATHER->ACTIVITY; WEATHER->SWEATER
11 −33851.291394 −7.110378e+00 6 FRIEND->ACTIVITY; FRIEND->MOOD; MOOD->ACTIVITY; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
12 −33851.291838 −7.110823e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->FRIEND; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
13 −33851.291838 −7.110823e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->FRIEND; MUSIC->MOOD; SWEATER->WEATHER; WEATHER->ACTIVITY
14 −33851.302476 −7.121460e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MOOD->MUSIC; SWEATER->FRIEND; SWEATER->WEATHER; WEATHER->ACTIVITY
15 −33851.302476 −7.121460e+00 6 FRIEND->ACTIVITY; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->FRIEND; SWEATER->WEATHER; WEATHER->ACTIVITY
16 −33851.305604 −7.124588e+00 6 FRIEND->ACTIVITY; FRIEND->SWEATER; MOOD->ACTIVITY; MOOD->MUSIC; SWEATER->WEATHER; WEATHER->ACTIVITY
17 −33851.305604 −7.124588e+00 6 FRIEND->ACTIVITY; FRIEND->SWEATER; MOOD->ACTIVITY; MUSIC->MOOD; SWEATER->WEATHER; WEATHER->ACTIVITY
18 −33851.475408 −7.294393e+00 6 FRIEND->ACTIVITY; FRIEND->SWEATER; MOOD->ACTIVITY; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->SWEATER
19 −33851.475408 −7.294393e+00 6 FRIEND->ACTIVITY; FRIEND->SWEATER; MOOD->ACTIVITY; MUSIC->MOOD; WEATHER->ACTIVITY; WEATHER->SWEATER
20 −33851.573882 −7.392866e+00 6 FRIEND->ACTIVITY; FRIEND->MOOD; MOOD->ACTIVITY; MOOD->MUSIC; WEATHER->ACTIVITY; WEATHER->SWEATER

I could do more tuning, of course, but at some point I’m not sure what to port back from my testing setup to my real-world model comparison. Anything I do to the synthetic data (making more of it, making it noisier) doesn’t tell me much about whether the code is good. I’m not sure how or whether to port back the prior parameter tuning. I’m concerned about my real world data being quite sparse, and some of the same problems I saw on the test data set cropping up over there for the same reasons. And so on.

This is a good place to stop for today.

  1. ^

    Or at least, as one does, if one is in the process of metamorphosing into a technical alignment researcher, and one’s mentor is John Wentworth.

  2. ^

    Especially if anyone asks me to! John wrote such a sequence about seven years ago, but it’s pretty brief and doesn’t go into a ton of detail. If anyone would get value out of having something more detailed, please say so.