Fun With DAGs

Crossposted here.

Directed acyclic graphs (DAGs) turn out to be really fundamental to our world. This post is just a bunch of neat stuff you can do with them.

Utility Functions

Suppose I prefer chicken over steak, steak over pork, and pork over chicken. We can represent my preferences with a directed graph like this:

This poses a problem. If I’m holding a steak, then I’ll pay a penny to upgrade it to chicken — since I prefer chicken over steak. But then I’ll pay another penny to upgrade my chicken to pork, and another to upgrade it to steak, and then we’re back to the beginning and I’ll pay to upgrade the steak to chicken again! Someone can stand there switching out chicken/pork/steak with me, and collect my money all day.

Any time our preference graph contains cycles, we risk this sort of “money-pump”. So, to avoid being money-pumped, let’s make our preferences not contain any cycles:

Now our preferences are directed (the arrows), acyclic (no cycles), and a graph (arrow-and-box drawing above). It’s a DAG!

Notice that we can now sort the DAG nodes, spatially, so that each node only points to higher-up nodes:

Reading off the nodes from bottom to top, the order is: chicken, pork, steak.

This is called a topological sort, or topo sort for short. We can always topo sort a DAG — there is always some order of the nodes such that each node only points to nodes after it in the sort order.

In this case, the topo sort orders our preferences. We prefer pork over everything which comes before pork in the topo sort, and so forth. We could even number the items according to their topo-sorted order: 1 for chicken, 2 for pork, 3 for steak. We prefer whichever thing has the higher number, i.e. the higher position in the topo-sort order.

This is called a utility function. This utility function U is defined by U(chicken) = 1, U(pork) = 2, U(steak) = 3. We prefer things with higher utility — in other words, we want to maximize the utility function!

To summarize: if our preferences contain cycles, we can be money-pumped. If they don’t, then our preferences form a DAG, and we can topo sort it to find a utility function.
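
Here is a minimal sketch of that recipe in code (the dictionary encoding of the graph is just one illustrative choice): topo sort the preference DAG with a depth-first search, then number the items by their position in the sorted order.

```python
# Preferences as a DAG: each arrow points from a less-preferred item
# to a more-preferred one.
preferences = {
    "chicken": ["pork", "steak"],
    "pork": ["steak"],
    "steak": [],
}

def topo_sort(graph):
    """Order nodes so that each node only points to nodes after it."""
    visited, order = set(), []
    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for successor in graph[node]:
            visit(successor)
        order.append(node)  # appended only after everything it points to
    for node in graph:
        visit(node)
    order.reverse()  # reverse postorder: each node precedes its successors
    return order

def utility(graph):
    """Number items by topo-sort position: higher number = more preferred."""
    return {item: i + 1 for i, item in enumerate(topo_sort(graph))}

utility(preferences)  # {"chicken": 1, "pork": 2, "steak": 3}
```

Any valid topo sort would do — different sorts just give different (but equally usable) utility functions.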

Circuits

Here’s a really simple circuit to compute f = (x+y)*(x-y):

Notice that the circuit is a DAG. In this case, the topo sort tells us the order in which things are computed: “+” and “-” both come before “x” (the multiplication, not the input). If the graph contained any cycles, then we wouldn’t know how to evaluate it — if the value of some node changed as we went around a cycle, we might get stuck updating in circles indefinitely!

It’s the DAG structure that makes a circuit nice to work with: we can evaluate things in order, and we only need to evaluate each node once.
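
As a sketch of that evaluation in code (the gate names and list encoding are made up for illustration), here is the circuit as a list of gates in topo-sorted order, each evaluated exactly once:

```python
import operator

def eval_circuit(inputs):
    """Evaluate f = (x + y) * (x - y) as a circuit in topo-sorted order."""
    values = dict(inputs)  # node name -> computed value
    gates = [  # (gate name, operation, input nodes), listed in topo order
        ("sum", operator.add, ("x", "y")),
        ("diff", operator.sub, ("x", "y")),
        ("prod", operator.mul, ("sum", "diff")),
    ]
    for name, op, (a, b) in gates:
        values[name] = op(values[a], values[b])  # each node computed once
    return values["prod"]

eval_circuit({"x": 5, "y": 3})  # (5 + 3) * (5 - 3) = 16
```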

Dy­namic Programming

Here’s a classic algorithms problem:

Suppose we want to compute the number of paths from corner A to corner B, travelling only down and right along the lines. (I’m about to give a solution, so pause to think if you don’t want it spoiled.)

Our main trick is that the number of paths to B is the number of paths to C plus the number of paths to D. Likewise, for any other node, the number of paths to the node is the sum of the numbers above and to the left. So we can turn the picture above into a circuit, by sticking “+” operations at each node and filling in the arrowheads:

I’ve omitted all the tiny “+” signs, but each node in this circuit sums the values from the incoming arrows. Start with 1 at corner A (there is only one path from A to itself), and the circuit will output the number of paths from A to B at corner B.

Why is this interesting? Well, consider coding this up. We can write a simple recursive function:

def f(row, col):
    # Number of down-and-right paths from corner A to (row, col).
    if row == 0 or col == 0:
        return 1
    return f(row - 1, col) + f(row, col - 1)

… but this function is extremely inefficient. In effect, it starts at B, then works backward toward A, exploring every possible path through the grid and adding them all up. The total runtime will be exponential in the size of the grid.

However, note that there are not that many combinations of (row, col) at which f will be evaluated — indeed, there’s only one (row, col) combination for each node in the grid. The inefficiency is because our simple recursive function calls f(row, col) multiple times for each node, and runs the full computation each time. Instead, we could just compute f for each node once.

How can we do that? We already did! We just need to treat the problem as a circuit. Remember, when evaluating a circuit, we work in topo-sorted order. In this case, that means starting from A, which means starting from f(0,0). Then we store that value somewhere, and move on — working along the top f(0, col) and the side f(row, 0). In general, we work from upper left to lower right, following topo sort order, and storing results as we go. Rather than using function calls, as in the recursive formulation, we just look up stored values. As long as we follow topo sort order, each value we need will be stored before we need it.

This is “dynamic programming”, or DP: taking a recursive function, and turning it into a circuit. Traditionally, DP is taught with tables, but I find this deeply confusing — the key to DP is that it isn’t really about tables, it’s about DAGs. We take the computational DAG of some recursive function, and turn it into a circuit. By evaluating in topo-sort order, we avoid re-evaluating the function at the same node twice.
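
A sketch of that circuit-style evaluation in code (the grid-size parameters are illustrative): visit the nodes in topo-sorted order, storing each value before it is needed.

```python
def count_paths(rows, cols):
    """Count down-and-right paths through a grid, circuit-style."""
    paths = {}
    # Visit nodes in topo-sorted order: upper left to lower right.
    for row in range(rows + 1):
        for col in range(cols + 1):
            if row == 0 or col == 0:
                paths[(row, col)] = 1  # one path along the top or left edge
            else:
                # Paths to a node = paths from above + paths from the left.
                paths[(row, col)] = paths[(row - 1, col)] + paths[(row, col - 1)]
    return paths[(rows, cols)]

count_paths(2, 2)  # 6 paths through a 2x2 grid
```

Each node’s value is computed exactly once, so the runtime is linear in the number of grid nodes rather than exponential.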

Turing-Computable Functions: Circuits + Symmetry

To turn circuits into fully general Turing-computable functions (i.e. anything which can be written in a programming language), we just need to allow recursion.

Here’s a recursive factorial function:

def f(n):
    # Recursive factorial.
    if n == 0:
        return 1
    return n * f(n - 1)

We can represent this as an infinite circuit:

Each of the dashed boxes corresponds to a copy of the function f. The infinite ladder handles the recursion — when f calls itself, we move down into another box. We can view this as a symmetry relation: the full circuit is equivalent to the top box, plus a copy of the full circuit pasted right below the top box.

Of course, if we actually run the code for f, it won’t run forever — we won’t evaluate the whole infinite circuit! Because of the conditional “n == 0?”, the behavior of the circuit below some box is irrelevant to the final value, so we don’t need to evaluate the whole thing. But we will evaluate some subset of the circuit. For n = 2, we would evaluate this part of the circuit:

This is the “computational DAG” for f(2). While the code for f defines the function, the computational DAG shows the computation — which steps are actually performed, in what order, with what dependencies.
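
To see which part of the infinite circuit actually gets evaluated, we can instrument the function (the trace-list mechanism here is just for illustration):

```python
def f_traced(n, trace):
    """Factorial, recording which boxes of the infinite circuit we enter."""
    trace.append(n)  # one entry per dashed box actually evaluated
    if n == 0:
        return 1
    return n * f_traced(n - 1, trace)

trace = []
f_traced(2, trace)  # returns 2
# trace == [2, 1, 0]: only three boxes of the infinite ladder were used
```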

Parallelization

The computational DAG forms the basis for parallelization: speeding up an algorithm by running it on multiple cores at once.

Consider our simple circuit from earlier:

Where can we perform steps in parallel? Well, we can evaluate the “-” and “+” at the same time. But we can’t perform the “x” until after both the “-” and “+” are done: the “x” requires the results from those two steps.

More generally, look at the topo sort order of the circuit. The topo sort is not unique — “+” could come before or after “-”, since there’s no directed path between them. That means they can be performed in parallel: neither requires the result from the other. On the other hand, “x” comes strictly later in the topo sort order, because it does require the other values as input.

For a more complicated function, we’d have to think about the computational DAG. Whenever one node comes strictly after another in the computational DAG, those two nodes cannot be parallelized. But as long as neither node comes strictly after the other, we can parallelize them. For instance, in our DP example above, the points C and D in the grid could be evaluated in parallel.
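
For the simple circuit from earlier, the parallelization might be sketched like this (using Python threads purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def f(x, y):
    """Compute (x + y) * (x - y), running the independent gates in parallel."""
    with ThreadPoolExecutor() as pool:
        plus = pool.submit(lambda: x + y)   # "+" and "-" have no directed
        minus = pool.submit(lambda: x - y)  # path between them, so they can run at once
        # "x" comes strictly later in topo order: it must wait for both results.
        return plus.result() * minus.result()

f(5, 3)  # (5 + 3) * (5 - 3) = 16
```

For gates this cheap the thread overhead swamps any speedup, of course — the point is just the dependency structure.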

This sort of thinking isn’t restricted to computer science. Suppose we have a company producing custom mail-order widgets, and every time an order comes in, there’s a bunch of steps to crank out the widget:

Some steps depend on others, but some don’t. We can confirm the address and print the label in parallel with producing the widget itself, in order to mail it out sooner. A lot of optimization in real-world company processes looks like this.

Causality

Here’s a classic example from Pearl:

The season could cause rain, or it could cause you to run the sprinkler, but the season will not directly cause the sidewalk to be wet. The sidewalk will be wet in some seasons more than others, but that goes away once you control for rain and sprinkler usage. Similarly, rain can cause the sidewalk to be wet, which causes it to be slippery. But if something keeps the sidewalk dry — a covering, for instance — then it won’t be slippery no matter how much it rains; therefore rain does not directly cause slipperiness.

These are the sort of common-sense conclusions we can encode in a causal DAG. The DAG’s arrows show direct cause-effect relations, and paths of arrows show indirect cause and effect.

A few of the neat things we can do with a causal DAG:

  • Mathematically reason about how external changes (e.g. covering a sidewalk) will impact cause-effect systems

  • Account for causality when updating beliefs. For instance, a wet sidewalk makes rain more likely, but not if we know the sprinkler ran.

  • Find a compact representation for the probability distribution of the variables in the DAG

  • Statistically test whether some data is compatible with a particular causal DAG

Most of these are outside the scope of this post, but Highly Advanced Epistemology 101 has more.
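
As one small illustration of the compact-representation point (the encoding below is just a sketch): given a causal DAG, the joint distribution factors into one conditional probability per node, given only that node’s direct causes.

```python
# Pearl's sprinkler example: each node lists its direct causes (parents).
parents = {
    "season": [],
    "sprinkler": ["season"],
    "rain": ["season"],
    "wet": ["sprinkler", "rain"],
    "slippery": ["wet"],
}

def factorization(parents):
    """Write out the joint distribution as one factor per node."""
    return " * ".join(
        f"P({node}|{','.join(ps)})" if ps else f"P({node})"
        for node, ps in parents.items()
    )

factorization(parents)
# "P(season) * P(sprinkler|season) * P(rain|season) * P(wet|sprinkler,rain) * P(slippery|wet)"
```

Each factor only needs a table over a node and its parents, which is far smaller than a table over all five variables at once.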

The DAG structure of causality is the main reason for the ubiquity of DAGs: causality is everywhere, so DAGs are everywhere. Circuits are really just cause-effect specifications for calculations. Not utility, though — that one’s kind of an outlier.