Algorithms as Case Studies in Rationality

This post springs out of a very long line of thought, which I will only summarise small parts of.

It began with the puzzling realisation that the algorithm which computer algebra systems use to perform symbolic integration is radically different from the guess-and-check method taught in schools. My first reaction was: why are we not taught the systematic way of doing it? The same gap shows up in other areas of mathematics. In a few cases, such as solving quadratic equations or systems of linear equations, students are eventually taught the fast way. In many cases, however, the existence of an algorithm is not even mentioned.

My point, however, is not to criticise the educational practice; in fact, I agree with the idea of teaching mathematics as an exploration of ideas rather than an application of formulaic solution methods. Rather, I would like to encourage people to eventually learn the algorithms, and try to apply them. A good algorithm is a study in rational behaviour, and I think we can take home a lesson from each.

I’ll just give two examples which I find particularly fascinating: Knuth-Bendix completion and the summary-product algorithm. The first is most relevant to fast mathematical reasoning. The second is relevant to studying the reasonableness of fast-and-messy probabilistic reasoning, the way humans do it.

Knuth-Bendix Completion

Knuth-Bendix completion (“K-B” from now on) is a fascinating formalisation of the mathematical idea of simplification. Everyone should recognise the phrase “simplify your answer” from high school homework assignments. I never suspected that simplification could be a really powerful method for mathematical reasoning.

K-B is a method for reasoning about equality. Equality is one of the simplest relationships we can have between two things, yet the naive way of reasoning about it results in an explosion of possible inferences. If there are N things in our universe, there are N^2 equality statements which may be true or false. We can substitute equals for equals all day and get nowhere. So why does it seem like such a simple relation? I think the K-B algorithm gives a good answer to that.

We start with the basic axioms for the domain we are reasoning about. The domain consists of a set of expressions, some of which are equal to others. The axioms provide some general equalities; for example, they might be the trigonometric identities, if we are reasoning about trigonometric expressions.

First, we decide on a relatively arbitrary ordering of “simplicity” over the expressions we’re using. Practical considerations and personal preferences will go into this, but for the algorithm itself, there are only a few requirements: it should be a well-ordering (so any two expressions are comparable, and there are no infinitely descending chains), and it should be consistent in the sense that if we replace a sub-expression with a simpler sub-expression, the whole expression gets simpler.

Second, we take the starting set of identities, and now treat them as rewrite rules: we judge which direction goes from complex to simple and we only apply them in that direction. Since expressions can only get simpler, we now know that we won’t just keep producing equality statements endlessly without getting where we want to go.

Unfortunately, we lose some information by just taking one side of the equality. The truth is, it’s sometimes necessary to alter an expression in a way which at first makes it more complicated, to allow application of another rule which eventually gets us to a simpler form.

The genius of K-B is to find these situations, which are detected via so-called critical pairs, and to add new simplification rules which shortcut such steps. So, third: look for critical pairs and add the necessary rules. (I won’t try to give the exact statement of this step; I refer the reader again to the Wolfram article. [Edit: Or the Wikipedia article.]) This step is repeated until the set of rules is complete. (Sadly, it will sometimes happen that completion is impossible.)
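
To give a flavour of what a critical pair looks like, here is the standard textbook example from group theory (my illustration, not one taken from those articles). Suppose our rules include e*x → x, i(x)*x → e, and (x*y)*z → x*(y*z), each oriented in its simplifying direction. The expression (i(x)*x)*z can then be rewritten in two different ways: associativity turns it into i(x)*(x*z), while the inverse rule followed by the identity rule turns it into z. Neither result simplifies to the other using the existing rules, so completion adds a new rule, i(x)*(x*z) → z, to bridge the gap.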

Once the rules are complete, we can reason purely by simplification. The procedure “simplify” consists of looking through our list of rules and applying them in the simplifying direction, until we can’t find one that applies any more. The result of simplification is referred to as the normal form of an expression. We can test whether two expressions are equal by putting them both in normal form: if the two are equal, they will have the same normal form; if they are not, they won’t.
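
To make the “reason purely by simplification” step concrete, here is a minimal sketch in Python (a toy example of my own, not code from any K-B implementation). The single rule “ba” → “ab” happens to be a complete rewriting system for the free commutative monoid on two letters, so checking whether two words are equal up to commutativity reduces to comparing normal forms:

```python
# A minimal sketch: once we have a complete (terminating and confluent) set of
# rewrite rules, "simplify" is just "apply any rule until nothing applies",
# and equality testing reduces to comparing normal forms.

# Toy complete system over strings: the single rule "ba" -> "ab" sorts the
# letters, so two words are equal in the free commutative monoid on {a, b}
# exactly when their normal forms coincide.
RULES = [("ba", "ab")]

def simplify(expr: str) -> str:
    """Rewrite in the simplifying direction until no rule applies."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            if lhs in expr:
                expr = expr.replace(lhs, rhs, 1)  # one rewrite step at a time
                changed = True
    return expr  # the normal form

def provably_equal(x: str, y: str) -> bool:
    """With a complete rule set, equal expressions share a normal form."""
    return simplify(x) == simplify(y)

print(simplify("baba"))                # -> "aabb"
print(provably_equal("abba", "baab"))  # -> True
print(provably_equal("ab", "abb"))     # -> False
```

The real algorithm works over structured expressions with variables rather than flat strings, but the shape of the procedure is the same: rewrite until nothing applies, then compare.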

How can this be applied to life? Well, I admit that this one only applies if you do fairly advanced math on a regular basis (though, since I’m writing for rationalists, perhaps you’ll be sympathetic if I encourage you to learn more mathematics and to apply it to your life!). In my experience, the K-B algorithm matches the behaviour of sophisticated mathematical reasoning fairly well; people will have a preferred way of simplifying different sorts of expressions, and the interesting theorems correspond to new simplifications which can only be proved by initially reasoning in a direction which looks more complex.

The second algorithm I’ll review is more applicable to everyday situations.

Summary-Product

Summary-product is a recent algorithm, but one with deep roots. The interesting thing about this algorithm is that it is a generalisation of dozens of the most important algorithms in different specific areas. The PDF I linked to is a good introduction to this aspect of the algorithm, and I encourage readers to check it out; for my overview, however, I’ll only look at the special case known as the belief propagation algorithm.

Belief propagation is a way of reasoning about probabilities in a tractable way.

First, we need to factor our beliefs into a large number of local belief functions, each of which deals with only a few variables. These represent local probabilistic dependencies between facts: perhaps conditional probabilities (in what’s called a Bayesian network), or just weights which contribute to probabilities (in what’s called a Markov network). What’s important is that if we multiplied all the local functions together, we would get a coherent probability distribution.
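
As a minimal sketch of what such a factorisation looks like (toy numbers of my own, not an example from the linked PDF), here is a three-variable chain whose joint distribution is just the normalised product of two local functions:

```python
import numpy as np

# A hypothetical three-variable chain A - B - C over binary variables.
# f1 couples A and B, f2 couples B and C; neither is a full joint on its own.
f1 = np.array([[4.0, 1.0],   # f1[a, b]: weight for (A=a, B=b)
               [1.0, 4.0]])
f2 = np.array([[3.0, 1.0],   # f2[b, c]: weight for (B=b, C=c)
               [1.0, 3.0]])

# Multiplying all the local functions together (and normalising) gives one
# coherent joint distribution p(A, B, C).
joint = f1[:, :, None] * f2[None, :, :]   # shape (A, B, C)
joint /= joint.sum()

print(joint.sum())             # ~ 1.0: a proper probability distribution
print(joint.sum(axis=(1, 2)))  # the marginal p(A)
```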

Next, we take stock of the evidence. This gives fixed beliefs to some of the variables.

What we want to find now are beliefs for the other variables, referred to as marginal probability distributions. The naive way of finding these is to sum over the entire state-space. For N unobserved variables, we have an N-dimensional state-space; we want to sum over every variable but the one which we’re calculating a belief for. If every dimension is of size v, we have v^N entries to sum up.

The idea which leads to the belief propagation algorithm is that we can do better by using the factorisation, thanks to the distributive law: we push the sums as far in as we can, so that whenever possible, we sum over a dimension before we multiply. If the dependencies are treelike, this will give us a calculation which takes time proportional to N, rather than v^N; a huge gain.
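
Here is a small sketch of the distributive-law trick (again a toy chain of my own): the naive marginal sums the product of factors over the full state space, while the factorised version sums out one variable at a time. The two agree, but the second never builds the v^N table.

```python
import numpy as np
from itertools import product

# A hypothetical chain X1 - X2 - X3 - X4 of binary variables, with one
# pairwise factor per edge. The unnormalised joint is the product of factors.
np.random.seed(0)
factors = [np.random.rand(2, 2) for _ in range(3)]   # factors[i] couples X_{i+1}, X_{i+2}

def naive_marginal_x1():
    """Sum the product of all factors over the full v^N state space."""
    marg = np.zeros(2)
    for x1, x2, x3, x4 in product(range(2), repeat=4):
        marg[x1] += factors[0][x1, x2] * factors[1][x2, x3] * factors[2][x3, x4]
    return marg / marg.sum()

def pushed_in_marginal_x1():
    """Push the sums inward: sum out X4, then X3, then X2 (distributive law)."""
    m4 = factors[2].sum(axis=1)   # sum over x4 of f3(x3, x4)          -> depends on x3
    m3 = factors[1] @ m4          # sum over x3 of f2(x2, x3) * m4(x3) -> depends on x2
    m2 = factors[0] @ m3          # sum over x2 of f1(x1, x2) * m3(x2) -> depends on x1
    return m2 / m2.sum()

print(naive_marginal_x1())       # the two methods agree,
print(pushed_in_marginal_x1())   # but the second never touches the full joint
```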

Turning this into a local propagation algorithm gives us rules for updating a belief in relation to its neighbours. This has two advantages.

First, it turns out that we can calculate beliefs for all of the variables in just twice the time it takes to calculate the belief for one; that is, in time proportional to 2N. Why? Computing the belief function for one variable amounts to propagation starting at the far edges of the network and heading towards that variable. This loads the network with most of the information it needs. We can finish off the computation by simply propagating in the reverse direction, away from that variable; this spreads the evidence all around the network like jelly on toast, completing the marginalisation of every variable in 2N time.

Second, a surprising feature of the propagation algorithm is that it gives good approximate results for non-treelike factorisations. This is like saying that we can push in sums even when the distributive law doesn’t hold! Rather than 2N propagations, we just keep propagating until our beliefs converge to stable values (or until we’re fed up with waiting). In many situations, this will still converge to exact results or useful estimates. This makes belief propagation a reasonable candidate for general inference.

I think there are many lessons for rationality hidden in the details of the belief propagation algorithm. In fact, since belief propagation is a core Bayesian activity, there are already posts here dealing with some of the important lessons. I’ll just mention one which I find quite striking, personally.

A naive method of belief propagation might update the belief at a variable based on the raw estimated marginal which its neighbour currently holds. If two neighbouring variables X and Y imply each other with high probability, though, we’d run into trouble: suppose some evidence raises our belief in X. Then we propagate this to Y, raising our belief there. But then we notice that, once again, a neighbour of X has changed, so we re-update X, raising our belief in X once more. We’d go back and forth like this until X and Y lock in with probability indistinguishable from 1.

The actual rules of belief propagation stop this from happening with the idea of a directional message. The message which X sends to Y is the belief we compute for X using all the evidence except the evidence coming from Y. In general, the belief which gets sent in a particular direction uses all the evidence except the evidence coming back from that same direction.
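
Here is a sketch of that rule on a three-variable chain (a toy model of my own, not code from any particular source). Each message leaves out whatever came back from its recipient, and one left-to-right sweep plus one right-to-left sweep is enough to give every variable its marginal, which is the two-pass property mentioned above:

```python
import numpy as np

# A toy chain X0 - X1 - X2 of binary variables (hypothetical numbers).
# pairwise[i] couples X_i and X_{i+1}; evidence[i] is a local factor encoding
# whatever observations bear directly on X_i.
pairwise = [np.array([[4.0, 1.0], [1.0, 4.0]]),
            np.array([[4.0, 1.0], [1.0, 4.0]])]
evidence = [np.array([9.0, 1.0]),   # evidence strongly favouring X0 = 0
            np.array([1.0, 1.0]),   # nothing observed about X1
            np.array([1.0, 1.0])]   # nothing observed about X2

# Forward sweep (left to right): the message X_i sends rightward combines its
# local evidence with the message from its *left* neighbour only, never the
# message coming back from the right. That exclusion is what blocks the
# double counting described above.
fwd = []                            # fwd[i]: message from X_i into X_{i+1}
left_msg = np.ones(2)
for i in range(2):
    left_msg = pairwise[i].T @ (evidence[i] * left_msg)
    fwd.append(left_msg)

# Backward sweep (right to left), built symmetrically.
bwd = [None, None]                  # bwd[i]: message from X_{i+1} into X_i
right_msg = np.ones(2)
for i in (1, 0):
    right_msg = pairwise[i] @ (evidence[i + 1] * right_msg)
    bwd[i] = right_msg

def marginal(i):
    """Belief at a variable: local evidence times the messages from both sides."""
    b = evidence[i].copy()
    if i > 0:
        b = b * fwd[i - 1]
    if i < 2:
        b = b * bwd[i]
    return b / b.sum()

for i in range(3):
    print(f"p(X{i} | evidence) ~ {marginal(i)}")   # all marginals from two sweeps
```

Notice that a variable’s own belief does use the messages from both directions; it is only the outgoing messages that leave out the return direction, which keeps the X and Y feedback loop from inflating both beliefs toward certainty.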

This prevents circular reasoning. For all the talk about the fallacy of circular reasoning, classical logic does not really have a good treatment of it. For a statement A, it’s perfectly valid to prove A from an assumption of A in classical logic; in fact we can prove A->A for any proposition (where “->” is material implication)! I think it’s not until we think about tractable probabilistic reasoning that we really get a good account of what’s wrong with circular reasoning: it makes us count the same evidence more than once.

Next time you’re talking with someone, try to keep track of the directionality of the beliefs being propagated by an argument. Yes, I’m serious: it works! In fact, most of the time it’s rather trivial. Sometimes, though, it gives an interesting insight into what’s going on—often cases where classical logic tells us that an inference is just fine, but informal pragmatics tell us that there is something silly about it. (I think there’s some value in knowing precisely what we mean by silly.)

Conclusion

I presented very rough glosses of these two algorithms, and I’d encourage everyone to read more about the details in other places. However, what I’m really trying to drive at here is a broader point: algorithms which have been developed by computer scientists can really give us useful lessons about rationality. Often we find that if it’s a good way for a computer to think, it’s a good way for us to think, too. This does not mean memorising and mechanically executing an algorithm, of course; it means understanding the trick that makes the algorithm good, and applying that trick whenever we think it’s relevant, until it becomes second nature.

Question for discussion: what algorithms have you applied to your own thought processes? What algorithms might you be interested in applying?