“Inductive Bias”

(Part two in a series on “statistical bias”, “inductive bias”, and “cognitive bias”.)

Suppose that you see a swan for the first time, and it is white. It does not follow logically that the next swan you see must be white, but white seems like a better guess than any other color. A machine learning algorithm of the more rigid sort, if it sees a single white swan, may thereafter predict that any swan seen will be white. But this, of course, does not follow logically—though AIs of this sort are often misnamed “logical”. For a purely logical reasoner to label the next swan white as a deductive conclusion, it would need an additional assumption: “All swans are the same color.” This is a wonderful assumption to make if all swans are, in reality, the same color; otherwise, not so good. Tom Mitchell’s Machine Learning defines the inductive bias of a machine learning algorithm as the assumptions that must be added to the observed data to transform the algorithm’s outputs into logical deductions.
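
To make that definition concrete, here is a minimal sketch (mine, not an algorithm from Mitchell’s book) of a learner whose hard-wired assumption is “all swans are the same color”; given that assumption, a single white swan is enough to “deduce” the color of every swan to come:

```python
# A toy learner whose built-in inductive bias is the assumption
# "all swans are the same color."  Illustrative sketch only.

class SameColorSwanLearner:
    def __init__(self):
        self.color = None  # no swans observed yet

    def observe(self, color):
        self.color = color  # remember the one color ever seen

    def predict_next(self):
        if self.color is None:
            raise ValueError("No swans observed yet; nothing follows.")
        # The observation plus the assumption "all swans are the same color"
        # together entail this prediction; the observation alone does not.
        return self.color


learner = SameColorSwanLearner()
learner.observe("white")
print(learner.predict_next())  # -> "white"
```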

A more general view of inductive bias would identify it with a Bayesian’s prior over sequences of observations...

Consider the case of an urn filled with red and white balls, from which we are to sample without replacement. I might have prior information that the urn contains 5 red balls and 5 white balls. Or, I might have prior information that a random number was selected from a uniform distribution between 0 and 1, and this number was then used as a fixed probability to independently generate a series of 10 balls. In either case, I will estimate a 50% probability that the first ball is red, a 50% probability that the second ball is red, etc., which you might foolishly think indicated the same prior belief. But, while the marginal probabilities on each round are equivalent, the probabilities over sequences are different. In the first case, if I see 3 red balls initially, I will estimate a probability of 2/7 that the next ball will be red. In the second case, if I see 3 red balls initially, I will estimate a 4/5 chance that the next ball will be red (by Laplace’s Rule of Succession, thus named because it was proved by Thomas Bayes). In both cases we refine our future guesses based on past data, but in opposite directions, which demonstrates the importance of prior information.
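
If you want to check the arithmetic, here is a quick sketch in Python using exact fractions; the second scenario is handled by first integrating out the unknown probability, which makes every possible urn composition equally likely a priori:

```python
from fractions import Fraction
from math import comb

# Scenario 1: the urn is known to contain exactly 5 red and 5 white balls.
# After drawing 3 red balls without replacement, 2 red remain among 7.
print(Fraction(5 - 3, 10 - 3))       # 2/7

# Scenario 2: p ~ Uniform(0, 1), and each of the 10 balls was independently
# red with probability p.  Integrating p out, the prior probability that the
# urn holds k red balls works out to 1/11 for every k from 0 to 10.
prior = {k: Fraction(1, 11) for k in range(11)}

def likelihood_3_red(k):
    # Probability of drawing 3 red in the first 3 draws, without replacement,
    # from an urn with k red and 10 - k white balls.
    return Fraction(comb(k, 3), comb(10, 3))

joint = {k: prior[k] * likelihood_3_red(k) for k in prior}
evidence = sum(joint.values())
posterior = {k: joint[k] / evidence for k in joint}

# Average the chance of a red fourth draw over the posterior on urn contents.
p_fourth_red = sum(posterior[k] * Fraction(k - 3, 7) for k in posterior if k >= 3)
print(p_fourth_red)                  # 4/5, matching Laplace's Rule of Succession
```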

Suppose that your prior information about the urn is that a monkey tosses balls into the urn, selecting red balls with 1/4 probability and white balls with 3/4 probability, each ball selected independently. The urn contains 10 balls, and we sample without replacement. (E. T. Jaynes called this the “binomial monkey prior”.) Now suppose that on the first three rounds, you see three red balls. What is the probability of seeing a red ball on the fourth round?

First, we calculate the prior probability that the monkey tossed 0 red balls and 10 white balls into the urn; then the prior probability that the monkey tossed 1 red ball and 9 white balls into the urn; and so on. Then we take our evidence (three red balls, sampled without replacement) and calculate the likelihood of seeing that evidence, conditioned on each of the possible urn contents. Then we update and normalize the posterior probability of the possible remaining urn contents. Then we average over the probability of drawing a red ball from each possible urn, weighted by that urn’s posterior probability. And the answer is… (scribbles frantically for quite some time)… 1/4!

Of course it’s 1/4. We specified that each ball was independently tossed into the urn, with a known 1/4 probability of being red. Imagine that the monkey is tossing the balls to you, one by one; if it tosses you a red ball on one round, that doesn’t change the probability that it tosses you a red ball on the next round. When we withdraw one ball from the urn, it doesn’t tell us anything about the other balls in the urn.
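
Here is a sketch of that frantic scribbling, following the same steps: prior over urn contents, likelihood of the three red draws, posterior, and then the averaged probability for the fourth draw:

```python
from fractions import Fraction
from math import comb

N = 10
p_red = Fraction(1, 4)   # the monkey's per-ball probability of choosing red

# Prior over how many red balls the monkey tossed in: Binomial(10, 1/4).
prior = {k: comb(N, k) * p_red**k * (1 - p_red)**(N - k) for k in range(N + 1)}

def likelihood_3_red(k):
    # Chance of drawing 3 red in the first 3 draws, without replacement,
    # from an urn holding k red and N - k white balls.
    return Fraction(comb(k, 3), comb(N, 3))

# Update, normalize, and average over the possible urns.
joint = {k: prior[k] * likelihood_3_red(k) for k in prior}
evidence = sum(joint.values())
posterior = {k: joint[k] / evidence for k in joint}
p_fourth_red = sum(posterior[k] * Fraction(k - 3, N - 3) for k in posterior if k >= 3)

print(p_fourth_red)      # 1/4 -- the three red draws told us nothing new
```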

If you start out with a maximum-entropy prior (one under which every observation is independent of every other), then you never learn anything, ever, no matter how much evidence you observe. You do not even learn anything wrong—you always remain as ignorant as you began.
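
A toy demonstration of that claim: put a maximum-entropy prior over every possible length-4 sequence of draws, condition on three red draws, and the predictive probability for the fourth draw is exactly what it was before you looked:

```python
from fractions import Fraction
from itertools import product

# Maximum-entropy prior over sequences: every length-4 sequence of draws
# is equally likely, i.e., each draw is an independent fair coin.
sequences = list(product(["red", "white"], repeat=4))

# Condition on the first three draws being red, and ask about the fourth.
consistent = [s for s in sequences if s[:3] == ("red", "red", "red")]
p_fourth_red = Fraction(sum(s[3] == "red" for s in consistent), len(consistent))

print(p_fourth_red)   # 1/2 -- identical to the prior; the evidence changed nothing
```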

The more inductive bias you have, the faster you learn to predict the future, but only if your inductive bias does in fact concentrate more probability into sequences of observations that actually occur. If your inductive bias concentrates probability into sequences that don’t occur, this diverts probability mass from sequences that do occur, and you will learn more slowly, or not learn at all, or even—if you are unlucky enough—learn in the wrong direction.
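
Here is a small illustration of that trade-off; the true probability of red (0.9) and the particular Beta priors are made up for the example, but the pattern is the point: the prior that concentrates probability near the truth predicts well almost immediately, the flat prior takes longer, and the prior that concentrates probability in the wrong place is slowest to recover:

```python
import random

random.seed(0)
true_p = 0.9   # hypothetical true chance that any given observation is red
data = [random.random() < true_p for _ in range(20)]

# With a Beta(a, b) prior over the per-observation probability of red, the
# predictive probability after r reds in n observations is (a + r) / (a + b + n).
def predictive(a, b, observations):
    r, n = sum(observations), len(observations)
    return (a + r) / (a + b + n)

priors = {
    "correct bias, Beta(9, 1)": (9, 1),    # probability concentrated near the truth
    "flat prior,   Beta(1, 1)": (1, 1),    # Laplace's uniform prior
    "wrong bias,   Beta(1, 9)": (1, 9),    # probability concentrated away from it
}

for n in (0, 5, 20):
    seen = data[:n]
    report = "   ".join(f"{name}: {predictive(a, b, seen):.2f}"
                        for name, (a, b) in priors.items())
    print(f"after {n:2d} observations   {report}")

# A maximum-entropy prior over sequences would sit at 0.50 on every line:
# it never learns, no matter how the observations come out.
```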

Inductive biases can be probabilistically correct or probabilistically incorrect; if they are correct, the more of them you have, the better, and if they are incorrect, you are left worse off than if you had no inductive bias at all. Which is to say that inductive biases are like any other kind of belief: the true ones are good for you, the bad ones are worse than nothing. In contrast, statistical bias is always bad, period—you can trade it off against other ills, but it’s never a good thing in itself. Statistical bias is a systematic direction in errors; inductive bias is a systematic direction in belief revisions.

As the example of maximum entropy demonstrates, without a direction to your belief revisions, you end up not revising your beliefs at all. No future prediction based on past experience follows as a matter of strict logical deduction. Which is to say: All learning is induction, and all induction takes place through inductive bias.

Why is inductive bias called “bias”? Because it has systematic qualities, like a statistical bias? Because it is a form of pre-evidential judgment, which resembles the word “prejudice”, which resembles the political concept of bias? Damned if I know, really—I’m not the one who decided to call it that. Words are only words; that’s why humanity invented mathematics.