Embedded Naive Bayes

Suppose we have a bunch of earthquake sensors spread over an area. They are not perfectly reliable (in terms of either false positives or false negatives), but some are more reliable than others. How can we aggregate the sensor data to detect earthquakes?

A “naive” seismologist without any statistics background might try assigning different numerical scores to each sensor, roughly indicating how reliable their positive and negative results are, just based on the seismologist’s intuition. Sensor $i$ gets a score $s_i^+$ when it’s going off, and a score $s_i^-$ when it’s not. Then, the seismologist can add up the scores for all sensors going off at a given time, plus the scores for sensors not going off, to get an aggregate “earthquake score”. Assuming the seismologist has decent intuitions for the sensors, this will probably work just fine.
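A minimal sketch of this procedure in Python (the sensors and score values are made up for illustration):

```python
def earthquake_score(readings, scores_on, scores_off):
    """Add up the seismologist's hand-picked scores for the current readings.

    readings   -- list of booleans, True if sensor i is currently going off
    scores_on  -- score credited when sensor i is going off
    scores_off -- score credited when sensor i is quiet
    """
    total = 0.0
    for firing, s_on, s_off in zip(readings, scores_on, scores_off):
        total += s_on if firing else s_off
    return total

# Three sensors: the first is trusted, the third is flaky.
scores_on = [3.0, 2.0, 0.5]      # evidence for an earthquake when firing
scores_off = [-2.0, -1.0, -0.2]  # evidence against when quiet

print(earthquake_score([True, True, False], scores_on, scores_off))  # 4.8
```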

It turns out that this procedure is equivalent to a Naive Bayes model.

Naive Bayes is a causal model in which there is some parameter $\Theta$ in the environment which we want to know about—i.e. whether or not there’s an earthquake happening. We can’t observe $\Theta$ directly, but we can measure it indirectly via some data $X$ - i.e. outputs from the earthquake sensors. The measurements may not be perfectly accurate, but their failures are at least independent—one sensor isn’t any more or less likely to be wrong when another sensor is wrong.

We can represent this picture with a causal diagram:

[Diagram: arrows from $\Theta$ to each of the measurements $X_1, X_2, \ldots, X_n$]

From the diagram, we can read off the model’s equation: $P[X, \Theta] = P[\Theta] \prod_i P[X_i | \Theta]$. We’re interested mainly in the posterior probability $P[\Theta | X] = \frac{1}{Z} P[\Theta] \prod_i P[X_i | \Theta]$ or, in log odds form,

$$\ln \frac{P[\Theta = 1 | X]}{P[\Theta = 0 | X]} = \ln \frac{P[\Theta = 1]}{P[\Theta = 0]} + \sum_i \ln \frac{P[X_i | \Theta = 1]}{P[X_i | \Theta = 0]}$$

Stare at that equation, and it’s not hard to see how the seismologist’s procedure turns into a Naive Bayes model: the seismologist’s intuitive scores for each sensor correspond to the “evidence” $\ln \frac{P[X_i | \Theta = 1]}{P[X_i | \Theta = 0]}$ from the sensor. The “earthquake score” then corresponds to the posterior log odds of an earthquake. The seismologist has unwittingly adopted a statistical model. Note that this is still true regardless of whether the scores used are well-calibrated or whether the assumptions of the model hold—the seismologist is implicitly using this model, and whether the model is correct is an entirely separate question.
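To make the correspondence concrete, here is a small Python check (with made-up sensor reliabilities) that the seismologist’s add-up-the-scores calculation reproduces exactly the Naive Bayes posterior, provided each score is the sensor’s log likelihood ratio and the prior log odds is added in:

```python
import math

# Illustrative numbers, not real sensor data.
prior = 0.01                       # P[earthquake]
p_fire_given_quake = [0.95, 0.80]  # P[sensor i fires | earthquake]
p_fire_given_none  = [0.10, 0.30]  # P[sensor i fires | no earthquake]
readings = [True, False]           # sensor 0 firing, sensor 1 quiet

def score(i, firing):
    """Evidence from sensor i: log of P[X_i | quake] / P[X_i | no quake]."""
    if firing:
        return math.log(p_fire_given_quake[i] / p_fire_given_none[i])
    return math.log((1 - p_fire_given_quake[i]) / (1 - p_fire_given_none[i]))

# The seismologist's calculation: prior log odds plus per-sensor scores.
log_odds = math.log(prior / (1 - prior)) + sum(
    score(i, r) for i, r in enumerate(readings))

# The same posterior, computed the long way via Bayes' rule.
def joint(quake):
    p = prior if quake else 1 - prior
    for i, r in enumerate(readings):
        p_fire = p_fire_given_quake[i] if quake else p_fire_given_none[i]
        p *= p_fire if r else 1 - p_fire
    return p

posterior = joint(True) / (joint(True) + joint(False))
# log_odds and log(posterior / (1 - posterior)) agree up to float error.
```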

The Embedded Naive Bayes Equation

Let’s formalize this a bit.

We have some system which takes in data $X$, computes some stuff, and spits out some $f(X)$. We want to know whether a Naive Bayes model is embedded in $f$. Conceptually, we imagine that $f(X)$ parameterizes a probability distribution over some unobserved parameter $\Theta$ - we’ll write $P[\Theta; f(X)]$, where the “;” is read as “parameterized by”. For instance, we could imagine a normal distribution over $\Theta$, in which case $f(X)$ might be the mean and variance (or any encoding thereof) computed from our input data. In our earthquake example, $\Theta$ is a binary variable, so $f(X)$ is just some encoding of the probability that $\Theta = 1$.
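For concreteness, here are two ways such a parameterization could look, sketched in Python (the encodings are illustrative assumptions, not the only options):

```python
import math

# Binary Θ: suppose f(X) encodes P[Θ = 1] as a single log-odds number.
def p_theta_is_1(f_of_x):
    """Decode a log-odds encoding back into a probability."""
    return 1 / (1 + math.exp(-f_of_x))

# Real-valued Θ: suppose f(X) encodes a normal distribution over Θ via the
# sample mean and variance of the data.
def normal_params(data):
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, var
```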

Now let’s write the actual equation defining an embedded Naive Bayes model. We assert that $P[\Theta; f(X)]$ is the same as the posterior $P[\Theta | X]$ under the model, i.e.

$$P[\Theta; f(X)] = \frac{1}{Z} P[\Theta] \prod_i P[X_i | \Theta]$$

We can transform to log odds form to get rid of the $Z$:

$$\ln \frac{P[\Theta = \theta; f(X)]}{P[\Theta = \theta_0; f(X)]} = \ln \frac{P[\Theta = \theta]}{P[\Theta = \theta_0]} + \sum_i \ln \frac{P[X_i | \Theta = \theta]}{P[X_i | \Theta = \theta_0]}$$

(here $\theta_0$ is some fixed reference value of $\Theta$; in the binary earthquake case, $\theta_0 = 0$).

Let’s pause for a moment and go through that equation. We know the function $f$, and we want the equation to hold for all values of $X$. $\Theta$ is some hypothetical thing out in the environment—we don’t know what it corresponds to, we just hypothesize that the system is modelling something it can’t directly observe. As with $X$, we want the equation to hold for all values of $\theta$. The unknowns in the equation are the probability functions $P[\Theta]$, $P[X_i | \Theta]$, and the encoding $P[\Theta; \cdot]$. To make it clear what’s going on, let’s remove the probability notation for a moment, and just use functions $g$ and $h$, with $\theta$ written as a subscript:

$$g_\theta(f(X)) = c_\theta + \sum_i h_{\theta i}(X_i)$$

This is a functional equation: for each value of $\theta$, we want to find functions $g_\theta$, $h_{\theta i}$, and a constant $c_\theta$ such that the equation holds for all possible $X$ values. The solutions $h_{\theta i}$ and $g_\theta$ can then be decoded to give our probability functions $P[X_i | \Theta = \theta]$ and $P[\Theta = \theta; f(X)]$, while $c_\theta$ can be decoded to give our prior $P[\Theta = \theta]$. Each possible $\theta$-value corresponds to a different set of solutions $g_\theta$, $h_{\theta i}$, $c_\theta$.

This particular functional equation is a variant of Pexider’s equation; you can read all about it in Aczél’s Functional Equations and Their Applications, chapter 3. For our purposes, the most important point is: depending on the function $f$, the equation may or may not have a solution. In other words, there is a meaningful sense in which some functions $f$ do embed a Naive Bayes model, and others do not. Our seismologist’s procedure does embed a Naive Bayes model: let $g_\theta$ be the identity function, $c_\theta$ be zero, and $h_{\theta i}(X_i)$ be the sensor scores, and we have a solution to the embedding equation with $f$ given by our seismologist’s add-all-the-scores calculation (although this is not the only solution). On the other hand, a procedure computing some non-additively-separable function of real-valued inputs $X_1$, $X_2$, $X_3$ (say, $X_1 + X_2 X_3$) would not embed a Naive Bayes model: with this $f$, the embedding equation would not have any solutions.
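For intuition, here is a quick numerical check of both cases in Python. It only probes the special case where $g$ is the identity: an additive $f$ (like the seismologist’s) satisfies a “rectangle condition”—changing one input shifts $f$ by an amount independent of the other inputs—while a function with a multiplicative interaction, such as $X_1 + X_2 X_3$, violates it. Ruling out every possible $g$ takes the functional-equation machinery above; this is just a sanity check.

```python
def mixed_difference(f, i, j, a, b):
    """f(a) - f(a with coord i swapped) - f(a with coord j swapped)
    + f(a with both swapped), swapping entries of a for entries of b.
    Zero for every choice of a, b exactly when inputs i and j don't interact.
    """
    def swap(coords):
        return tuple(b[k] if k in coords else a[k] for k in range(len(a)))
    return f(a) - f(swap({i})) - f(swap({j})) + f(swap({i, j}))

def additive(x):   # an f that embeds: a weighted sum of sensor scores
    return 3.0 * x[0] + 2.0 * x[1] + 0.5 * x[2]

def mixed(x):      # an f with an interaction term between inputs 1 and 2
    return x[0] + x[1] * x[2]

a, b = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
print(mixed_difference(additive, 1, 2, a, b))  # 0.0
print(mixed_difference(mixed, 1, 2, a, b))     # 9.0
```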