# Embedded Naive Bayes

Suppose we have a bunch of earthquake sensors spread over an area. They are not perfectly reliable (in terms of either false positives or false negatives), but some are more reliable than others. How can we aggregate the sensor data to detect earthquakes?

A “naive” seismologist without any statistics background might try assigning different numerical scores to each sensor, roughly indicating how reliable their positive and negative results are, just based on the seismologist’s intuition. Sensor $i$ gets a score $s_i^+$ when it’s going off, and $s_i^-$ when it’s not. Then, the seismologist can add up the scores for all sensors going off at a given time, plus the scores for sensors not going off, to get an aggregate “earthquake score”. Assuming the seismologist has decent intuitions for the sensors, this will probably work just fine.
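In code, the seismologist’s procedure is just a lookup-and-add. A minimal sketch, with made-up scores standing in for the seismologist’s intuitions:

```python
# Intuitive reliability scores for each sensor (made-up numbers):
# one score counted when the sensor is going off, another when it's quiet.
scores_on = [2.0, 1.5, 3.0]      # score added when sensor i is going off
scores_off = [-1.0, -0.5, -2.0]  # score added when sensor i is quiet

def earthquake_score(readings):
    """Add up scores_on[i] for each sensor going off, scores_off[i] otherwise."""
    return sum(scores_on[i] if r else scores_off[i] for i, r in enumerate(readings))

print(earthquake_score([True, False, True]))  # 2.0 - 0.5 + 3.0 = 4.5
```

A high aggregate score suggests an earthquake; where to put the threshold is left to the seismologist’s judgment, just like the scores themselves.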

It turns out that this procedure is equivalent to a Naive Bayes model.

Naive Bayes is a causal model in which there is some parameter $X$ in the environment which we want to know about, i.e. whether or not there’s an earthquake happening. We can’t observe $X$ directly, but we can measure it indirectly via some data $D$, i.e. outputs from the earthquake sensors. The measurements may not be perfectly accurate, but their failures are at least independent: one sensor isn’t any more or less likely to be wrong when another sensor is wrong.

We can represent this picture with a causal diagram, with the unobserved $X$ pointing to each of the measurements $D_1, \dots, D_n$:

From the diagram, we can read off the model’s equation: $P[X, D] = P[X] \prod_i P[D_i | X]$. We’re interested mainly in the posterior probability $P[X | D] = \frac{1}{Z} P[X] \prod_i P[D_i | X]$ or, in log odds form,

$$\ln \frac{P[X=1 | D]}{P[X=0 | D]} = \ln \frac{P[X=1]}{P[X=0]} + \sum_i \ln \frac{P[D_i | X=1]}{P[D_i | X=0]}$$

Stare at that equation, and it’s not hard to see how the seismologist’s procedure turns into a Naive Bayes model: the seismologist’s intuitive scores for each sensor correspond to the “evidence” terms $\ln \frac{P[D_i | X=1]}{P[D_i | X=0]}$ from the sensors. The “earthquake score” then corresponds (up to the prior term) to the posterior log odds of an earthquake. The seismologist has unwittingly adopted a statistical model. Note that this is still true regardless of whether the scores used are well-calibrated or whether the assumptions of the model hold: the seismologist is implicitly using this model, and whether the model is correct is an entirely separate question.
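The correspondence can be checked numerically. In this sketch (with made-up sensor reliabilities), each score is set to the sensor’s log likelihood ratio, and the seismologist’s add-the-scores total plus the prior log odds is compared against the posterior log odds computed by brute-force Bayes:

```python
from math import log

p_quake = 0.01                       # prior P(X=1), made up
p_on_given_quake = [0.9, 0.7, 0.95]  # P(D_i = on | X=1), made up
p_on_given_calm = [0.1, 0.2, 0.05]   # P(D_i = on | X=0), made up

def posterior_log_odds(readings):
    """Brute-force Bayes: ln( P(X=1|D) / P(X=0|D) ) via the joint distribution."""
    def joint(x):
        p = p_quake if x else (1 - p_quake)
        for r, p1, p0 in zip(readings, p_on_given_quake, p_on_given_calm):
            on_prob = p1 if x else p0
            p *= on_prob if r else (1 - on_prob)
        return p
    return log(joint(1) / joint(0))

def score_sum(readings):
    """The seismologist's procedure, with scores chosen as log likelihood ratios."""
    return sum(
        log(p1 / p0) if r else log((1 - p1) / (1 - p0))
        for r, p1, p0 in zip(readings, p_on_given_quake, p_on_given_calm)
    )

readings = [True, False, True]
prior_term = log(p_quake / (1 - p_quake))
assert abs(posterior_log_odds(readings) - (prior_term + score_sum(readings))) < 1e-9
```

With any other (uncalibrated) scores, the sum is still the posterior log odds of *some* Naive Bayes model, just not one whose conditional probabilities match the actual sensors.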

## The Embedded Naive Bayes Equation

Let’s formalize this a bit.

We have some system which takes in data $D$, computes some stuff, and spits out some $f(D)$. We want to know whether a Naive Bayes model is embedded in $f$. Conceptually, we imagine that $f(D)$ parameterizes a probability distribution over some unobserved parameter $X$: we’ll write $P[X; f(D)]$, where the “;” is read as “parameterized by”. For instance, we could imagine a normal distribution over $X$, in which case $f(D)$ might be the mean and variance (or any encoding thereof) computed from our input data. In our earthquake example, $X$ is a binary variable, so $f(D)$ is just some encoding of the probability that $X = 1$.

Now let’s write the actual equation defining an embedded Naive Bayes model. We assert that $P[X; f(D)]$ is the same as $P[X | D]$ under the model, i.e.

$$P[X; f(D)] = \frac{1}{Z} P[X] \prod_i P[D_i | X]$$

We can transform to log odds form to get rid of the $Z$ (writing $x_0$ for a fixed reference value of $X$; in the earthquake example, $x = 1$ and $x_0 = 0$):

$$\ln \frac{P[X=x; f(D)]}{P[X=x_0; f(D)]} = \ln \frac{P[X=x]}{P[X=x_0]} + \sum_i \ln \frac{P[D_i | X=x]}{P[D_i | X=x_0]}$$

Let’s pause for a moment and go through that equation. We know the function $f$, and we want the equation to hold for all values of $D$. $X$ is some hypothetical thing out in the environment: we don’t know what it corresponds to, we just hypothesize that the system is modelling something it can’t directly observe. As with $D$, we want the equation to hold for all values of $X$. The unknowns in the equation are the probability functions $P[X; f(D)]$, $P[X]$, and $P[D_i | X]$. To make it clear what’s going on, let’s remove the probability notation for a moment, and just use functions $g$ and $h$, with $x$ written as a subscript:

$$g_x(f(D)) = a_x + \sum_i h_{x,i}(D_i)$$

This is a functional equation: for each value of $x$, we want to find functions $g_x$, $h_{x,i}$, and a constant $a_x$ such that the equation holds for all possible $D$ values. The solutions $g_x$ and $h_{x,i}$ can then be decoded to give our probability functions $P[X; f(D)]$ and $P[D_i | X]$, while $a_x$ can be decoded to give our prior $P[X]$. Each possible $x$-value corresponds to a different set of solutions $g_x$, $h_{x,i}$, $a_x$.
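For binary sensors, a candidate solution to the functional equation can be checked by brute force over all inputs. A sketch for the seismologist’s case (made-up scores; $g$ taken as the identity and $a = 0$):

```python
from itertools import product

scores_on = [2.0, 1.5, 3.0]      # made-up sensor scores
scores_off = [-1.0, -0.5, -2.0]

def f(D):
    # The seismologist's add-all-the-scores computation.
    return sum(s1 if d else s0 for d, s1, s0 in zip(D, scores_on, scores_off))

# Candidate solution: g = identity, a = 0, h_i(D_i) = sensor i's score.
g = lambda t: t
a = 0.0
h = [lambda d, i=i: scores_on[i] if d else scores_off[i] for i in range(3)]

# Check g(f(D)) == a + sum_i h_i(D_i) for every possible input D.
for D in product([False, True], repeat=3):
    assert g(f(D)) == a + sum(h[i](D[i]) for i in range(3))
print("embedding equation holds for all inputs")
```

For continuous $D$, no such exhaustive check is possible, which is where the theory of functional equations earns its keep.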

This particular functional equation is a variant of Pexider’s equation; you can read all about it in Aczel’s Functional Equations and Their Applications, chapter 3. For our purposes, the most important point is: depending on the function $f$, the equation may or may not have a solution. In other words, there is a meaningful sense in which some functions $f$ do embed a Naive Bayes model, and others do not. Our seismologist’s procedure does embed a Naive Bayes model: let $g_x$ be the identity function, $a_x$ be zero, and $h_{x,i}$ give sensor $i$’s scores, and we have a solution to the embedding equation with $f$ given by our seismologist’s add-all-the-scores calculation (although this is not the only solution). On the other hand, a procedure computing something like $f(D) = D_1 D_2 + D_3$ for real-valued inputs $D_1$, $D_2$, $D_3$ would not embed a Naive Bayes model: with this $f$, the embedding equation would not have any solutions.
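One way to see how an embedding can fail: if $g$ is the identity, the embedding equation forces $f$ to be additively separable across the $D_i$, and separability can be refuted by a nonzero mixed second difference. A sketch (the particular functions are illustrative, not from the theory):

```python
def mixed_diff(f, i, j, base, xi, xj):
    """Mixed second difference of f in coordinates i and j. This is zero
    whenever f(D) = a + sum_k h_k(D_k), i.e. f is additively separable."""
    def at(vi, vj):
        D = list(base)
        D[i], D[j] = vi, vj
        return f(D)
    return at(xi, xj) - at(base[i], xj) - at(xi, base[j]) + at(base[i], base[j])

f_sum = lambda D: 2.0 * D[0] + 1.5 * D[1] + 3.0 * D[2]  # additive: separable
f_prod = lambda D: D[0] * D[1] + D[2]                   # interaction term: not

print(mixed_diff(f_sum, 0, 1, [0.0, 0.0, 0.0], 1.0, 1.0))   # 0.0
print(mixed_diff(f_prod, 0, 1, [0.0, 0.0, 0.0], 1.0, 1.0))  # 1.0
```

The full equation allows a nontrivial outer function $g$, so a vanishing mixed difference for $f$ itself is sufficient but not necessary; this check only rules out solutions with $g$ equal to the identity.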

• This seems interesting, but I think there are some assumptions that have been left out? What prevents us from taking $g_x$ and $h_{x,i}$ to be constants, which would give solutions for any $f$?

• I am indeed leaving out some assumptions, mainly because I am not yet convinced of which assumptions are “right”. The simplest assumption, used by Aczel, is that $g_x$ and $h_{x,i}$ are monotonic. But that’s usually chosen more for mathematical convenience than for any principled reason, as far as I can tell. We certainly want some assumptions which rule out the trivial solution, but I’m not sure what they should be.

• What do you mean by embedded here? It seems you are asking the question “does a particular input-output behavior / computation correspond to some Naive Bayes model”, which is not what I would intuitively think of as “embedded Naive Bayes”.

• Here’s the use case I have in mind. We have some neural network or biological cell or something performing computation. It’s been optimized via gradient descent/evolution, and we have some outside-view arguments saying that optimal reasoning should approximate Bayesian inference. We also know that the “true” behavior of the environment is causal, so optimal reasoning for our system should approximate Bayesian reasoning on some causal model of the environment.

The problem, then, is to go check whether the system actually is approximating Bayesian reasoning over some causal model, and what that causal model is. In other words, we want to check whether the system has a particular causal model (e.g. a Naive Bayes model) of its input data embedded within it.

What do you imagine “embedded” to mean?

• I usually imagine the problems of embedded agency (at least when I’m reading LW/AF), where the central issue is that the agent is a part of its environment (in contrast to the Cartesian model, where there is a clear, bright line dividing the agent and the environment). Afaict, “embedded Naive Bayes” is something that makes sense in a Cartesian model, which I wasn’t expecting.

It’s not that big a deal, but if you want to avoid that confusion, you might want to change the word “embedded”. I kind of want to say “The Intentional Stance towards Naive Bayes”, but that’s not right either.

• Ok, that’s what I was figuring. My general position is that the problems of agents embedded in their environment reduce to problems of abstraction, i.e. world-models embedded in computations which do not themselves obviously resemble world-models. At some point I’ll probably write that up in more detail, although the argument remains informal for now.

The immediately important point is that, while the OP makes sense in a Cartesian model, it also makes sense without a Cartesian model. We can just have some big computation, and pick a little chunk of it at random, and say “does this part here embed a Naive Bayes model?” In other words, it’s the sort of thing you could use to detect agenty subsystems, without having a Cartesian boundary drawn in advance.