Beyond Bayesians and Frequentists

(Note: this is cross-posted from my blog and also available in pdf here.)

If you are a newly initiated student into the field of machine learning, it won’t be long before you start hearing the words “Bayesian” and “frequentist” thrown around. Many people around you probably have strong opinions on which is the “right” way to do statistics, and within a year you’ve probably developed your own strong opinions (which are suspiciously similar to those of the people around you, despite there being a much greater variance of opinion between different labs). In fact, now that the year is 2012, the majority of new graduate students are being raised as Bayesians (at least in the U.S.), with frequentists thought of as stodgy emeritus professors stuck in their ways.

If you are like me, the preceding set of facts will make you very uneasy. They will make you uneasy because simple pattern-matching—the strength of people’s opinions, the reliability with which these opinions split along age boundaries and lab boundaries, and the ridicule that each side levels at the other camp—makes the “Bayesians vs. frequentists” debate look far more like politics than like scholarly discourse. Of course, that alone does not necessarily prove anything; these disconcerting similarities could just be coincidences that I happened to cherry-pick.

My next point, then, is that we are right to be uneasy, because such debate makes us less likely to evaluate the strengths and weaknesses of both approaches in good faith. This essay is a push against that—I summarize the justifications for Bayesian methods and where they fall short, show how frequentist approaches can fill in some of those shortcomings, and then present my personal (though probably woefully under-informed) guidelines for choosing which type of approach to use.

Before doing any of this, though, a bit of background is in order...

1. Background on Bayesians and Frequentists

1.1. Three Levels of Argument

As Andrew Critch [6] insightfully points out, the Bayesians vs. frequentists debate is really three debates at once, centering around one or more of the following arguments:

  1. Whether to interpret subjective beliefs as probabilities.

  2. Whether to interpret probabilities as subjective beliefs (as opposed to asymptotic frequencies).

  3. Whether a Bayesian or frequentist algorithm is better suited to solving a particular problem.

Given my own research interests, I will add a fourth argument:

  4. Whether Bayesian or frequentist techniques are better suited to engineering an artificial intelligence.

Andrew Gelman [9] has his own well-written essay on the subject, where he expands on these distinctions and presents his own more nuanced view.

Why are these arguments so commonly conflated? I’m not entirely sure; I would guess it is for historical reasons, though I have so far been unable to track those reasons down. Whatever the cause, what this boils down to in the present day is that people often form opinions on arguments 1 and 2, which then influence their answers to 3 and 4. This is not good, since 1 and 2 are philosophical in nature and difficult to resolve correctly, whereas 3 and 4 are often much easier to resolve and extremely important to resolve correctly in practice. Let me reiterate: the Bayesian vs. frequentist discussion should center on the practical employment of the two methods, or, if epistemology must be discussed, it should be clearly separated from the day-to-day practical decisions. Aside from the difficulty of deciding epistemology correctly, the relationship between generic epistemology and specific practices in cutting-edge statistical research runs only through a long causal chain, and it should be completely unsurprising if Bayesian epistemology leads to the employment of frequentist tools or vice versa.

For this reason, and for reasons of space, I will spend the remainder of the essay focusing on statistical algorithms rather than on interpretations of probability. For those who really want to discuss interpretations of probability, I will address that topic in a later essay.

1.2. Recap of Bayesian Decision Theory

(What follows will be review for many.) In Bayesian decision theory, we assume that there is some underlying world state θ and a likelihood function p(X1,...,Xn | θ) over possible observations. (A likelihood function is just a conditional probability distribution in which the parameter being conditioned on can vary.) We also have a space A of possible actions and a utility function U(θ; a) that gives the utility of performing action a if the underlying world state is θ. We can incorporate notions like planning and value of information by defining U(θ; a) recursively in terms of an identical agent to ourselves who has seen one additional observation (or, if we are planning against an adversary, in terms of the adversary). For a more detailed overview of this material, see the tutorial by North [11].

What distinguishes the Bayesian approach in particular is one additional assumption, a prior distribution p(θ) over possible world states. To make a decision with respect to a given prior, we compute the posterior distribution p_posterior(θ | X1,...,Xn) using Bayes’ theorem, then take the action a that maximizes the posterior expected utility E[U(θ; a) | X1,...,Xn] = Σ_θ p_posterior(θ | X1,...,Xn) U(θ; a).

In practice, p_posterior(θ | X1,...,Xn) can be quite difficult to compute, and so we often attempt to approximate it. Such attempts are known as approximate inference algorithms.
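To make the recipe above concrete, here is a minimal sketch of Bayesian decision-making on a discrete world state. The two-state coin model, the uniform prior, the data, and the utility values are all illustrative assumptions, not anything from the text:

```python
import numpy as np

thetas = np.array([0.3, 0.7])   # possible world states: the bias of a coin
prior = np.array([0.5, 0.5])    # prior p(theta), here uniform
data = [1, 1, 0, 1]             # observed coin flips X_1, ..., X_n

# Likelihood p(X_1,...,X_n | theta), assuming i.i.d. Bernoulli flips.
likelihood = np.array(
    [np.prod([t if x else 1 - t for x in data]) for t in thetas]
)

# Posterior via Bayes' theorem: p(theta | X) is proportional to p(X | theta) p(theta).
posterior = likelihood * prior
posterior /= posterior.sum()

# Utility U(theta; a) for two hypothetical actions, one per world state.
U = {"bet_low": np.array([1.0, -1.0]), "bet_high": np.array([-1.0, 1.0])}

# Choose the action maximizing posterior expected utility.
best = max(U, key=lambda a: posterior @ U[a])
print(posterior, best)
```

With three heads in four flips, the posterior concentrates on the higher bias and the corresponding bet wins on expected utility.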

1.3. Steel-manning Frequentists

There are many different ideas that fall under the broad umbrella of frequentist techniques. While it would be impossible to adequately summarize all of them even if I attempted to, there are three in particular that I would like to describe, which I will call frequentist decision theory, frequentist guarantees, and frequentist analysis tools.

Frequentist decision theory has a very similar setup to Bayesian decision theory, with a few key differences; these differences are discussed in detail and contrasted with Bayesian decision theory in [10], but we summarize them here. There is still a likelihood function p(X1,...,Xn | θ) and a utility function U(θ; a). However, we do not assume the existence of a prior on θ, and instead choose the decision rule a(X1,...,Xn) that maximizes the worst-case expected utility

(1)    min_θ E[U(θ; a(X1,...,Xn)) | θ],

where the expectation is over X1,...,Xn drawn from p(X1,...,Xn | θ).
In other words, we ask for a worst-case guarantee rather than an average-case guarantee. As an example of how these would differ, imagine a scenario where we have no data to observe, an unknown θ in {1,...,N}, and we choose an action a in {0,...,N}. Furthermore, U(θ; 0) = 0 for all θ, U(θ; a) = −1 if a = θ, and U(θ; a) = 1 if a ≠ 0 and a ≠ θ. Then a frequentist will always choose a = 0, because any other action gets −1 utility in the worst case; a Bayesian with a uniform prior, on the other hand, will happily choose any non-zero value of a, since such an action gains (N−2)/N utility in expectation. (I am purposely ignoring more complex ideas like mixed strategies for the purpose of illustration.)
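The toy game above can be checked directly. This sketch, with an illustrative N = 10 and a uniform prior for the Bayesian, computes the worst-case-optimal and average-case-optimal actions:

```python
import numpy as np

# The toy game: theta in {1..N}, action a in {0..N};
# U(theta; 0) = 0, U(theta; a) = -1 if a == theta, else +1 for a != 0.
N = 10
U = np.zeros((N, N + 1))            # U[theta - 1, a]
for theta in range(1, N + 1):
    for a in range(1, N + 1):
        U[theta - 1, a] = -1.0 if a == theta else 1.0

# Frequentist: maximize the worst-case utility min_theta U(theta; a).
freq_action = int(np.argmax(U.min(axis=0)))

# Bayesian with a uniform prior: maximize the average utility.
bayes_action = int(np.argmax(U.mean(axis=0)))

print(freq_action)                  # 0: every other action risks -1
print(U.mean(axis=0)[bayes_action])  # (N - 2) / N for any nonzero action
```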

Note that the frequentist optimization problem is more complicated than in the Bayesian case, since the value of (1) depends jointly on the behavior of a(X1,...,Xn) across all possible observation sequences, whereas with Bayes we can optimize a(X1,...,Xn) separately for each set of observations.

As a result of this more complex optimization problem, it is often not actually possible to maximize (1), so many frequentist techniques instead develop tools to lower-bound (1) for a given decision procedure, and then try to construct a decision procedure that is reasonably close to the optimum. Support vector machines [2], which try to pick separating hyperplanes that minimize generalization error, are one example where the algorithm explicitly tries to maximize worst-case utility. Another example of a frequentist decision procedure is L1-regularized least squares for sparse recovery [3], where the procedure itself does not look like it is explicitly maximizing any utility function, but a separate analysis shows that it is close to the optimal procedure anyway.

The second sort of frequentist approach to statistics is what I call a frequentist guarantee: a guarantee that, with high probability with respect to how the data was generated, the output of the algorithm will satisfy a given property. The most familiar example is any algorithm that generates a frequentist confidence interval: to generate a 95% frequentist confidence interval for a parameter θ is to run an algorithm that outputs an interval such that, with probability at least 95%, θ lies within the interval. An important fact about most such algorithms is that the size of the interval grows only logarithmically with the amount of confidence we require, so getting a 99.9999% confidence interval is only slightly harder than getting a 95% confidence interval (and we should probably be asking for the former whenever possible).
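As a rough illustration of how weakly interval width depends on the confidence level, consider a Hoeffding-style interval for the mean of n samples bounded in [0, 1]: with probability 1 − δ the true mean lies within the half-width below of the sample mean. The sample size here is an arbitrary choice:

```python
import math

def half_width(n, delta):
    # Hoeffding bound: width grows like sqrt(log(1 / delta)), so extra
    # confidence is cheap.
    return math.sqrt(math.log(2 / delta) / (2 * n))

n = 10_000
w95 = half_width(n, 0.05)   # 95% confidence interval
w6 = half_width(n, 1e-6)    # 99.9999% confidence interval
print(w95, w6, w6 / w95)    # the 99.9999% interval is only about 2x wider
```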

If we use such algorithms to test hypotheses, or to test discrete properties of θ, then we can obtain algorithms that take in probabilistically generated data and produce an output that with high probability depends only on how the data was generated, not on the specific random samples that were given. For instance, we can create an algorithm that takes in samples from two distributions and is guaranteed to output 1 whenever the distributions are the same, to output 0 whenever they differ by at least ε in total variation distance, and may have arbitrary output if they are different but the total variation distance is less than ε. This is an amazing property—the algorithm takes in random input and produces an essentially deterministic answer.
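Here is a sketch of such an essentially deterministic test for the special case of two Bernoulli distributions; the ε/2 threshold and the sample size are illustrative choices, and the high-probability argument (via concentration of the sample means) is only gestured at in the comments:

```python
import random

def same_distribution(xs, ys, eps):
    # Output 1 if the sample means are close, 0 otherwise. With enough
    # samples, the means concentrate, so the output depends (w.h.p.) only
    # on the true distributions, not on the particular samples.
    gap = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    return 1 if gap < eps / 2 else 0

random.seed(0)
eps, n = 0.2, 5000
same = [random.random() < 0.5 for _ in range(n)]
also_same = [random.random() < 0.5 for _ in range(n)]
different = [random.random() < 0.5 + eps for _ in range(n)]

print(same_distribution(same, also_same, eps))   # 1 with high probability
print(same_distribution(same, different, eps))   # 0 with high probability
```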

Finally, a third type of frequentist approach seeks to construct analysis tools for understanding the behavior of random variables. Metric entropy, the Chernoff and Azuma-Hoeffding bounds [12], and Doob’s optional stopping theorem are representative examples of this sort of approach. Arguably, everyone with the time to spare should master these techniques, since being able to analyze random variables is important no matter what approach to statistics you take. Indeed, frequentist analysis tools have no conflict at all with Bayesian methods—they simply provide techniques for understanding the behavior of the Bayesian model.

2. Bayes vs. Other Methods

2.1. Justification for Bayes

We presented Bayesian decision theory above, but are there any reasons why we should actually use it? One commonly given reason is that Bayesian statistics is merely the application of Bayes’ theorem, which, being a theorem, describes the only correct way to update beliefs in response to new evidence; anything else can only be justified to the extent that it provides a good approximation to Bayesian updating. This may be true, but Bayes’ theorem only applies if we already have a prior, and if we accept probability as the correct framework for expressing uncertain beliefs. We might want to avoid one or both of these assumptions. Bayes’ theorem also doesn’t explain why we care about expected utility as opposed to some other statistic of the distribution over utilities (although note that frequentist decision theory also tries to maximize expected utility).

One compelling answer to this is the Dutch book argument, which shows that any agent must implicitly be using a probability model to make decisions, or else there is a series of bets the agent would be willing to accept that causes it to lose money with certainty. Another answer is the complete class theorem, which shows that any non-Bayesian decision procedure is dominated by some Bayesian decision procedure, meaning that the Bayesian procedure performs at least as well in every possible case. In other words, if you are doing anything non-Bayesian, then either it is secretly a Bayesian procedure or there is another procedure that does strictly better. Finally, the von Neumann-Morgenstern (VNM) utility theorem states that any agent with consistent preferences over distributions of outcomes must implicitly be maximizing the expected value of some scalar-valued function, which we can then take as our utility function U. These theorems, however, ignore the issue of computation—while the best decision procedure may be Bayesian, the best computationally efficient decision procedure could easily be non-Bayesian.

Another justification for Bayes is that, in contrast to ad hoc frequentist techniques, it actually provides a general theory for constructing statistical algorithms, as well as for incorporating side information such as expert knowledge. Indeed, when trying to model complex and highly structured situations it is difficult to obtain any sort of frequentist guarantee (although analysis tools can still often be applied to gain intuition about parts of the model). A prior lets us write down the sorts of models that can capture structured situations (for instance, in language modeling or transfer learning). Non-Bayesian methods exist for these situations, but they are often ad hoc and in many cases end up looking like an approximation to Bayes. One example is Kneser-Ney smoothing for n-gram models, an ad hoc algorithm that turned out to be very similar to an approximate inference algorithm for the hierarchical Pitman-Yor process [15, 14, 17, 8]. This raises another important point against Bayes, which is that the proper Bayesian interpretation may be very mathematically complex: Pitman-Yor processes are at the cutting edge of Bayesian nonparametric statistics, which is itself one of the more technical subfields of statistical machine learning, so it was probably much easier to come up with Kneser-Ney smoothing than to find the interpretation in terms of Pitman-Yor processes.

2.2. When the Justifications Fail

The first and most common objection to Bayes is that a Bayesian method is only as good as its prior. While for simple models the performance of Bayes is relatively independent of the prior, such models can only capture data on which frequentist techniques would also perform very well. For more complex (especially nonparametric) Bayesian models, the performance can depend strongly on the prior, and designing good priors is still an open problem. As one example, I point to my own research on hierarchical nonparametric models, where the most straightforward attempts to build a hierarchical model lead to severe pathologies [13].

Even if a Bayesian model does have a good prior, it may be computationally intractable to perform posterior inference. For instance, structure learning in Bayesian networks is NP-hard [4], as is topic inference in the popular latent Dirichlet allocation model (and this continues to hold even if we only want to perform approximate inference). Similar stories probably hold for other common models, although a theoretical survey has yet to be made; suffice it to say that in practice approximate inference remains a difficult and unsolved problem, with many models not even considered because of the apparent hopelessness of performing inference in them.

Because frequentist methods often come with an analysis of the specific algorithm being employed, they can sometimes overcome these computational issues. One example, mentioned already, is L1-regularized least squares [3]. The problem setup is a linear regression task Ax = b + v, where A and b are known, v is a noise vector, and x is believed to be sparse (typically x has many more rows than b, so without the sparsity assumption x would be underdetermined). Suppose that x has n rows, of which only k are non-zero—then the number of possible sparsity patterns is (n choose k), which is large enough that a brute-force consideration of all possible sparsity patterns is intractable. However, we can show that solving a certain semidefinite program will with high probability yield the appropriate sparsity pattern, after which recovering x reduces to a simple least squares problem. (A semidefinite program is a certain type of optimization problem that can be solved efficiently [16].)
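As a simplified illustration of sparse recovery, the sketch below solves the L1-regularized least squares problem by iterative soft-thresholding (ISTA), a common solver, rather than the semidefinite-program formulation discussed above; the dimensions, regularization strength lam, and step size are all illustrative choices:

```python
import numpy as np

# Solve min_x (1/2) ||Ax - b||^2 + lam * ||x||_1 by ISTA.
rng = np.random.default_rng(0)
n, m, k = 200, 80, 5                 # x has n rows, only k nonzero; m measurements
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
idx = rng.choice(n, k, replace=False)
x_true[idx] = 3 * rng.choice([-1.0, 1.0], k)
b = A @ x_true + 0.01 * rng.standard_normal(m)

lam = 0.05
step = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm
x = np.zeros(n)
for _ in range(2000):
    grad = A.T @ (A @ x - b)             # gradient of the quadratic term
    z = x - step * grad
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # soft-threshold

support = set(np.flatnonzero(np.abs(x) > 0.1))
print(sorted(support))  # estimated sparsity pattern; compare np.flatnonzero(x_true)
```

Once the sparsity pattern is in hand, x can be re-estimated by ordinary least squares restricted to that support, as described above.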

Finally, Bayes has no good way of dealing with adversaries, or with cases where the data was generated in a complicated way that could make it highly biased (for instance, as the output of an optimization procedure). A toy example of an adversary would be playing rock-paper-scissors—how should a Bayesian play such a game? The straightforward answer is to build up a model of the opponent based on their plays so far, and then to make the play that maximizes the expected score (probability of winning minus probability of losing). However, such a strategy fares poorly against any opponent with access to the model being used, since they can then just run the model themselves to predict the Bayesian’s plays in advance, thereby winning every single time. In contrast, there is a frequentist strategy called the multiplicative weights update method that fares well against an arbitrary opponent (even one with superior computational resources and access to our agent’s source code). The multiplicative weights method does far more than win at rock-paper-scissors—it is also a key component of fast algorithms for many important optimization problems (including maximum flow), and it forms the theoretical basis for the widely used AdaBoost algorithm [1, 5, 7].
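A minimal sketch of multiplicative weights playing rock-paper-scissors follows. For simplicity the opponent below always plays rock; the real point of the method is its regret guarantee against adaptive adversaries, which this example does not exercise. The learning rate eta and the number of rounds are illustrative choices:

```python
import random

random.seed(0)
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    # +1 for a win, -1 for a loss, 0 for a tie.
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

weights = {a: 1.0 for a in ACTIONS}
eta, T, total = 0.1, 5000, 0
for t in range(T):
    # Play a mixed strategy: sample an action in proportion to the weights.
    total_w = sum(weights.values())
    mine = random.choices(ACTIONS, [weights[a] / total_w for a in ACTIONS])[0]
    theirs = "rock"                  # a trivially predictable opponent
    total += payoff(mine, theirs)
    # Multiplicatively reward each action by how it would have performed.
    for a in ACTIONS:
        weights[a] *= (1 + eta * payoff(a, theirs))

print(total / T)   # approaches +1, the payoff of the best fixed action (paper)
```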

2.3. When To Use Each Method

The essential difference between Bayesian and frequentist decision theory is that Bayes makes the additional assumption of a prior over θ, and optimizes for average-case rather than worst-case performance. It follows, then, that Bayes is the superior method whenever we can obtain a good prior and good average-case performance is sufficient. However, if we have no way of obtaining a good prior, or if we need guaranteed performance, frequentist methods are the way to go. For instance, if we are trying to build a software package that should be widely deployable, we might want to use a frequentist method, because users can then be sure that the software will work as long as some number of easily checkable assumptions are met.

A nice middle ground between purely Bayesian and purely frequentist methods is to use a Bayesian model coupled with frequentist model-checking techniques; this gives us the freedom in modeling afforded by a prior, but also some degree of confidence that our model is correct. This approach is suggested by both Gelman [9] and Jordan [10].

3. Conclusion

When the assumptions of Bayes’ theorem hold, and when Bayesian updating can be performed computationally efficiently, it is indeed tautological that Bayes is the optimal approach. Even when some of these assumptions fail, Bayes can still be a fruitful approach. However, by working under weaker (sometimes even adversarial) assumptions, frequentist approaches can perform well in very complicated domains even with fairly simple models; with fewer assumptions made at the outset, less work has to be done to ensure that those assumptions are met.

From a research perspective, we should be far from satisfied with either approach—Bayesian methods make stronger assumptions than may be warranted, while frequentist methods provide little in the way of a coherent framework for constructing models and ask for worst-case guarantees, which probably cannot be obtained in general. We should seek to develop a statistical modeling framework that, unlike Bayes, can deal with unknown priors, adversaries, and limited computational resources.

4. Acknowledgements

Thanks to Emma Pierson, Vladimir Slepnev, and Wei Dai for reading preliminary versions of this work and providing many helpful comments.

5. References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta algorithm and applications. Working Paper, 2005.

[2] Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

[3] Emmanuel J. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians. European Mathematical Society, 2006.

[4] D.M. Chickering. Learning Bayesian networks is NP-complete. Lecture Notes in Statistics, Springer, pages 121–130, 1996.

[5] Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the 43rd ACM Symposium on Theory of Computing, 2011.

[6] Andrew Critch. Frequentist vs. Bayesian breakdown: interpretation vs. inference. http://lesswrong.com/lw/7ck/frequentist_vs_bayesian_breakdown_interpretation/.

[7] Yoav Freund and Robert E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, Sep. 1999.

[8] J. Gasthaus and Y.W. Teh. Improvements to the sequence memoizer. In Advances in Neural Information Processing Systems, 2011.

[9] Andrew Gelman. Induction and deduction in Bayesian data analysis. RMM, 2:67–78, 2011.

[10] Michael I. Jordan. Are you a Bayesian or a frequentist? Machine Learning Summer School 2009 (video lecture at http://videolectures.net/mlss09uk_jordan_bfway/).

[11] D. Warner North. A tutorial introduction to decision theory. IEEE Transactions on Systems Science and Cybernetics, SSC-4(3):200–210, Sep. 1968.

[12] Igal Sason. On refined versions of the Azuma-Hoeffding inequality with applications in information theory. CoRR, abs/1111.1977, 2011.

[13] Jacob Steinhardt and Zoubin Ghahramani. Pathological properties of deep Bayesian hierarchies. In NIPS Workshop on Bayesian Nonparametrics, 2011. Extended abstract.

[14] Y.W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, NUS, 2006.

[15] Y.W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. Coling/ACL, 2006.

[16] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, Mar. 1996.

[17] F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y.W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, 2009.