# Bayes Questions

For the first time ever I’ve had reason to use formal Bayesian statistics in my work. I feel this is a cause for streamers and confetti.

However, I’ve got a bit stuck on analysing my confidence levels and I thought what better place than LessWrong to check that I’m making sense. I’m not sure this is strictly what LessWrong is for, but the site says that this is my own personal blog so I guess it’s ok?! I can always be down-voted to hell if not!

***

I’m trying to calculate estimated life before failure of a particular component.

We’ve done a number of tests and the results show a larger than expected variance, with some components not failing even after an extended lifetime. I’m trying to analyse the results to see which probabilistic failure distribution best suits the available data. I have three different distributions (Weibull, log-normal & Birnbaum-Saunders), each of which has a shape parameter and a scale parameter.

For each distribution I created a grid which samples the possible values of these parameters. I’ve given the parameters a log-uniform prior by giving each sampled parameter pair a uniform prior but sampling the parameters geometrically (i.e. each sampled value of the parameter is a fixed multiple of the previous value). I’ve tried other priors and the results seem fairly robust over choice of prior.
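As a concrete illustration (the parameter ranges here are made up, not taken from the post), a geometric grid gives each point a fixed ratio to the previous one, so equal weights over the grid points correspond to a log-uniform prior:

```python
import numpy as np

# Hypothetical parameter ranges; each grid value is a fixed multiple of
# the previous one, so uniform weights over these points act as a
# log-uniform prior on the parameters.
shapes = np.geomspace(0.5, 5.0, num=25)        # shape parameter grid
scales = np.geomspace(100.0, 10000.0, num=25)  # scale parameter grid (hours)
```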

For components which have failed, P(E|H) is the probability density function at the number of hours before failure.

For components which get to a certain age and do not fail, P(E|H) is 1 minus the cumulative distribution function at this number of hours.
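A sketch of that likelihood (assuming SciPy, and using the Weibull case; `failures` and `survivors` are illustrative names for the observed failure times and the censoring ages):

```python
import numpy as np
from scipy.stats import weibull_min

def log_likelihood(shape, scale, failures, survivors):
    # Failed components contribute the pdf at their failure time;
    # components still alive at their test age contribute the survival
    # probability 1 - CDF (logsf) at that age.
    ll = np.sum(weibull_min.logpdf(failures, c=shape, scale=scale))
    ll += np.sum(weibull_min.logsf(survivors, c=shape, scale=scale))
    return ll
```

The log-normal and Birnbaum-Saunders cases would swap in `scipy.stats.lognorm` and `scipy.stats.fatiguelife` respectively.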

This is implemented on a spreadsheet with a tab for each test result. Each tab updates the prior probability into a posterior probability and then uses this as the prior for the next tab. The result is normalised to give a total probability of 1.

Initially I calculate the expected life of the worst component in 1,000,000. For this, I just use the inverse cumulative distribution function with p = 0.000001 and calculate this for all of the potential probability distributions.

The results of this calculation are multiplied by the final probabilities of each distribution being the correct one. Then I sum this over the entire hypothesis space to give the expected life of the worst component in a million.
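A minimal sketch of that calculation (names illustrative: `grid` holds (shape, scale) pairs and `posterior` their normalised probabilities, here for the Weibull case):

```python
import numpy as np
from scipy.stats import weibull_min

def expected_worst_in_million(grid, posterior, p=1e-6):
    # Inverse CDF at p = 1e-6 gives the worst-in-a-million life for each
    # candidate distribution; weight each by its posterior probability
    # and sum over the hypothesis space.
    lives = np.array([weibull_min.ppf(p, c=shape, scale=scale)
                      for shape, scale in grid])
    return float(np.sum(np.asarray(posterior) * lives))
```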

So my first question: is all of the above sound, or have I made some silly mistake in my logic?

***

The part that I’m less confident about is how to analyse my 95% confidence level of the life of the worst component in 1,000,000.

The obvious way to approach this is that I should just calculate my expected value for the worst component in 20,000,000. Then, for any given million that I select, I have a 5% chance of selecting the worst in 20,000,000. This treats 95% as my confidence from the overall weighted model.

Alternatively, I can treat the 95% as referring to my confidence in which is the one correct probability distribution. In this case, after I normalise my probabilities so that the sum of all of the hypotheses is 1, I start deleting the least likely hypotheses. I keep deleting unlikely hypotheses until the sum of all of the remaining hypotheses is <0.95. The last hypothesis which was deleted is the top end of my 95% confidence level.
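A sketch of that deletion procedure (assuming `posterior` is the normalised array of hypothesis probabilities; the returned index is the hypothesis at the boundary of the 95% set):

```python
import numpy as np

def boundary_hypothesis(posterior):
    # Delete hypotheses from least likely upwards until the remaining
    # mass drops below 0.95; the last one deleted marks the top end of
    # the 95% confidence level.
    order = np.argsort(posterior)  # least likely first
    remaining = float(np.sum(posterior))
    for idx in order:
        remaining -= posterior[idx]
        if remaining < 0.95:
            return idx
    return None
```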

Now if I calculate the expected life of the worst component in 1,000,000 for that individual model, I think I can argue that this also represents my 95% confidence level of the worst component in 1,000,000, but in a different way.

Is either of these better than the other? Is there an alternative definition of confidence level which I should be using?

The confidence-in-which-distribution version gives much more conservative answers, particularly when the number of tests is small; the confidence-from-overall-model version is much more forgiving of having fewer tests. Even after only a couple of tests the latter gives a 95% confidence level relatively close to the expected value, whereas the confidence-in-which-distribution version remains further away from the expected value until a larger number of tests is performed.

The more conservative behaviour seems to me more realistic, but I don’t have a proper argument to actually justify a decision either way.

Any help warmly welcomed.

• I’ll walk through how I’d analyze this problem; let me know if I haven’t answered your questions by the end.

First, problem structure. You have three unknowns, which you want to estimate from the data: a shape parameter, a scale parameter, and an indicator of which model is correct.

“Which model is correct” is probably easiest to start with. I’m not completely sure that I’ve followed what your spreadsheet is doing, but if I understand correctly, then that’s probably an overly-complicated way to tackle the problem. You want P(model | data), which will be determined via Bayes’ Rule by P(data | model) and your prior probability for each model being correct. The prior is unlikely to matter much unless your dataset is tiny, so P(data | model) is the important part. That’s an integral:

P(data | model) = ∫ P(data | θ, model) P(θ | model) dθ

In your case, you’re approximating that integral with a grid over μ and σ (θ is a shorthand for the pair (μ, σ) here). Rather than whatever you’re doing with timesteps, you can probably just take the product of P(x_i | θ), where x_i is the lifetime of the i-th component in your dataset, then sum over the grid. (If you are dealing with online data streaming in, then you would need to do the timestep thing.)
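As a sketch, the grid approximation of that integral for one model might look like this (illustrative names; Weibull case, with unfailed components handled via the survival function as in the post):

```python
import numpy as np
from scipy.stats import weibull_min

def marginal_likelihood(shapes, scales, failures, survivors):
    # Approximate P(data | model) by averaging the likelihood over a
    # parameter grid, i.e. a uniform prior on the grid points.
    total = 0.0
    for shape in shapes:
        for scale in scales:
            lik = np.prod(weibull_min.pdf(failures, c=shape, scale=scale))
            lik *= np.prod(weibull_min.sf(survivors, c=shape, scale=scale))
            total += lik
    return total / (len(shapes) * len(scales))
```

Comparing this number across the three candidate models (times any model prior) gives the relative posterior probability of each model.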

That takes care of the model part. Once that’s done, you’ll probably find that one model is like a gazillion times more likely than the other two, and you can just throw out the other two.

On to the 95% CI for the worst part in a million.

The distribution you’re interested in here is P(x_worst | data). x_worst is an order statistic. Its CDF is basically the CDF for any old point raised to the power of 1,000,000; read up on it to see exactly what expression to use. So if we wanted to do this analytically, we’d first compute P(x | data) via Bayes’ Rule:

P(x | data) = P(data, x) / P(data)

… where both pieces on the right would involve our integral from earlier. Basically, you imagine adding one more point x to the dataset and see what that would do to P(data). If we had a closed-form expression for that distribution, then we could just raise the CDF to the millionth power, we’d get a closed-form expression for the millionth order statistic, and from there we’d get a 95% CI in the usual way.
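For reference, the worst component corresponds to the minimum order statistic, so the quantity that gets raised to the millionth power is the survival function 1 − F(x) rather than F itself:

```latex
% CDF of the minimum (worst) of n i.i.d. lifetimes with common CDF F:
F_{(1)}(x) = P\left(\min_{i} X_i \le x\right) = 1 - \bigl(1 - F(x)\bigr)^{n},
\qquad n = 10^6 .
```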

In practice, that’s probably difficult, so let’s talk about how to approximate it numerically.

First, the order statistic part. As long as we can sample from the posterior distribution P(x | data), that part’s easy: generate a million samples of x, take the worst, and you have one sample of x_worst. Repeat that process a bunch of times to compute the 95% CI. (This is not the same as the worst component in 20M, but it’s not any harder to code up.)
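A sketch of that sampling loop, under the simplifying assumption that P(x | data) is just a Weibull with fixed best-fit parameters (the case discussed next):

```python
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)

def sample_worst(n_repeats, shape, scale, n=1_000_000):
    # Each repeat: draw n lifetimes and keep the minimum (the worst
    # component); the repeats form an empirical distribution for the
    # worst-in-a-million lifetime.
    return np.array([weibull_min.rvs(c=shape, scale=scale, size=n,
                                     random_state=rng).min()
                     for _ in range(n_repeats)])

# 95% interval from e.g. 200 repeats:
# lo, hi = np.percentile(sample_worst(200, shape, scale), [2.5, 97.5])
```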

Next, the posterior distribution for x. This is going to be driven by two pieces: uncertainty in the parameters μ and σ, and random noise from the distribution itself. If the dataset is large enough, then the uncertainty in μ and σ will be small, so the distribution itself will be the dominant term. In that case, we can just find the best-fit (i.e. maximum a-posteriori) estimates of μ and σ, and then declare that P(x | data) is approximately the standard distribution (Weibull, log-normal, etc.) with those exact parameters. Presumably we can sample from any of those distributions with known parameters, so we go do the order statistic part and we’re done.

If the uncertainty in μ and σ is not small enough to ignore, then the problem gets more complicated: we’ll need to sample from the posterior distribution P(μ, σ | data). At that point we’re in the land of Laplace approximation and MCMC and all that jazz; I’m not going to walk through it here, because this comment is already really long.

So one last thing to wrap it up. I wrote all that out because it’s a great example problem of how to Bayes, but there’s still a big problem at the model level: the lifetime of the millionth-worst component is probably driven by qualitatively different processes than the vast majority of other components. If some weird thing happens one time in 10,000, and causes problems in the components, then a best-fit model of the whole dataset probably won’t pick it up at all. Nice-looking distributions like Weibull or log-normal just aren’t good at modelling two qualitatively different behaviors mixed into the same dataset. There’s probably some standard formal way of dealing with this kind of thing (I hear that “Rare Event Modelling” is a thing, although I know nothing about it), but the fundamental problem is just getting any relevant data at all. If we only have a hundred thousand data points, and we think that millionth-worst is driven by qualitatively different processes, then we have zero data on the millionth-worst, full stop. On the other hand, if we have a billion data points, then we can just throw out all but the worst few thousand and analyse only those.

• Thanks John, that’s just the kind of thing I was hoping for.

The point about behaviour for the very worst components not following the same distribution as the population which I’m studying is a very good one.

I should have given a little more background to the task.

Firstly, the task is to compare a new component with an old one. Data is available for the old one and data is being collected for the new one. The new one has a much higher average life but also much higher variance between samples, which led to the question as to which will have the worst one-in-a-million performance.

Secondly, the number of data points is very small. Time cost per data point is ~1 week. I’ll be lucky if I have 10 data points in total. Essentially, I’m tasked to try to approve the component with the fewest possible data points.

(This means that so far I don’t have a favoured model (although Weibull is lower than the other 2) and I also don’t have a firm impression of μ and σ.)

I guess all I can achieve with limited data is to say that, for components which fit the typical failure mechanism, the worst one-in-a-million is better for the new component than the old. However, we don’t have any information on Black Swan events caused by a different failure mechanism for either component, and I definitely need to make this clear in any presentation I make on the subject. With such a lack of data points I don’t think I can use rare event modelling to fill in the gaps.

I will have a look at the ideas about:

1. Taking multiple random samples of a million to get a better estimate of the CI. I’m pretty sure this is comfortably within my limited programming skills.

(My limited programming skills are the main reason that my method is a bit long-winded: doing it step by step was my way of trying to stop myself making a mistake which I didn’t notice. I did make a mistake with my application of the scale parameter for Birnbaum-Saunders at one point, which this method helped me spot.)

2. Looking at how adding extra data points affects my result. I think I can use this to compare value of information with cost of information to get a good impression of when we have enough.

• Having ~ten data points makes this way more interesting. That’s exactly the kind of problem that I specialize in.

For the log-normal distribution, it should be possible to do the integral for P(data | model) explicitly. The integral is tractable for the normal distribution (it comes out proportional to a power of the sample variance), so just log-transform the data and use that. If you write down the integral for normal-distributed data explicitly and plug it into Wolfram Alpha or something, it should be able to handle it. That would circumvent needing to sample μ and σ.

I don’t know if there’s a corresponding closed form for Birnbaum-Saunders; I had never even heard of it before this. The problem is still sufficiently low-dimensional that it would definitely be tractable computationally, but it would probably be a fair bit of work to code.

• (I thoroughly enjoy/enjoyed this exchange and think I learned some things from it, and wanted to thank you for having this out in the open)

• Birnbaum-Saunders is an interesting one. For the purposes of fatigue analysis, the assumptions which bring about the three models are:

Weibull: numerous failure modes (of similar failure speed) racing to see which causes the component to fail first

Log-normal: damage per cycle is proportional to current damage

Birnbaum-Saunders: damage per cycle is normally distributed and independent of previous damage

My engineering gut says that this component is probably somewhere between log-normal and Birnbaum-Saunders (I think proportionality will decay as damage increases), which is maybe why I don’t have a clear winner yet.

***

I think I understand now where my original reasoning was incorrect when I was calculating the expected worst in a million. I was just calculating the worst in a million for each model and taking a weighted average of the answers. This meant that bad values from the outlier potential pdfs were massively suppressed.

I’ve done some sampling of the worst in a million by repeatedly creating 2 random numbers from 0 to 1. I use the first to select a μ,σ combination based on the posterior for each pair. I use the second random number as a p-value. I then apply the inverse CDF to those values to get an x.

Is this an ok sampling method? I’m not sure if I’m missing something or should be using MCMC. I definitely need to read up on this stuff!

The worst in a million is currently dominated by those occasions where the probability distribution is an outlier. In those cases the p-value doesn’t need to be particularly extreme to achieve a low x.

I think my initial estimates were based either mainly on uncertainty in p or mainly on uncertainty in μ,σ. The sampling method allows me to account for uncertainty in both, which definitely makes more sense. The model seems to react sensibly when I add potential new data, so I think I can assess much better now how many data points I require.
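That two-random-number scheme can be sketched as follows (log-normal case; `grid` is the list of (μ, σ) pairs and `posterior` their normalised probabilities, names illustrative):

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(0)

def sample_x(grid, posterior, size):
    # First draw picks a (mu, sigma) pair according to its posterior
    # probability; second draw is a uniform p-value pushed through the
    # inverse CDF of that distribution.
    idx = rng.choice(len(grid), size=size, p=posterior)
    p = rng.uniform(size=size)
    return np.array([lognorm.ppf(p[k], s=grid[j][1], scale=np.exp(grid[j][0]))
                     for k, j in enumerate(idx)])
```

Note that in SciPy’s parameterisation of `lognorm`, `s` is σ and `scale` is exp(μ).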

• That sampling method sounds like it should work, assuming it’s all implemented correctly (not sure what method you’re using to sample from the posterior distribution of μ, σ).

Worst case in a million being dominated by parameter uncertainty definitely makes sense, given the small sample size and the rate at which those distributions fall off.

For μ,σ I effectively created a quasi-cumulative distribution with the parameter pairs as the x-axis.

μ1,σ1. μ2,σ1. μ3,σ1 … μ1,σ2. μ2,σ2. μ3,σ2 … μn,σm

The random number defines the relevant point on the y-axis. From there I get the corresponding μ,σ pair from the x-axis.

If this method works I’ll probably have to code the whole thing instead of using a spreadsheet, as I don’t have nearly enough μ,σ values to get a good answer currently.
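A minimal sketch of inverting that quasi-cumulative distribution in code (names illustrative):

```python
import numpy as np

def pick_pair(pairs, posterior, u):
    # Cumulative mass over the ordered (mu, sigma) pairs forms the
    # quasi-CDF; a uniform random u selects the first pair whose
    # cumulative mass exceeds u.
    cdf = np.cumsum(posterior)
    return pairs[int(np.searchsorted(cdf, u))]
```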

• At what point is the data used?

• I use it to determine the relative probabilities of each μ,σ pair, which in turn create the pseudo-cdf.

• Ok, that sounds right.

• Can you program? In that case I highly recommend using PyMC or Stan for this kind of work. There’s a pretty rich literature and culture of how to iteratively improve these types of models, and some tool support around these specific toolkits.

• I can (a bit).

I have good programming skills compared to a typical mechanical engineer but poor compared to a typical programmer.

Is there any introductory text on the theory which you would recommend (forgetting about the programming for the moment)? I wouldn’t want to try to use a programming language where I didn’t understand the theory behind what I was asking the program to do.

• The technologies I’m suggesting are just implementations of Bayes, which is what you’re trying to do. There’s some theory as to *how* they do inference (special versions of MCMC, basically), but this is an “implementation detail” to a degree. Here are some references to get you started, though they’re mostly Stan-centered: http://mc-stan.org/users/documentation/external.html . If you want a better overall picture of the theory, I really like this book: https://ben-lambert.com/a-students-guide-to-bayesian-statistics/ . It takes you from basics all the way to Stan usage.