How to Measure Anything

Dou­glas Hub­bard’s How to Mea­sure Any­thing is one of my fa­vorite how-to books. I hope this sum­mary in­spires you to buy the book; it’s worth it.

The book opens:

Any­thing can be mea­sured. If a thing can be ob­served in any way at all, it lends it­self to some type of mea­sure­ment method. No mat­ter how “fuzzy” the mea­sure­ment is, it’s still a mea­sure­ment if it tells you more than you knew be­fore. And those very things most likely to be seen as im­mea­surable are, vir­tu­ally always, solved by rel­a­tively sim­ple mea­sure­ment meth­ods.

The sci­ences have many es­tab­lished mea­sure­ment meth­ods, so Hub­bard’s book fo­cuses on the mea­sure­ment of “busi­ness in­tan­gibles” that are im­por­tant for de­ci­sion-mak­ing but tricky to mea­sure: things like man­age­ment effec­tive­ness, the “flex­i­bil­ity” to cre­ate new prod­ucts, the risk of bankruptcy, and pub­lic image.

Ba­sic Ideas

A mea­sure­ment is an ob­ser­va­tion that quan­ti­ta­tively re­duces un­cer­tainty. Mea­sure­ments might not yield pre­cise, cer­tain judg­ments, but they do re­duce your un­cer­tainty.

To be mea­sured, the ob­ject of mea­sure­ment must be de­scribed clearly, in terms of ob­serv­ables. A good way to clar­ify a vague ob­ject of mea­sure­ment like “IT se­cu­rity” is to ask “What is IT se­cu­rity, and why do you care?” Such prob­ing can re­veal that “IT se­cu­rity” means things like a re­duc­tion in unau­tho­rized in­tru­sions and malware at­tacks, which the IT de­part­ment cares about be­cause these things re­sult in lost pro­duc­tivity, fraud losses, and le­gal li­a­bil­ities.

Uncer­tainty is the lack of cer­tainty: the true out­come/​state/​value is not known.

Risk is a state of un­cer­tainty in which some of the pos­si­bil­ities in­volve a loss.

Much pes­simism about mea­sure­ment comes from a lack of ex­pe­rience mak­ing mea­sure­ments. Hub­bard, who is far more ex­pe­rienced with mea­sure­ment than his read­ers, says:

  1. Your prob­lem is not as unique as you think.

  2. You have more data than you think.

  3. You need less data than you think.

  4. An ad­e­quate amount of new data is more ac­cessible than you think.

Ap­plied In­for­ma­tion Economics

Hub­bard calls his method “Ap­plied In­for­ma­tion Eco­nomics” (AIE). It con­sists of 5 steps:

  1. Define a de­ci­sion prob­lem and the rele­vant vari­ables. (Start with the de­ci­sion you need to make, then figure out which vari­ables would make your de­ci­sion eas­ier if you had bet­ter es­ti­mates of their val­ues.)

  2. Deter­mine what you know. (Quan­tify your un­cer­tainty about those vari­ables in terms of ranges and prob­a­bil­ities.)

  3. Pick a vari­able, and com­pute the value of ad­di­tional in­for­ma­tion for that vari­able. (Re­peat un­til you find a vari­able with rea­son­ably high in­for­ma­tion value. If no re­main­ing vari­ables have enough in­for­ma­tion value to jus­tify the cost of mea­sur­ing them, skip to step 5.)

  4. Ap­ply the rele­vant mea­sure­ment in­stru­ment(s) to the high-in­for­ma­tion-value vari­able. (Then go back to step 3.)

  5. Make a de­ci­sion and act on it. (When you’ve done as much un­cer­tainty re­duc­tion as is eco­nom­i­cally jus­tified, it’s time to act!)

Th­ese steps are elab­o­rated be­low.

Step 1: Define a de­ci­sion prob­lem and the rele­vant variables

Hub­bard illus­trates this step by tel­ling the story of how he helped the Depart­ment of Veter­ans Af­fairs (VA) with a mea­sure­ment prob­lem.

The VA was con­sid­er­ing seven pro­posed IT se­cu­rity pro­jects. They wanted to know “which… of the pro­posed in­vest­ments were jus­tified and, af­ter they were im­ple­mented, whether im­prove­ments in se­cu­rity jus­tified fur­ther in­vest­ment…” Hub­bard asked his stan­dard ques­tions: “What do you mean by ‘IT se­cu­rity’? Why does it mat­ter to you? What are you ob­serv­ing when you ob­serve im­proved IT se­cu­rity?”

It became clear that nobody at the VA had thought about the details of what "IT security" meant to them. But after Hubbard's probing, it became clear that by "IT security" they meant a reduction in the frequency and severity of some undesirable events: agency-wide virus attacks, unauthorized system access (external or internal), unauthorized physical access, and disasters affecting the IT infrastructure (fire, flood, etc.). And each undesirable event was on the list because of specific costs associated with it: productivity losses from virus attacks, legal liability from unauthorized system access, etc.

Now that the VA knew what they meant by “IT se­cu­rity,” they could mea­sure spe­cific vari­ables, such as the num­ber of virus at­tacks per year.

Step 2: Deter­mine what you know

Uncer­tainty and calibration

The next step is to de­ter­mine your level of un­cer­tainty about the vari­ables you want to mea­sure. To do this, you can ex­press a “con­fi­dence in­ter­val” (CI). A 90% CI is a range of val­ues that is 90% likely to con­tain the cor­rect value. For ex­am­ple, the se­cu­rity ex­perts at the VA were 90% con­fi­dent that each agency-wide virus at­tack would af­fect be­tween 25,000 and 65,000 peo­ple.

Unfortunately, few people are well-calibrated estimators. For example, in some studies the true value lay inside subjects' 90% CIs only 50% of the time! These subjects were overconfident. For a well-calibrated estimator, the true value will lie inside her 90% CI roughly 90% of the time.

Luck­ily, “as­sess­ing un­cer­tainty is a gen­eral skill that can be taught with a mea­surable im­prove­ment.”

Hub­bard uses sev­eral meth­ods to cal­ibrate each client’s value es­ti­ma­tors, for ex­am­ple the se­cu­rity ex­perts at the VA who needed to es­ti­mate the fre­quency of se­cu­rity breaches and their likely costs.

His first tech­nique is the equiv­a­lent bet test. Sup­pose you’re asked to give a 90% CI for the year in which New­ton pub­lished the uni­ver­sal laws of grav­i­ta­tion, and you can win $1,000 in one of two ways:

  1. You win $1,000 if the true year of pub­li­ca­tion falls within your 90% CI. Other­wise, you win noth­ing.

  2. You spin a dial di­vided into two “pie slices,” one cov­er­ing 10% of the dial, and the other cov­er­ing 90%. If the dial lands on the small slice, you win noth­ing. If it lands on the big slice, you win $1,000.

If you find yourself preferring option #2, then you must think spinning the dial has a higher chance of winning you $1,000 than option #1. That suggests your stated 90% CI isn't really your 90% CI. Maybe it's your 65% CI or your 80% CI instead. By preferring option #2, your brain is trying to tell you that your originally stated 90% CI is overconfident.

If instead you find yourself preferring option #1, then you must think there is more than a 90% chance your stated 90% CI contains the true value. By preferring option #1, your brain is trying to tell you that your original 90% CI is underconfident.

To make a bet­ter es­ti­mate, ad­just your 90% CI un­til op­tion #1 and op­tion #2 seem equally good to you. Re­search sug­gests that even pre­tend­ing to bet money in this way will im­prove your cal­ibra­tion.

Hub­bard’s sec­ond method for im­prov­ing cal­ibra­tion is sim­ply rep­e­ti­tion and feed­back. Make lots of es­ti­mates and then see how well you did. For this, play CFAR’s Cal­ibra­tion Game.

Hub­bard also asks peo­ple to iden­tify rea­sons why a par­tic­u­lar es­ti­mate might be right, and why it might be wrong.

He also asks peo­ple to look more closely at each bound (up­per and lower) on their es­ti­mated range. A 90% CI “means there is a 5% chance the true value could be greater than the up­per bound, and a 5% chance it could be less than the lower bound. This means the es­ti­ma­tors must be 95% sure that the true value is less than the up­per bound. If they are not that cer­tain, they should in­crease the up­per bound… A similar test is ap­plied to the lower bound.”

Simulations

Once you de­ter­mine what you know about the un­cer­tain­ties in­volved, how can you use that in­for­ma­tion to de­ter­mine what you know about the risks in­volved? Hub­bard sum­ma­rizes:

…all risk in any pro­ject… can be ex­pressed by one method: the ranges of un­cer­tainty on the costs and benefits, and prob­a­bil­ities on events that might af­fect them.

The simplest tool for measuring such risks accurately is the Monte Carlo (MC) simulation, which can be run in Excel and many other programs. To illustrate this tool, suppose you are wondering whether to lease a new machine for one step in your manufacturing process.

The one-year lease [for the ma­chine] is $400,000 with no op­tion for early can­cel­la­tion. So if you aren’t break­ing even, you are still stuck with it for the rest of the year. You are con­sid­er­ing sign­ing the con­tract be­cause you think the more ad­vanced de­vice will save some la­bor and raw ma­te­ri­als and be­cause you think the main­te­nance cost will be lower than the ex­ist­ing pro­cess.

Your pre-cal­ibrated es­ti­ma­tors give their 90% CIs for the fol­low­ing vari­ables:

  • Main­te­nance sav­ings (MS): $10 to $20 per unit

  • La­bor sav­ings (LS): -$2 to $8 per unit

  • Raw ma­te­ri­als sav­ings (RMS): $3 to $9 per unit

  • Pro­duc­tion level (PL): 15,000 to 35,000 units per year

Thus, your an­nual sav­ings will equal (MS + LS + RMS) × PL.

When mea­sur­ing risk, we don’t just want to know the “av­er­age” risk or benefit. We want to know the prob­a­bil­ity of a huge loss, the prob­a­bil­ity of a small loss, the prob­a­bil­ity of a huge sav­ings, and so on. That’s what Monte Carlo can tell us.

An MC simu­la­tion uses a com­puter to ran­domly gen­er­ate thou­sands of pos­si­ble val­ues for each vari­able, based on the ranges we’ve es­ti­mated. The com­puter then calcu­lates the out­come (in this case, the an­nual sav­ings) for each gen­er­ated com­bi­na­tion of val­ues, and we’re able to see how of­ten differ­ent kinds of out­comes oc­cur.

To run an MC simu­la­tion we need not just the 90% CI for each vari­able but also the shape of each dis­tri­bu­tion. In many cases, the nor­mal dis­tri­bu­tion will work just fine, and we’ll use it for all the vari­ables in this sim­plified illus­tra­tion. (Hub­bard’s book shows you how to work with other dis­tri­bu­tions).

To make an MC simu­la­tion of a nor­mally dis­tributed vari­able in Ex­cel, we use this for­mula:

=norminv(rand(), mean, standard deviation)

So the formula for the maintenance savings variable should be:

=norminv(rand(), 15, (20-10)/3.29)

Here 15 is the midpoint of the 90% CI, and the standard deviation is the width of the CI divided by 3.29, because a 90% interval of a normal distribution spans about 3.29 standard deviations (plus or minus 1.645).

Sup­pose you en­ter this for­mula on cell A1 in Ex­cel. To gen­er­ate (say) 10,000 val­ues for the main­te­nance sav­ings value, just (1) copy the con­tents of cell A1, (2) en­ter “A1:A10000” in the cell range field to se­lect cells A1 through A10000, and (3) paste the for­mula into all those cells.

Now we can follow this process in other columns for the other variables, including a column for the "total savings" formula. To see how many scenarios fell short of the $400,000 break-even point, use Excel's countif function. In this case, you should find that about 14% of the scenarios resulted in savings of less than $400,000, i.e. a net loss.
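If you'd rather script this than use Excel, here's a minimal Python sketch of the same simulation (my own, not Hubbard's); it assumes independent, normally distributed variables, just as in the Excel example:

    import numpy as np

    rng = np.random.default_rng()
    n = 10_000

    def sample_from_ci(lower, upper, size):
        # Treat the 90% CI as a normal distribution: mean at the midpoint,
        # standard deviation = (upper - lower) / 3.29
        return rng.normal((lower + upper) / 2, (upper - lower) / 3.29, size)

    ms  = sample_from_ci(10, 20, n)           # maintenance savings, $ per unit
    ls  = sample_from_ci(-2, 8, n)            # labor savings, $ per unit
    rms = sample_from_ci(3, 9, n)             # raw materials savings, $ per unit
    pl  = sample_from_ci(15_000, 35_000, n)   # production level, units per year

    savings = (ms + ls + rms) * pl            # annual savings for each scenario
    print("Chance of not breaking even:", (savings < 400_000).mean())  # roughly 0.14

A histogram of the savings array then gives the same risk/return picture Hubbard describes next.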

We can also make a histogram showing how many of the 10,000 scenarios landed in each $100,000 increment of total savings. This is even more informative, and tells us a great deal about the distribution of risk and benefits we might incur from investing in the new machine. (The full spreadsheet for this example can be downloaded from Hubbard's website.)

The simu­la­tion con­cept can (and in high-value cases should) be car­ried be­yond this sim­ple MC simu­la­tion. The first step is to learn how to use a greater va­ri­ety of dis­tri­bu­tions in MC simu­la­tions. The sec­ond step is to deal with cor­re­lated (rather than in­de­pen­dent) vari­ables by gen­er­at­ing cor­re­lated ran­dom num­bers or by mod­el­ing what the vari­ables have in com­mon.

A more com­pli­cated step is to use a Markov simu­la­tion, in which the simu­lated sce­nario is di­vided into many time in­ter­vals. This is of­ten used to model stock prices, the weather, and com­plex man­u­fac­tur­ing or con­struc­tion pro­jects. Another more com­pli­cated step is to use an agent-based model, in which in­de­pen­dently-act­ing agents are simu­lated. This method is of­ten used for traf­fic simu­la­tions, in which each ve­hi­cle is mod­eled as an agent.

Step 3: Pick a vari­able, and com­pute the value of ad­di­tional in­for­ma­tion for that variable

In­for­ma­tion can have three kinds of value:

  1. In­for­ma­tion can af­fect peo­ple’s be­hav­ior (e.g. com­mon knowl­edge of germs af­fects san­i­ta­tion be­hav­ior).

  2. In­for­ma­tion can have its own mar­ket value (e.g. you can sell a book with use­ful in­for­ma­tion).

  3. In­for­ma­tion can re­duce un­cer­tainty about im­por­tant de­ci­sions. (This is what we’re fo­cus­ing on here.)

When you're uncertain about a decision, there's a chance you'll make a non-optimal choice. The cost of a "wrong" decision is the difference in value between the wrong choice and the choice you would have made with perfect information. Acquiring perfect information is usually too costly, so instead we'd like to know which decision-relevant variables are the most valuable to measure more precisely, so we can decide which measurements to make.

Here’s a sim­ple ex­am­ple:

Sup­pose you could make $40 mil­lion profit if [an ad­ver­tise­ment] works and lose $5 mil­lion (the cost of the cam­paign) if it fails. Then sup­pose your cal­ibrated ex­perts say they would put a 40% chance of failure on the cam­paign.

The ex­pected op­por­tu­nity loss (EOL) for a choice is the prob­a­bil­ity of the choice be­ing “wrong” times the cost of it be­ing wrong. So for ex­am­ple the EOL if the cam­paign is ap­proved is $5M × 40% = $2M, and the EOL if the cam­paign is re­jected is $40M × 60% = $24M.

The differ­ence be­tween EOL be­fore and af­ter a mea­sure­ment is called the “ex­pected value of in­for­ma­tion” (EVI).
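As a quick check, here's the arithmetic from the example above in a few lines of Python (a sketch, not Hubbard's spreadsheet); in this binary case, perfect information would drive the remaining EOL to zero, so the EVI of perfect information is just the EOL of the option you'd otherwise choose:

    p_fail = 0.40
    profit_if_works = 40e6     # $40M profit if the ad campaign works
    loss_if_fails = 5e6        # $5M campaign cost lost if it fails

    eol_approve = loss_if_fails * p_fail           # approve, but it fails: $2M
    eol_reject  = profit_if_works * (1 - p_fail)   # reject, but it would have worked: $24M

    # Perfect information lets you always pick the right option (EOL after = 0),
    # so the expected value of perfect information = EOL of the chosen option.
    print(min(eol_approve, eol_reject))            # 2,000,000.0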

In most cases, we want to com­pute the VoI for a range of val­ues rather than a bi­nary suc­ceed/​fail. So let’s tweak the ad­ver­tis­ing cam­paign ex­am­ple and say that a cal­ibrated mar­ket­ing ex­pert’s 90% CI for sales re­sult­ing from the cam­paign was from 100,000 units to 1 mil­lion units. The risk is that we don’t sell enough units from this cam­paign to break even.

Sup­pose we profit by $25 per unit sold, so we’d have to sell at least 200,000 units from the cam­paign to break even (on a $5M cam­paign). To be­gin, let’s calcu­late the ex­pected value of perfect in­for­ma­tion (EVPI), which will provide an up­per bound on how much we should spend to re­duce our un­cer­tainty about how many units will be sold as a re­sult of the cam­paign. Here’s how we com­pute it:

  1. Slice the dis­tri­bu­tion of our vari­able into thou­sands of small seg­ments.

  2. Compute the EOL for each segment: the opportunity loss we'd incur if the segment's midpoint were the true value, times the probability of that segment.

  3. Sum the prod­ucts from step 2 for all seg­ments.

Of course, we’ll do this with a com­puter. For the de­tails, see Hub­bard’s book and the Value of In­for­ma­tion spread­sheet from his web­site.

In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to re­duce our un­cer­tainty about how many units will be sold as a re­sult of the cam­paign.
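Here's a minimal sketch of that slice-and-sum computation, assuming (as a simplification) a normal distribution over units sold; scipy is my own choice of tool, and the exact answer depends on the distribution and slicing used, so expect a number in the same ballpark as Hubbard's ~$337,000 rather than a perfect match:

    import numpy as np
    from scipy.stats import norm

    lower, upper = 100_000, 1_000_000            # calibrated 90% CI for units sold
    mean, sd = (lower + upper) / 2, (upper - lower) / 3.29

    profit_per_unit = 25
    break_even_units = 200_000                   # needed to recover the $5M campaign cost

    # Step 1: slice the distribution into thousands of small segments
    edges = np.linspace(mean - 5 * sd, mean + 5 * sd, 100_001)
    midpoints = (edges[:-1] + edges[1:]) / 2
    probabilities = np.diff(norm.cdf(edges, mean, sd))

    # Step 2: opportunity loss for each segment (only shortfalls below break-even cost us)
    opportunity_loss = np.maximum(break_even_units - midpoints, 0) * profit_per_unit

    # Step 3: EVPI = sum over segments of (opportunity loss x segment probability)
    evpi = (opportunity_loss * probabilities).sum()
    print(f"EVPI = ${evpi:,.0f}")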

And in fact, we should prob­a­bly spend much less than $337,000, be­cause no mea­sure­ment we make will give us perfect in­for­ma­tion. For more de­tails on how to mea­sure the value of im­perfect in­for­ma­tion, see Hub­bard’s book and these three LessWrong posts: (1) VoI: 8 Ex­am­ples, (2) VoI: Four Ex­am­ples, and (3) 5-sec­ond level case study: VoI.

I do, how­ever, want to quote Hub­bard’s com­ments about the “mea­sure­ment in­ver­sion”:

By 1999, I had completed the… Applied Information Economics analysis on about 20 major [IT] investments… Each of these business cases had 40 to 80 variables, such as initial development costs, adoption rate, productivity improvement, revenue growth, and so on. For each of these business cases, I ran a macro in Excel that computed the information value for each variable… [and] I began to see this pattern:

  • The vast majority of variables had an information value of zero…

  • The variables that had high information values were routinely those that the client had never measured…

  • The variables that clients [spent] the most time measuring were usually those with a very low (even zero) information value…

…since then, I've applied this same test to another 40 projects, and… [I've] noticed the same phenomena arise in projects relating to research and development, military logistics, the environment, venture capital, and facilities expansion.

Hub­bard calls this the “Mea­sure­ment In­ver­sion”:

In a busi­ness case, the eco­nomic value of mea­sur­ing a vari­able is usu­ally in­versely pro­por­tional to how much mea­sure­ment at­ten­tion it usu­ally gets.

Here is one ex­am­ple:

A stark illus­tra­tion of the Mea­sure­ment In­ver­sion for IT pro­jects can be seen in a large UK-based in­surance client of mine that was an avid user of a soft­ware com­plex­ity mea­sure­ment method called “func­tion points.” This method was pop­u­lar in the 1980s and 1990s as a ba­sis of es­ti­mat­ing the effort for large soft­ware de­vel­op­ment efforts. This or­ga­ni­za­tion had done a very good job of track­ing ini­tial es­ti­mates, func­tion point es­ti­mates, and ac­tual effort ex­pended for over 300 IT pro­jects. The es­ti­ma­tion re­quired three or four full-time per­sons as “cer­tified” func­tion point coun­ters…

But a very in­ter­est­ing pat­tern arose when I com­pared the func­tion point es­ti­mates to the ini­tial es­ti­mates pro­vided by pro­ject man­agers… The costly, time-in­ten­sive func­tion point count­ing did change the ini­tial es­ti­mate but, on av­er­age, it was no closer to the ac­tual pro­ject effort than the ini­tial effort… Not only was this the sin­gle largest mea­sure­ment effort in the IT or­ga­ni­za­tion, it liter­ally added no value since it didn’t re­duce un­cer­tainty at all. Cer­tainly, more em­pha­sis on mea­sur­ing the benefits of the pro­posed pro­jects – or al­most any­thing else – would have been bet­ter money spent.

Hence the im­por­tance of calcu­lat­ing EVI.

Step 4: Ap­ply the rele­vant mea­sure­ment in­stru­ment(s) to the high-in­for­ma­tion-value variable

If you fol­lowed the first three steps, then you’ve defined a vari­able you want to mea­sure in terms of the de­ci­sion it af­fects and how you ob­serve it, you’ve quan­tified your un­cer­tainty about it, and you’ve calcu­lated the value of gain­ing ad­di­tional in­for­ma­tion about it. Now it’s time to re­duce your un­cer­tainty about the vari­able – that is, to mea­sure it.

Each sci­en­tific dis­ci­pline has its own spe­cial­ized mea­sure­ment meth­ods. Hub­bard’s book de­scribes mea­sure­ment meth­ods that are of­ten use­ful for re­duc­ing our un­cer­tainty about the “softer” top­ics of­ten en­coun­tered by de­ci­sion-mak­ers in busi­ness.

Select­ing a mea­sure­ment method

To figure out which cat­e­gory of mea­sure­ment meth­ods are ap­pro­pri­ate for a par­tic­u­lar case, we must ask sev­eral ques­tions:

  1. De­com­po­si­tion: Which parts of the thing are we un­cer­tain about?

  2. Se­condary re­search: How has the thing (or its parts) been mea­sured by oth­ers?

  3. Ob­ser­va­tion: How do the iden­ti­fied ob­serv­ables lend them­selves to mea­sure­ment?

  4. Mea­sure just enough: How much do we need to mea­sure it?

  5. Con­sider the er­ror: How might our ob­ser­va­tions be mis­lead­ing?

Decomposition

Some­times you’ll want to start by de­com­pos­ing an un­cer­tain vari­able into sev­eral parts to iden­tify which ob­serv­ables you can most eas­ily mea­sure. For ex­am­ple, rather than di­rectly es­ti­mat­ing the cost of a large con­struc­tion pro­ject, you could break it into parts and es­ti­mate the cost of each part of the pro­ject.

In Hubbard's experience, decomposition itself, even without making any new measurements, often reduces one's uncertainty about the variable of interest.

Se­condary research

Don’t rein­vent the world. In al­most all cases, some­one has already in­vented the mea­sure­ment tool you need, and you just need to find it. Here are Hub­bard’s tips on sec­ondary re­search:

  1. If you’re new to a topic, start with Wikipe­dia rather than Google. Wikipe­dia will give you a more or­ga­nized per­spec­tive on the topic at hand.

  2. Use search terms of­ten as­so­ci­ated with quan­ti­ta­tive data. E.g. don’t just search for “soft­ware qual­ity” or “cus­tomer per­cep­tion” – add terms like “table,” “sur­vey,” “con­trol group,” and “stan­dard de­vi­a­tion.”

  3. Think of in­ter­net re­search in two lev­els: gen­eral search en­g­ines and topic-spe­cific repos­i­to­ries (e.g. the CIA World Fact Book).

  4. Try mul­ti­ple search en­g­ines.

  5. If you find marginally related research that doesn't directly address your topic of interest, check the bibliography for more relevant reading material.

I’d also recom­mend my post Schol­ar­ship: How to Do It Effi­ciently.

Observation

If you’re not sure how to mea­sure your tar­get vari­able’s ob­serv­ables, ask these ques­tions:

  1. Does it leave a trail? Ex­am­ple: longer waits on cus­tomer sup­port lines cause cus­tomers to hang up and not call back. Maybe you can also find a cor­re­la­tion be­tween cus­tomers who hang up af­ter long waits and re­duced sales to those cus­tomers.

  2. Can you ob­serve it di­rectly? Maybe you haven’t been track­ing how many of the cus­tomers in your park­ing lot show an out-of-state li­cense, but you could start. Or at least, you can ob­serve a sam­ple of these data.

  3. Can you cre­ate a way to ob­serve it in­di­rectly? Ama­zon.com added a gift-wrap­ping fea­ture in part so they could bet­ter track how many books were be­ing pur­chased as gifts. Another ex­am­ple is when con­sumers are given coupons so that re­tailers can see which news­pa­pers their cus­tomers read.

  4. Can the thing be forced to oc­cur un­der new con­di­tions which al­low you to ob­serve it more eas­ily? E.g. you could im­ple­ment a pro­posed re­turned-items policy in some stores but not oth­ers and com­pare the out­comes.

Mea­sure just enough

Be­cause ini­tial mea­sure­ments of­ten tell you quite a lot, and also change the value of con­tinued mea­sure­ment, Hub­bard of­ten aims for spend­ing 10% of the EVPI on a mea­sure­ment, and some­times as lit­tle as 2% (es­pe­cially for very large pro­jects).

Con­sider the error

It’s im­por­tant to be con­scious of some com­mon ways in which mea­sure­ments can mis­lead.

Scien­tists dis­t­in­guish two types of mea­sure­ment er­ror: sys­temic and ran­dom. Ran­dom er­rors are ran­dom vari­a­tions from one ob­ser­va­tion to the next. They can’t be in­di­vi­d­u­ally pre­dicted, but they fall into pat­terns that can be ac­counted for with the laws of prob­a­bil­ity. Sys­temic er­rors, in con­trast, are con­sis­tent. For ex­am­ple, the sales staff may rou­tinely over­es­ti­mate the next quar­ter’s rev­enue by 50% (on av­er­age).

We must also dis­t­in­guish pre­ci­sion and ac­cu­racy. A “pre­cise” mea­sure­ment tool has low ran­dom er­ror. E.g. if a bath­room scale gives the ex­act same dis­played weight ev­ery time we set a par­tic­u­lar book on it, then the scale has high pre­ci­sion. An “ac­cu­rate” mea­sure­ment tool has low sys­temic er­ror. The bath­room scale, while pre­cise, might be in­ac­cu­rate if the weight dis­played is sys­tem­i­cally bi­ased in one di­rec­tion – say, eight pounds too heavy. A mea­sure­ment tool can also have low pre­ci­sion but good ac­cu­racy, if it gives in­con­sis­tent mea­sure­ments but they av­er­age to the true value.

Ran­dom er­ror tends to be eas­ier to han­dle. Con­sider this ex­am­ple:

For ex­am­ple, to de­ter­mine how much time sales reps spend in meet­ings with clients ver­sus other ad­minis­tra­tive tasks, they might choose a com­plete re­view of all time sheets… [But] if a com­plete re­view of 5,000 time sheets… tells us that sales reps spend 34% of their time in di­rect com­mu­ni­ca­tion with cus­tomers, we still don’t know how far from the truth it might be. Still, this “ex­act” num­ber seems re­as­sur­ing to many man­agers. Now, sup­pose a sam­ple of di­rect ob­ser­va­tions of ran­domly cho­sen sales reps at ran­dom points in time finds that sales reps were in client meet­ings or on client phone calls only 13 out of 100 of those in­stances. (We can com­pute this with­out in­ter­rupt­ing a meet­ing by ask­ing as soon as the rep is available.) As we will see [later], in the lat­ter case, we can statis­ti­cally com­pute a 90% CI to be 7.5% to 18.5%. Even though this ran­dom sam­pling ap­proach gives us only a range, we should pre­fer its find­ings to the cen­sus au­dit of time sheets. The cen­sus… gives us an ex­act num­ber, but we have no way to know by how much and in which di­rec­tion the time sheets err.

Sys­temic er­ror is also called a “bias.” Based on his ex­pe­rience, Hub­bard sus­pects the three most im­por­tant to avoid are:

  1. Con­fir­ma­tion bias: peo­ple see what they want to see.

  2. Selec­tion bias: your sam­ple might not be rep­re­sen­ta­tive of the group you’re try­ing to mea­sure.

  3. Ob­server bias: the very act of ob­ser­va­tion can af­fect what you ob­serve. E.g. in one study, re­searchers found that worker pro­duc­tivity im­proved no mat­ter what they changed about the work­place. The work­ers seem to have been re­spond­ing merely to the fact that they were be­ing ob­served in some way.

Choose and de­sign the mea­sure­ment instrument

After fol­low­ing the above steps, Hub­bard writes, “the mea­sure­ment in­stru­ment should be al­most com­pletely formed in your mind.” But if you still can’t come up with a way to mea­sure the tar­get vari­able, here are some ad­di­tional tips:

  1. Work through the con­se­quences. If the value is sur­pris­ingly high, or sur­pris­ingly low, what would you ex­pect to see?

  2. Be iter­a­tive. Start with just a few ob­ser­va­tions, and then re­calcu­late the in­for­ma­tion value.

  3. Con­sider mul­ti­ple ap­proaches. Your first mea­sure­ment tool may not work well. Try oth­ers.

  4. What's the really simple question that makes the rest of the measurement moot? For example, first see whether you can detect any change in research quality at all before trying to measure it more comprehensively.

Sam­pling reality

In most cases, we’ll es­ti­mate the val­ues in a pop­u­la­tion by mea­sur­ing the val­ues in a small sam­ple from that pop­u­la­tion. And for rea­sons dis­cussed in chap­ter 7, a very small sam­ple can of­ten offer large re­duc­tions in un­cer­tainty.

There are a va­ri­ety of tools we can use to build our es­ti­mates from small sam­ples, and which one we should use of­ten de­pends on how out­liers are dis­tributed in the pop­u­la­tion. In some cases, out­liers are very close to the mean, and thus our es­ti­mate of the mean can con­verge quickly on the true mean as we look at new sam­ples. In other cases, out­liers can be sev­eral or­ders of mag­ni­tude away from the mean, and our es­ti­mate con­verges very slowly or not at all. Here are some ex­am­ples:

  • Very quick con­ver­gence, only 1–2 sam­ples needed: choles­terol level of your blood, pu­rity of pub­lic wa­ter sup­ply, weight of jelly beans.

  • Usually quick convergence, 5–30 samples needed: percentage of customers who like the new product, failure loads of bricks, age of your customers, how many movies people see in a year.

  • Po­ten­tially slow con­ver­gence: Soft­ware pro­ject cost over­runs, fac­tory down­time due to an ac­ci­dent.

  • Maybe non-con­ver­gent: Mar­ket value of cor­po­ra­tions, in­di­vi­d­ual lev­els of in­come, ca­su­alties of wars, size of vol­canic erup­tions.

Below, I sur­vey just a few of the many sam­pling meth­ods Hub­bard cov­ers in his book.

Math­less estimation

When work­ing with a quickly con­verg­ing phe­nomenon and a sym­met­ric dis­tri­bu­tion (uniform, nor­mal, camel-back, or bow-tie) for the pop­u­la­tion, you can use the t-statis­tic to de­velop a 90% CI even when work­ing with very small sam­ples. (See the book for in­struc­tions.)
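For example, here's a small-sample 90% CI computed with the t-statistic in Python (the five data points are hypothetical, and this assumes the roughly symmetric, quickly converging case Hubbard describes):

    import numpy as np
    from scipy import stats

    sample = np.array([24, 31, 19, 27, 35])   # hypothetical values from 5 random samples
    n, mean = len(sample), sample.mean()
    sd = sample.std(ddof=1)                   # sample standard deviation

    t_crit = stats.t.ppf(0.95, df=n - 1)      # two-sided 90% interval
    half_width = t_crit * sd / np.sqrt(n)
    print(f"90% CI for the mean: {mean - half_width:.1f} to {mean + half_width:.1f}")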

Or, even easier, make use of the Rule of Five: "There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population."

The Rule of Five has an­other ad­van­tage over the t-statis­tic: it works for any dis­tri­bu­tion of val­ues in the pop­u­la­tion, in­clud­ing ones with slow con­ver­gence or no con­ver­gence at all! It can do this be­cause it gives us a con­fi­dence in­ter­val for the me­dian rather than the mean, and it’s the mean that is far more af­fected by out­liers.

Hubbard calls this a "mathless" estimation technique because it doesn't require us to take square roots or calculate standard deviations or anything like that. Moreover, this mathless technique extends beyond the Rule of Five: if we sample 8 items, there is a 99.2% chance that the median of the population falls between the smallest and largest values. If we take the 2nd largest and smallest values (out of 8 total values), we get something close to a 90% CI for the median. Hubbard generalizes the tool with a handy reference table in the book.

And if the dis­tri­bu­tion is sym­met­ri­cal, then the math­less table gives us a 90% CI for the mean as well as for the me­dian.
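A quick way to convince yourself of the Rule of Five (and its 8-sample extension) is to check the binomial arithmetic and simulate it. The heavy-tailed Cauchy distribution below is my own choice, picked to show that the rule doesn't care about the population's shape:

    import numpy as np

    rng = np.random.default_rng(0)

    def chance_median_inside(n, trials=100_000):
        # Draw n samples from a population whose true median is 0 and check how
        # often the sample's min..max range straddles that median.
        samples = rng.standard_cauchy((trials, n))
        return ((samples.min(axis=1) < 0) & (samples.max(axis=1) > 0)).mean()

    for n in (5, 8):
        analytic = 1 - 2 * 0.5 ** n           # miss only if all n land on one side
        print(n, analytic, chance_median_inside(n))
    # n = 5: 0.9375 (the Rule of Five);  n = 8: ~0.992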

Catch-recatch

How does a biologist measure the number of fish in a lake? She catches and tags a sample of fish – say, 1000 of them – and then releases them. After the fish have had time to spread amongst the rest of the population, she'll catch another sample of fish. Suppose she caught 1000 fish again, and 50 of them were tagged. This would mean 5% of the fish were tagged, and thus that there were about 20,000 fish in the entire lake. (See Hubbard's book for the details on how to calculate the 90% CI.)
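The point estimate in that example is just the standard mark-recapture arithmetic (the Lincoln-Petersen estimator); the 90% CI needs the extra steps in the book:

    tagged_released = 1_000     # fish tagged in the first catch
    second_catch = 1_000        # fish caught the second time
    tagged_recaptured = 50      # tagged fish found in the second catch

    # Assume the tagged fraction of the second catch matches the tagged fraction of the lake
    estimated_population = tagged_released * second_catch / tagged_recaptured
    print(estimated_population)  # 20,000.0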

Spot sampling

The fish ex­am­ple was a spe­cial case of a com­mon prob­lem: pop­u­la­tion pro­por­tion sam­pling. Often, we want to know what pro­por­tion of a pop­u­la­tion has a par­tic­u­lar trait. How many reg­istered vot­ers in Cal­ifor­nia are Democrats? What per­centage of your cus­tomers pre­fer a new product de­sign over the old one?

Hub­bard’s book dis­cusses how to solve the gen­eral prob­lem, but for now let’s just con­sider an­other spe­cial case: spot sam­pling.

In spot sam­pling, you take ran­dom snap­shots of things rather than track­ing them con­stantly. What pro­por­tion of their work hours do em­ploy­ees spend on Face­book? To an­swer this, you “ran­domly sam­ple peo­ple through the day to see what they were do­ing at that mo­ment. If you find that in 12 in­stances out of 100 ran­dom sam­ples” em­ploy­ees were on Face­book, you can guess they spend about 12% of their time on Face­book (the 90% CI is 8% to 18%).
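Hubbard's book has its own way of handling population proportions; as a rough stand-in, a Wilson score interval (my choice, not necessarily the book's method) reproduces approximately the same range:

    import numpy as np
    from scipy.stats import norm

    hits, n = 12, 100                       # spot checks where the employee was on Facebook
    p_hat = hits / n
    z = norm.ppf(0.95)                      # two-sided 90% interval

    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    print(f"90% CI: {center - half:.0%} to {center + half:.0%}")   # roughly 8% to 18%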

Clus­tered sampling

Hub­bard writes:

“Clus­tered sam­pling” is defined as tak­ing a ran­dom sam­ple of groups, then con­duct­ing a cen­sus or a more con­cen­trated sam­pling within the group. For ex­am­ple, if you want to see what share of house­holds has satel­lite dishes… it might be cost effec­tive to ran­domly choose sev­eral city blocks, then con­duct a com­plete cen­sus of ev­ery­thing in a block. (Zigzag­ging across town to in­di­vi­d­u­ally se­lected house­holds would be time con­sum­ing.) In such cases, we can’t re­ally con­sider the num­ber of [house­holds] in the groups… to be the num­ber of ran­dom sam­ples. Within a block, house­holds may be very similar… [and there­fore] it might be nec­es­sary to treat the effec­tive num­ber of ran­dom sam­ples as the num­ber of blocks…

Mea­sure to the threshold

For many de­ci­sions, one de­ci­sion is re­quired if a value is above some thresh­old, and an­other de­ci­sion is re­quired if that value is be­low the thresh­old. For such de­ci­sions, you don’t care as much about a mea­sure­ment that re­duces un­cer­tainty in gen­eral as you do about a mea­sure­ment that tells you which de­ci­sion to make based on the thresh­old. Hub­bard gives an ex­am­ple:

Sup­pose you needed to mea­sure the av­er­age amount of time spent by em­ploy­ees in meet­ings that could be con­ducted re­motely… If a meet­ing is among staff mem­bers who com­mu­ni­cate reg­u­larly and for a rel­a­tively rou­tine topic, but some­one has to travel to make the meet­ing, you prob­a­bly can con­duct it re­motely. You start out with your cal­ibrated es­ti­mate that the me­dian em­ployee spends be­tween 3% to 15% trav­el­ing to meet­ings that could be con­ducted re­motely. You de­ter­mine that if this per­centage is ac­tu­ally over 7%, you should make a sig­nifi­cant in­vest­ment in tele meet­ings. The [EVPI] calcu­la­tion shows that it is worth no more than $15,000 to study this. Ac­cord­ing to our rule of thumb for mea­sure­ment costs, we might try to spend about $1,500…

Let's say you sampled 10 employees and… you find that only 1 spends less time in these activities than the 7% threshold. Given this information, what is the chance that the median time spent in such activities is actually below 7%, in which case the investment would not be justified? One "common sense" answer is 1/10, or 10%. Actually… the real chance is much smaller.

Hub­bard shows how to de­rive the real chance in his book. The key point is that “the un­cer­tainty about the thresh­old can fall much faster than the un­cer­tainty about the quan­tity in gen­eral.”
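Hubbard's exact derivation is in the book, but here's a hedged sketch of the intuition: if the median really were at (or below) the 7% threshold, each randomly sampled employee would have at least a 50% chance of falling below it, so seeing only 1 of 10 below the threshold would itself be a big surprise:

    from scipy.stats import binom

    # If the median were exactly at the 7% threshold, each sampled employee falls
    # below it with probability 0.5. Chance of seeing at most 1 of 10 below it:
    print(binom.cdf(1, 10, 0.5))   # about 0.011, far less than the "common sense" 10%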

Re­gres­sion modeling

What if you want to figure out the cause of some­thing that has many pos­si­ble causes? One method is to perform a con­trol­led ex­per­i­ment, and com­pare the out­comes of a test group to a con­trol group. Hub­bard dis­cusses this in his book (and yes, he’s a Bayesian, and a skep­tic of p-value hy­poth­e­sis test­ing). For this sum­mary, I’ll in­stead men­tion an­other method for iso­lat­ing causes: re­gres­sion mod­el­ing. Hub­bard ex­plains:

If we use re­gres­sion mod­el­ing with his­tor­i­cal data, we may not need to con­duct a con­trol­led ex­per­i­ment. Per­haps, for ex­am­ple, it is difficult to tie an IT pro­ject to an in­crease in sales, but we might have lots of data about how some­thing else af­fects sales, such as faster time to mar­ket of new prod­ucts. If we know that faster time to mar­ket is pos­si­ble by au­tomat­ing cer­tain tasks, that this IT in­vest­ment elimi­nates cer­tain tasks, and those tasks are on the crit­i­cal path in the time-to-mar­ket, we can make the con­nec­tion.

Hub­bard’s book ex­plains the ba­sics of lin­ear re­gres­sions, and of course gives the caveat that cor­re­la­tion does not im­ply cau­sa­tion. But, he writes, “you should con­clude that one thing causes an­other only if you have some other good rea­son be­sides the cor­re­la­tion it­self to sus­pect a cause-and-effect re­la­tion­ship.”
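As a minimal illustration of the kind of regression Hubbard has in mind, here's a least-squares fit in Python; the data are entirely synthetic and the assumed relationship between time to market and sales lift is invented purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic historical data (invented for illustration only)
    time_to_market = rng.uniform(3, 12, size=40)                        # months
    sales_lift = 10 - 0.6 * time_to_market + rng.normal(0, 1, size=40)  # $M, assumed relation

    slope, intercept = np.polyfit(time_to_market, sales_lift, deg=1)
    r = np.corrcoef(time_to_market, sales_lift)[0, 1]
    print(f"each extra month of time to market ~ {slope:.2f} $M of sales lift (r = {r:.2f})")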

Bayes

Hub­bard’s 10th chap­ter opens with a tu­to­rial on Bayes’ The­o­rem. For an on­line tu­to­rial, see here.

Hub­bard then zooms out to a big-pic­ture view of mea­sure­ment, and recom­mends the “in­stinc­tive Bayesian ap­proach”:

  1. Start with your cal­ibrated es­ti­mate.

  2. Gather ad­di­tional in­for­ma­tion (pol­ling, read­ing other stud­ies, etc.)

  3. Up­date your cal­ibrated es­ti­mate sub­jec­tively, with­out do­ing any ad­di­tional math.

Hub­bard says a few things in sup­port of this ap­proach. First, he points to some stud­ies (e.g. El-Ga­mal & Grether (1995)) show­ing that peo­ple of­ten rea­son in roughly-Bayesian ways. Next, he says that in his ex­pe­rience, peo­ple be­come bet­ter in­tu­itive Bayesi­ans when they (1) are made aware of the base rate fal­lacy, and when they (2) are bet­ter cal­ibrated.

Hub­bard says that once these con­di­tions are met,

[then] hu­mans seem to be mostly log­i­cal when in­cor­po­rat­ing new in­for­ma­tion into their es­ti­mates along with the old in­for­ma­tion. This fact is ex­tremely use­ful be­cause a hu­man can con­sider qual­i­ta­tive in­for­ma­tion that does not fit in stan­dard statis­tics. For ex­am­ple, if you were giv­ing a fore­cast for how a new policy might change “pub­lic image” – mea­sured in part by a re­duc­tion in cus­tomer com­plaints, in­creased rev­enue, and the like – a cal­ibrated ex­pert should be able to up­date cur­rent knowl­edge with “qual­i­ta­tive” in­for­ma­tion about how the policy worked for other com­pa­nies, feed­back from fo­cus groups, and similar de­tails. Even with sam­pling in­for­ma­tion, the cal­ibrated es­ti­ma­tor – who has a Bayesian in­stinct – can con­sider qual­i­ta­tive in­for­ma­tion on sam­ples that most text­books don’t cover.

He also offers a chart in the book showing how a pure Bayesian estimator compares to other estimators.

Also, Bayes’ The­o­rem al­lows us to perform a “Bayesian in­ver­sion”:

Given a par­tic­u­lar ob­ser­va­tion, it may seem more ob­vi­ous to frame a mea­sure­ment by ask­ing the ques­tion “What can I con­clude from this ob­ser­va­tion?” or, in prob­a­bil­is­tic terms, “What is the prob­a­bil­ity X is true, given my ob­ser­va­tion?” But Bayes showed us that we could, in­stead, start with the ques­tion, “What is the prob­a­bil­ity of this ob­ser­va­tion if X were true?”

The sec­ond form of the ques­tion is use­ful be­cause the an­swer is of­ten more straight­for­ward and it leads to the an­swer to the other ques­tion. It also forces us to think about the like­li­hood of differ­ent ob­ser­va­tions given a par­tic­u­lar hy­poth­e­sis and what that means for in­ter­pret­ing an ob­ser­va­tion.

[For ex­am­ple] if, hy­po­thet­i­cally, we know that only 20% of the pop­u­la­tion will con­tinue to shop at our store, then we can de­ter­mine the chance [that] ex­actly 15 out of 20 would say so… [The de­tails are ex­plained in the book.] Then we can in­vert the prob­lem with Bayes’ the­o­rem to com­pute the chance that only 20% of the pop­u­la­tion will con­tinue to shop there given [that] 15 out of 20 said so in a ran­dom sam­ple. We would find that chance to be very nearly zero…
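Here's a hedged sketch of that inversion in Python, using a discrete grid of candidate proportions and a uniform prior (my simplification; the book works the numbers differently):

    import numpy as np
    from scipy.stats import binom

    # "What is the probability of this observation if X were true?"
    # If only 20% of the population would keep shopping with us, the chance that
    # exactly 15 of 20 randomly sampled customers say they will is tiny:
    print(binom.pmf(15, 20, 0.20))            # on the order of 1e-7

    # Invert with Bayes: put a uniform prior over candidate proportions and ask
    # how likely a proportion <= 20% is, given that 15 of 20 said yes.
    p = np.linspace(0.01, 0.99, 99)
    posterior = binom.pmf(15, 20, p)          # uniform prior, so posterior is
    posterior /= posterior.sum()              # proportional to the likelihood
    print(posterior[p <= 0.20].sum())         # very nearly zero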

Other methods

Other chap­ters dis­cuss other mea­sure­ment meth­ods, for ex­am­ple pre­dic­tion mar­kets, Rasch mod­els, meth­ods for mea­sur­ing prefer­ences and hap­piness, meth­ods for im­prov­ing the sub­jec­tive judg­ments of ex­perts, and many oth­ers.

Step 5: Make a de­ci­sion and act on it

The last step will make more sense if we first "bring the pieces together." Hubbard organizes his consulting work with a firm into a preparatory phase (Phase 0) followed by three main phases, so let's review what we've learned in the context of those phases.

Phase 0: Pro­ject Preparation

  • Initial research: Interviews and secondary research to get familiar with the nature of the decision problem.

  • Ex­pert iden­ti­fi­ca­tion: Usu­ally 4–5 ex­perts who provide es­ti­mates.

Phase 1: De­ci­sion Modeling

  • De­ci­sion prob­lem defi­ni­tion: Ex­perts define the prob­lem they’re try­ing to an­a­lyze.

  • De­ci­sion model de­tail: Us­ing an Ex­cel spread­sheet, the AIE an­a­lyst elic­its from the ex­perts all the fac­tors that mat­ter for the de­ci­sion be­ing an­a­lyzed: costs and benefits, ROI, etc.

  • Ini­tial cal­ibrated es­ti­mates: First, the ex­perts un­dergo cal­ibra­tion train­ing. Then, they fill in the val­ues (as 90% CIs or other prob­a­bil­ity dis­tri­bu­tions) for the vari­ables in the de­ci­sion model.

Phase 2: Op­ti­mal measurements

  • Value of in­for­ma­tion anal­y­sis: Us­ing Ex­cel macros, the AIE an­a­lyst runs a value of in­for­ma­tion anal­y­sis on ev­ery vari­able in the model.

  • Pre­limi­nary mea­sure­ment method de­signs: Fo­cus­ing on the few vari­ables with high­est in­for­ma­tion value, the AIE an­a­lyst chooses mea­sure­ment meth­ods that should re­duce un­cer­tainty.

  • Mea­sure­ment meth­ods: De­com­po­si­tion, ran­dom sam­pling, Bayesian in­ver­sion, con­trol­led ex­per­i­ments, and other meth­ods are used (as ap­pro­pri­ate) to re­duce the un­cer­tainty of the high-VoI vari­ables.

  • Up­dated de­ci­sion model: The AIE an­a­lyst up­dates the de­ci­sion model based on the re­sults of the mea­sure­ments.

  • Fi­nal value of in­for­ma­tion anal­y­sis: The AIE an­a­lyst runs a VoI anal­y­sis on each vari­able again. As long as this anal­y­sis shows in­for­ma­tion value much greater than the cost of mea­sure­ment for some vari­ables, mea­sure­ment and VoI anal­y­sis con­tinues in mul­ti­ple iter­a­tions. Usu­ally, though, only one or two iter­a­tions are needed be­fore the VoI anal­y­sis shows that no fur­ther mea­sure­ments are jus­tified.

Phase 3: De­ci­sion op­ti­miza­tion and the fi­nal recommendation

  • Com­pleted risk/​re­turn anal­y­sis: A fi­nal MC simu­la­tion shows the like­li­hood of pos­si­ble out­comes.

  • Iden­ti­fied met­rics pro­ce­dures: Pro­ce­dures are put in place to mea­sure some vari­ables (e.g. about pro­ject progress or ex­ter­nal fac­tors) con­tinu­ally.

  • De­ci­sion op­ti­miza­tion: The fi­nal busi­ness de­ci­sion recom­men­da­tion is made (this is rarely a sim­ple “yes/​no” an­swer).

Fi­nal thoughts

Hub­bard’s book in­cludes two case stud­ies in which Hub­bard de­scribes how he led two fairly differ­ent clients (the EPA and U.S. Marine Corps) through each phase of the AIE pro­cess. Then, he closes the book with the fol­low­ing sum­mary:

  • If it’s re­ally that im­por­tant, it’s some­thing you can define. If it’s some­thing you think ex­ists at all, it’s some­thing you’ve already ob­served some­how.

  • If it’s some­thing im­por­tant and some­thing un­cer­tain, you have a cost of be­ing wrong and a chance of be­ing wrong.

  • You can quan­tify your cur­rent un­cer­tainty with cal­ibrated es­ti­mates.

  • You can com­pute the value of ad­di­tional in­for­ma­tion by know­ing the “thresh­old” of the mea­sure­ment where it be­gins to make a differ­ence com­pared to your ex­ist­ing un­cer­tainty.

  • Once you know what it’s worth to mea­sure some­thing, you can put the mea­sure­ment effort in con­text and de­cide on the effort it should take.

  • Know­ing just a few meth­ods for ran­dom sam­pling, con­trol­led ex­per­i­ments, or even merely im­prov­ing on the judg­ments of ex­perts can lead to a sig­nifi­cant re­duc­tion in un­cer­tainty.