The insularity critique of climate science

Note: Please see this post of mine for more on the project, my sources, and potential sources of bias.

One of the categories of critique that have been leveled against climate science is the critique of insularity. Broadly, it is claimed that the type of work that climate scientists are trying to do draws upon insight and expertise in many other domains, but climate scientists have historically failed to consult experts in those domains or even to follow well-documented best practices.

Some takeaways/conclusions

Note: I wrote a preliminary version of this before drafting the post, but after having done most of the relevant investigation. I reviewed and edited it prior to publication. Note also that I don’t justify these takeaways explicitly in my later discussion, because a lot of these come from general intuitions of mine and it’s hard to articulate how the information I received explicitly affected my reaching the takeaways. I might discuss the rationales behind these takeaways more in a later post.

  • Many of the criticisms are broadly on the mark: climate scientists should have consulted best practices in other domains, and in general should have either followed them or clearly explained the reasons for divergence.

  • However, this criticism is not unique to climate science: academia in general has suffered from problems of disciplines being relatively insular (UPDATE: Here’s Robin Hanson saying something similar). And many similar things may be true, albeit in different ways, outside academia.

  • One interesting possibility is that bad practices here operate via founder effects: for an area that starts off as relatively obscure and unimportant, setting up good practices may not be considered important. But as the area grows in importance, it is quite rare for the area to be cleaned up. People and institutions get used to the old ways of doing things. They have too much at stake to make reforms. This does suggest that it’s important to get things right early on.

  • (This is speculative, and not discussed in the post): The extent of insularity of a discipline seems to be an area where a few researchers can have a significant effect on the discipline. If a few reasonably influential climate scientists had pushed for more integration with and understanding of ideas from other disciplines, the history of climate science research would have been different.

Relevant domains climate scientists may have failed to use or learn from

  1. Forecasting research: Although climate scientists were engaging in an exercise that had a lot to do with forecasting, they neither cited research nor consulted experts in the domain of forecasting.

  2. Statistics: Climate scientists used plenty of statistics in their analysis. They did follow the basic principles of statistics, but in many cases used them incorrectly or combined them with novel approaches that were nonstandard and did not have clear statistical literature justifying their use.

  3. Programming and software engineering: Climate scientists used a lot of code, both for their climate models and for their analyses of historical climate. But their code failed to meet basic principles of decent programming, let alone good software engineering practices such as documentation, unit testing, consistent variable names, and version control.

  4. Publication of data, metadata, and code: This practice has been becoming increasingly common in some other sectors of academia and industry. Climate scientists failed to learn from econometrics and biomedical research, fields that had been struggling with some qualitatively similar problems and that had been moving to publishing data, metadata, and code.

Let’s look at each of these critiques in turn.

Critique #1: Failure to consider forecasting research

We’ll devote more attention to this critique, because it has been made, and addressed, cogently and in considerable detail.

J. Scott Armstrong (faculty page, Wikipedia) is one of the big names in forecasting. In 2007, Armstrong and Kesten C. Green co-authored a global warming audit (PDF of paper, webpage with supporting materials) for the Forecasting Principles website that was critical of the forecasting exercises by climate scientists used in the IPCC reports.

Armstrong and Green began their critique by noting the following:

  • The climate science literature did not reference any of the forecasting literature, and there was no indication that climate scientists had consulted forecasting experts, even though what they were doing was to quite an extent a forecasting exercise.

  • There was only one paper, by Stewart and Glantz, dating back to 1985, that could be described as a forecasting audit, and that paper was critical of the methodology of climate forecasting. That paper appears to have been cited very little in subsequent years.

  • Armstrong and Green tried to contact leading climate scientists. Of the few who responded, none listed specific forecasting principles they followed, or reasons for not following general forecasting principles; they pointed to the IPCC reports as the best source for forecasts. Armstrong and Green estimated that the IPCC report violated 72 of the 89 forecasting principles they were able to rate (their list of forecasting principles includes 140 principles, but they judged only 127 as applicable to climate forecasting, and were able to rate only 89 of those). No climate scientists responded to their invitation to provide their own ratings of the forecasting principles.

How significant are these general criticisms? It depends on the answers to the following questions:

  • In general, how much credence do you assign to the research on forecasting principles, and how strong a prior do you have in favor of these principles being applicable to a specific domain? I think the forecasting principles identified on the Forecasting Principles website are a reasonable starting point, and therefore any major forecasting exercise (or exercise that implicitly generates forecasts) should at any rate justify major points of departure from these principles.

  • How representative are the views of Armstrong and Green in the forecasting community? I have no idea about the representativeness of their specific views, but Armstrong in particular is high-status in the forecasting community (which I described a while back), and the Forecasting Principles website is one of the go-to sources, so material on the website is probably not too far from mainstream views in the forecasting community. (Note: I asked the question on Quora a while back, but haven’t received any answers.)

So it seems like there was arguably a failure of proper procedure in the climate science community in terms of consulting and applying practices from relevant domains. Still, how germane was it to the quality of their conclusions? Maybe it didn’t matter after all?

In Chapter 12 of The Signal and the Noise, statistician and forecaster Nate Silver offers the following summary of Armstrong and Green’s views:

  • First, Armstrong and Green contend that agreement among forecasters is not related to accuracy—and may reflect bias as much as anything else. “You don’t vote,” Armstrong told me. “That’s not the way science progresses.”

  • Next, they say the complexity of the global warming problem makes forecasting a fool’s errand. “There’s been no case in history where we’ve had a complex thing with lots of variables and lots of uncertainty, where people have been able to make econometric models or any complex models work,” Armstrong told me. “The more complex you make the model the worse the forecast gets.”

  • Finally, Armstrong and Green write that the forecasts do not adequately account for the uncertainty intrinsic to the global warming problem. In other words, they are potentially overconfident.


Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don’t (p. 382). Penguin Group US. Kindle Edition.

Silver addresses each of these in his book (read it to know what he says). Here are my own thoughts on the three points as put forth by Silver:

  • I think consensus among experts (to the extent that it does exist) should be taken as a positive signal, even if the experts aren’t good at forecasting. But certainly, the lack of interest or success in forecasting should dampen the magnitude of the positive signal. We should consider it likely that climate scientists have identified important potential phenomena, but we should be skeptical of any actual forecasts derived from their work.

  • I disagree somewhat with this point. I think forecasting could still be possible, but as of now there is little in the way of a successful track record of forecasting (as Green notes in a later draft paper). So forecasting efforts should continue, both simple ones (such as persistence, linear regression, and random walk with drift; see the sketch after this list) and ones based on climate models (both those in common use right now and others that give more weight to the PDO/AMO), but the jury is still out on the extent to which they work.

  • I agree here that many forecasters are potentially overconfident.
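
To make the “simple ones” concrete, here is a minimal sketch (in Python, on made-up numbers) of the three baseline methods mentioned in the second bullet: persistence, a fitted linear trend, and a random walk with drift. These are generic textbook baselines, not the specific implementations Armstrong or Green used.

```python
import numpy as np

# Illustrative annual temperature anomalies (degrees C); the numbers are made up.
history = np.array([0.12, 0.18, 0.10, 0.22, 0.25, 0.31, 0.28, 0.35, 0.40, 0.38])
horizon = 5  # number of years to forecast ahead

# 1. Persistence (Armstrong's "No Change" benchmark is in this spirit):
#    every future year is forecast to equal the last observed value.
persistence = np.full(horizon, history[-1])

# 2. Linear regression on time: fit a trend by least squares and extrapolate it.
years = np.arange(len(history))
slope, intercept = np.polyfit(years, history, deg=1)
future_years = np.arange(len(history), len(history) + horizon)
linear_trend = intercept + slope * future_years

# 3. Random walk with drift: last value plus the average historical year-to-year
#    change, accumulated over each year ahead.
drift = np.mean(np.diff(history))
random_walk_drift = history[-1] + drift * np.arange(1, horizon + 1)

print("persistence:        ", persistence)
print("linear trend:       ", np.round(linear_trend, 3))
print("random walk + drift:", np.round(random_walk_drift, 3))
```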

Some counterpoints to the Armstrong and Green critique:

  • One can argue that what climate scientists are doing isn’t forecasting at all, but scenario analysis. After all, the IPCC generates scenarios, but not forecasts. But as I discussed in an earlier post, scenario planning and forecasting are closely related, and even if scenarios aren’t direct explicit unconditional forecasts, they often involve implicit conditional forecasts. To its credit, the IPCC does seem to have used some best practices from the scenario planning literature in generating its emissions scenarios. But that is not part of the climate modeling exercise of the IPCC.

  • Many other domains that involve planning for the future don’t reference the forecasting literature. Examples include scenario planning (discussed here) and the related field of futures studies (discussed here). Insularity of disciplines from each other is a common feature (or bug) in much of academia. Can we really expect or demand that climate scientists hold themselves to a higher standard?

UPDATE: I forgot to mention in my original draft of the post that Armstrong challenged Al Gore to a bet pitting Armstrong’s No Change model against the IPCC model. Gore did not accept the bet, but Armstrong created the website (here) anyway to record the relative performance of the two models.

UPDATE 2: Read drnickbone’s comment and my replies for more information on the debate. drnickbone in particular points to responses from RealClimate and Skeptical Science, which I discuss in my response to his comment.

Critique #2: Inappropriate or misguided use of statistics, and failure to consult statisticians

To some extent, this overlaps with Critique #1, because best practices in forecasting include good use of statistical methods. However, this critique is a little broader: there are many parts of climate science not directly involved with forecasting where statistical methods still matter. Historical climate reconstruction is one such example. The purpose of such reconstructions is to get a better understanding of the sorts of climate that could occur and have occurred, and of how different aspects of the climate correlated. Unfortunately, historical climate data is not very reliable. How do we deal with different proxies for the climate variables we are interested in so that we can reconstruct those variables? A careful use of statistics is important here.
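
To give a flavor of what a reconstruction involves, here is a deliberately simplified sketch, entirely my own illustration on synthetic data and not the MBH98 procedure: calibrate a single proxy against the instrumental record over the period where the two overlap, then apply the fitted relationship to the full proxy record. Real reconstructions combine many proxies of varying quality, and quantifying the uncertainty of the resulting fit is exactly where the statistical disputes discussed below arise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 300 years of "true" temperature anomalies (unknown in reality),
# a proxy (say, tree-ring widths) that tracks temperature imperfectly, and an
# instrumental record that covers only the last 100 years.
n_years = 300
true_temp = np.cumsum(rng.normal(0.0, 0.05, n_years))         # slowly drifting climate
proxy = 1.5 * true_temp + rng.normal(0.0, 0.15, n_years)      # noisy linear proxy
instrumental = true_temp[-100:] + rng.normal(0.0, 0.05, 100)  # observed recent period

# Calibration: regress the instrumental record on the proxy over the overlap period.
slope, intercept = np.polyfit(proxy[-100:], instrumental, deg=1)

# Reconstruction: apply the fitted relationship to the full proxy record.
reconstruction = intercept + slope * proxy

# Verification: compare the reconstruction with the truth in the pre-instrumental
# period (possible here only because the data are synthetic).
err = reconstruction[:-100] - true_temp[:-100]
print("RMS reconstruction error (pre-instrumental):",
      round(float(np.sqrt(np.mean(err ** 2))), 3))
```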

Let’s consider an example that’s quite far removed from climate forecasting, but has (perhaps undeservedly) played an important role in the public debate on global warming: Michael Mann’s famed hockey stick (Wikipedia), discussed in detail in Mann, Bradley and Hughes (henceforth, MBH98) (available online here). The major critiques of the paper arose in a series of papers by McIntyre and McKitrick, the most important of them being their 2005 paper in Geophysical Research Letters (henceforth, MM05) (available online here).

I read about the controversy in the book The Hockey Stick Illusion by Andrew Montford (Amazon, Wikipedia), but the author also has a shorter article titled Caspar and the Jesus paper that covers the story as it unfolds from his perspective. While there’s a lot more to the hockey stick controversy than statistics alone, some of the main issues are statistical.

Unfortunately, I wasn’t able to resolve the statistical issues myself well enough to have an informed view. But my very crude intuition, as well as the statements made by statisticians as recorded below, supports Montford’s broad outline of the story. I’ll try to describe the broad critiques leveled from the statistical perspective:

  • Choice of centering and standardization: The data was centered on the 20th century, a method known as short-centering, which is bound to create a bias in favor of picking hockey stick-like shapes when doing principal components analysis (see the first sketch after this list). Each series was also standardized (divided by its standard deviation for the 20th century), which McIntyre argued was inappropriate.

  • Unusual choice of statistic used for significance: MBH98 used a statistic called the RE statistic (reduction of error statistic). This is a fairly unusual statistic to use. In fact, it doesn’t have a Wikipedia page, and practically the only material on the web (on Google and Google Scholar) about it was in relation to tree-ring research (the proxies used in MBH98 were tree rings). This should seem suspicious: why is tree-ring research using a statistic that’s basically unused outside the field? There are good reasons to avoid statistical constructs on which there is little statistical literature, because people don’t have a feel for how they behave. MBH98 could have used the R^2 statistic instead; in fact, they mentioned it in their paper but then ended up not using it (see the second sketch after this list for how the RE statistic is defined).

  • Incorrect calculation of the significance threshold: MM05 (plus subsequent comments by McIntyre) claims that not only is the RE statistic nonstandard, there were also problems with the way MBH98 used it. First, there is no theoretical distribution of the RE statistic, so calculating the cutoff needed to attain a particular significance level is a tricky exercise (this is one of many reasons why using the RE statistic may be ill-advised, according to McIntyre). MBH98 incorrectly calculated the cutoff value for 99% significance to be 0. The correct value according to McIntyre was about 0.54, whereas the actual RE statistic value for the data set in MBH98 was 0.48, i.e., not high enough. A later paper by Ammann and Wahl, cited by many as a vindication of MBH98, computed a similar cutoff of 0.52, so that the actual RE statistic value failed the significance test. So how did it manage to vindicate MBH98 when the value of the RE statistic failed the cutoff? They appear to have employed a novel statistical procedure, coming up with something called a calibration/verification RE ratio. McIntyre was quite critical of this, for reasons he described in detail here.
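
The short-centering point lends itself to a small simulation (the first sketch promised above). The code below is my own illustration in the spirit of MM05’s argument, not a reproduction of their procedure: it generates pure red-noise “proxies” with no signal in them, extracts the leading principal component using a mean taken over only the final stretch of the record, and compares how hockey-stick-like that component is with what conventional full-period centering gives.

```python
import numpy as np

rng = np.random.default_rng(42)

def leading_pc(series, center_rows):
    """First principal component of `series` (years x proxies), with each proxy
    centered on its mean over the rows selected by `center_rows`."""
    centered = series - series[center_rows].mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def hockey_index(pc, blade_len):
    """How far the mean of the last `blade_len` years departs from the mean of the
    rest, in units of the overall standard deviation (sign ignored)."""
    return abs(pc[-blade_len:].mean() - pc[:-blade_len].mean()) / pc.std()

n_years, n_proxies, blade = 400, 50, 80
index_short, index_full = [], []

for _ in range(200):
    # AR(1) "red noise" proxies containing no actual climate signal.
    noise = rng.normal(size=(n_years, n_proxies))
    proxies = np.zeros_like(noise)
    for t in range(1, n_years):
        proxies[t] = 0.7 * proxies[t - 1] + noise[t]

    short = np.zeros(n_years, dtype=bool)
    short[-blade:] = True                 # center only on the final years ("short-centering")
    full = np.ones(n_years, dtype=bool)   # conventional full-period centering

    index_short.append(hockey_index(leading_pc(proxies, short), blade))
    index_full.append(hockey_index(leading_pc(proxies, full), blade))

print("mean hockey-stick index, short-centering:", round(float(np.mean(index_short)), 2))
print("mean hockey-stick index, full centering: ", round(float(np.mean(index_full)), 2))
# MM05's finding, which this kind of simulation typically reproduces, is that
# short-centering yields markedly more hockey-stick-like leading PCs from pure noise.
```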
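
And here is the second sketch promised above: a minimal illustration of how the reduction of error (RE) statistic is usually defined in the tree-ring literature, together with a toy Monte Carlo benchmark of the kind used to argue about significance cutoffs. The fitting procedure and noise model here are simplified stand-ins for what MBH98, MM05, and Ammann and Wahl actually did; the figures of 0, 0.48, 0.52, and 0.54 quoted above come from those papers, not from this code.

```python
import numpy as np

rng = np.random.default_rng(1)

def re_statistic(observed_verif, predicted_verif, calibration_mean):
    """Reduction of error: 1 - SSE(reconstruction) / SSE(naive prediction), where the
    naive prediction is just the calibration-period mean, evaluated over the
    verification period. RE > 0 means the reconstruction beats that naive baseline."""
    sse_model = np.sum((observed_verif - predicted_verif) ** 2)
    sse_naive = np.sum((observed_verif - calibration_mean) ** 2)
    return 1.0 - sse_model / sse_naive

def ar1(n, rho, rng):
    """Simple AR(1) 'red noise' series."""
    x = np.zeros(n)
    eps = rng.normal(size=n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + eps[t]
    return x

# Toy Monte Carlo benchmark: what RE values arise by chance when a "reconstruction"
# is fit to a proxy that is pure noise, unrelated to the target? The 99th percentile
# of this null distribution is one way of setting a 99% significance cutoff; how the
# noise and the fitting procedure should be specified is exactly what the papers
# discussed above argued about.
n_years, n_cal = 150, 90            # e.g., a 1902-1995-style calibration window
cal = slice(n_years - n_cal, n_years)
verif = slice(0, n_years - n_cal)

re_null = []
for _ in range(2000):
    target = ar1(n_years, 0.5, rng)   # pseudo "temperature"
    proxy = ar1(n_years, 0.5, rng)    # pseudo proxy with no real relationship to it
    slope, intercept = np.polyfit(proxy[cal], target[cal], deg=1)
    predicted = intercept + slope * proxy[verif]
    re_null.append(re_statistic(target[verif], predicted, target[cal].mean()))

print("99% RE cutoff under this toy null:", round(float(np.quantile(re_null, 0.99)), 2))
```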

There has been a lengthy debate on the subject, plus two external inquiries and reports: the NAS Panel Report headed by Gerry North, and the Wegman Report headed by Edward Wegman. Both agreed with the statistical criticisms made by McIntyre, but the NAS report did not make any broader comments on what this says about the discipline or about the general hockey stick hypothesis, while the Wegman report was more explicit in its criticism.

The Wegman Report made the insularity critique in some detail:

    In general, we found MBH98 and MBH99 to be somewhat obscure and incomplete and the criticisms of MM03/05a/05b to be valid and compelling. We also comment that they were attempting to draw attention to the discrepancies in MBH98 and MBH99, and not to do paleoclimatic temperature reconstruction. Normally, one would try to select a calibration dataset that is representative of the entire dataset. The 1902-1995 data is not fully appropriate for calibration and leads to a misuse in principal component analysis. However, the reasons for setting 1902-1995 as the calibration point presented in the narrative of MBH98 sounds reasonable, and the error may be easily overlooked by someone not trained in statistical methodology. We note that there is no evidence that Dr. Mann or any of the other authors in paleoclimatology studies have had significant interactions with mainstream statisticians.

    In our further exploration of the social network of authorships in temperature reconstruction, we found that at least 43 authors have direct ties to Dr. Mann by virtue of coauthored papers with him. Our findings from this analysis suggest that authors in the area of paleoclimate studies are closely connected and thus ‘independent studies’ may not be as independent as they might appear on the surface. This committee does not believe that web logs are an appropriate forum for the scientific debate on this issue.

    It is important to note the isolation of the paleoclimate community; even though they rely heavily on statistical methods they do not seem to be interacting with the statistical community. Additionally, we judge that the sharing of research materials, data and results was haphazardly and grudgingly done. In this case we judge that there was too much reliance on peer review, which was not necessarily independent. Moreover, the work has been sufficiently politicized that this community can hardly reassess their public positions without losing credibility. Overall, our committee believes that Mann’s assessments that the decade of the 1990s was the hottest decade of the millennium and that 1998 was the hottest year of the millennium cannot be supported by his analysis.

McIntyre has a lengthy blog post summarizing what he sees as the main parts of the NAS Panel Report, the Wegman Report, and other statements made by statisticians critical of MBH98.

Critique #3: Inadequate use of software engineering, project management, and coding documentation and testing principles

In the aftermath of Climategate, most public attention was drawn to the content of the emails. But apart from the emails, data and code were also leaked, and this gave the world an inside view of the code that’s used to simulate the climate. A number of criticisms of the coding practices emerged.

Chicago Boyz had a lengthy post titled Scientists are not Software Engineers that noted the sloppiness in the code and some of its implications. The post was also quick to point out that poor-quality code is not unique to climate science: it is a general problem with large-scale projects that arise from small-scale academic research growing beyond what the coders originally intended, with no systematic effort made to refactor the code. (If you have thoughts on the general prevalence of good software engineering practices in code for academic research, feel free to share them by answering my Quora question here; if you have insights on climate science code in particular, answer my Quora question here.) Below are some excerpts from the post:

    No, the real shocking revelation lies in the computer code and data that were dumped along with the emails. Arguably, these are the most important computer programs in the world. These programs generate the data that is used to create the climate models which purport to show an inevitable catastrophic warming caused by human activity. It is on the basis of these programs that we are supposed to massively reengineer the entire planetary economy and technology base.

    The dumped files revealed that those critical programs are complete and utter train wrecks.

    [...]

    The design, production and maintenance of large pieces of software require project management skills greater than those required for large material construction projects. Computer programs are the most complicated pieces of technology ever created. By several orders of magnitude they have more “parts” and more interactions between those parts than any other technology.

    Software engineers and software project managers have created procedures for managing that complexity. It begins with seemingly trivial things like style guides that regulate what names programmers can give to attributes of software and the associated datafiles. Then you have version control in which every change to the software is recorded in a database. Programmers have to document absolutely everything they do. Before they write code, there is extensive planning by many people. After the code is written comes the dreaded code review in which other programmers and managers go over the code line by line and look for faults. After the code reaches its semi-complete form, it is handed over to Quality Assurance which is staffed by drooling, befanged, malicious sociopaths who live for nothing more than to take a programmer’s greatest, most elegant code and rip it apart and possibly sexually violate it. (Yes, I’m still bitter.)

    Institutions pay for all this oversight and double-checking and programmers tolerate it because it is impossible to create a large, reliable and accurate piece of software without such procedures firmly in place. Software is just too complex to wing it.

    Clearly, nothing like these established procedures was used at CRU. Indeed, the code seems to have been written overwhelmingly by just two people (one at a time) over the past 30 years. Neither of these individuals was a formally trained programmer and there appears to have been no project planning or even formal documentation. Indeed, the comments of the second programmer, the hapless “Harry”, as he struggled to understand the work of his predecessor are now being read as a kind of programmer’s Icelandic saga describing a death march through an inexplicable maze of ineptitude and boobytraps.

    [...]

    A lot of the CRU code is clearly composed of hacks. Hacks are informal, off-the-cuff solutions that programmers think up on the spur of the moment to fix some little problem. Sometimes they are so elegant as to be awe inspiring and they enter programming lore. More often, however, they are crude, sloppy and dangerously unreliable. Programmers usually use hacks as a temporary quick solution to a bottleneck problem. The intention is always to come back later and replace the hack with a more well-thought-out and reliable solution, but with no formal project management and time constraints it’s easy to forget to do so. After a time, more code evolves that depends on the existence of the hack, so replacing it becomes a much bigger task than just replacing the initial hack would have been.

    (One hack in the CRU software will no doubt become famous. The programmer needed to calculate the distance and overlapping effect between weather monitoring stations. The non-hack way to do so would be to break out the trigonometry and write a planned piece of code to calculate the spatial relationships. Instead, the CRU programmer noticed that the visualization software that displayed the program’s results already plotted the stations’ locations, so he sampled individual pixels on the screen and used the color of the pixels between the stations to determine their location and overlap! This is a fragile hack because if the visualization changes the colors it uses, the components that depend on the hack will fail silently.)

For some choice comments excerpted from a code file, see here.
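
As an aside, the “non-hack way” the post alludes to (breaking out the trigonometry to compute spatial relationships between stations) is a standard calculation. Below is a minimal sketch using the haversine great-circle formula; it is purely illustrative and is not based on CRU’s actual code or requirements.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres between two points
    given in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: distance between two hypothetical station locations.
print(round(great_circle_km(52.2, 0.1, 51.5, -0.1), 1), "km")
```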

Critique #4: Practices of publication of data, metadata, and code (that had gained traction in other disciplines)

When McIntyre wanted to replicate MBH98, he emailed Mann asking for his data and code. Mann, though initially cooperative, soon started trying to fend McIntyre off. Part of this was because he thought McIntyre was out to find something wrong with his work (a well-grounded suspicion). But part of it was also that his data and code were a mess: he didn’t maintain them in a way that he’d be comfortable sharing with anybody other than an already sympathetic academic. And, more importantly, as Mann’s colleague Stephen Schneider noted, nobody asked for the code and underlying data during peer review, and most journals at the time did not require authors to submit or archive their code and data upon submission or acceptance of their paper. This also closely relates to Critique #3: a requirement or expectation that one’s data and code will be published along with one’s paper might make people more careful to follow good coding practices and to avoid using various “tricks” and “hacks” in their code.

Here’s how Andrew Montford puts it in The Hockey Stick Illusion:

    The Hockey Stick affair is not the first scandal in which important scientific papers underpinning government policy positions have been found to be non-replicable – McCullough and McKitrick review a litany of sorry cases from several different fields – but it does underline the need for a more solid basis on which political decision-making should be based. That basis is replication. Centuries of scientific endeavour have shown that truth emerges only from repeated experimentation and falsification of theories, a process that only begins after publication and can continue for months or years or decades thereafter. Only through actually reproducing the findings of a scientific paper can other researchers be certain that those findings are correct. In the early history of European science, publication of scientific findings in a journal was usually adequate to allow other researchers to replicate them. However, as science has advanced, the techniques used have become steadily more complicated and consequently more difficult to explain. The advent of computers has allowed scientists to add further layers of complexity to their work and to handle much larger datasets, to the extent that a journal article can now, in most cases, no longer be considered a definitive record of a scientific result. There is simply insufficient space in the pages of a print journal to explain what exactly has been done. This has produced a rather profound change in the purpose of a scientific paper. As geophysicist Jon Claerbout puts it, in a world where powerful computers and vast datasets dominate scientific research, the paper ‘is not the scholarship itself, it is merely advertising of the scholarship’.b The actual scholarship is the data and code used to generate the figures presented in the paper and which underpin its claims to uniqueness. In passing we should note the implications of Claerbout’s observations for the assessment of our conclusions in the last section: by using only peer review to assess the climate science literature, the policymaking community is implicitly expecting that a read-through of a partial account of the research performed will be sufficient to identify any errors or other problems with the paper. This is simply not credible. With a full explanation of methodology now often not possible from the text of a paper, replication can usually only be performed if the data and code are available. This is a major change from a hundred years ago, but in the twenty-first century it should be a trivial problem to address. In some specialisms it is just that. We have seen, however, how almost every attempt to obtain data from climatologists is met by a wall of evasion and obfuscation, with journals and funding bodies either unable or unwilling to assist. This is, of course, unethical and unacceptable, particularly for publicly funded scientists. The public has paid for nearly all of this data to be collated and has a right to see it distributed and reused.
    As the treatment of the Loehle paper shows,c for scientists to open themselves up to criticism by allowing open review and full data access is a profoundly uncomfortable process, but the public is not paying scientists to have comfortable lives; they are paying for rapid advances in science. If data is available, doubts over exactly where the researcher has started from fall away. If computer code is made public too, then the task of replication becomes simpler still and all doubts about the methodology are removed. The debate moves on from foolish and long-winded arguments about what was done (we still have no idea exactly how Mann calculated his confidence intervals) onto the real scientific meat of whether what was done was correct. As we look back over McIntyre’s work on the Hockey Stick, we see that much of his time was wasted on trying to uncover from the obscure wording of Mann’s papers exactly what procedures had been used. Again, we can only state that this is entirely unacceptable for publicly funded science and is unforgiveable in an area of such enormous policy importance. As well as helping scientists to find errors more quickly, replication has other benefits that are not insignificant. David Goodstein of the California Institute of Technology has commented that the possibility that someone will try to replicate a piece of work is a powerful disincentive to cheating – in other words, it can help to prevent scientific fraud.251 Goodstein also notes that, in reality, very few scientific papers are ever subject to an attempt to replicate them. It is clear from Stephen Schneider’s surprise when asked to obtain the data behind one of Mann’s papers that this criticism extends into the field of climatology.d In a world where pressure from funding agencies and the demands of university careers mean that academics have to publish or perish, precious few resources are free to replicate the work of others. In years gone by, some of the time of PhD students might have been devoted to replicating the work of rival labs, but few students would accept such a menial task in the modern world: they have their own publication records to worry about. It is unforgiveable, therefore, that in paleoclimate circles, the few attempts that have been made at replication have been blocked by all of the parties in a position to do something about it. Medical science is far ahead of the physical sciences in the area of replication. Doug Altman, of Cancer Research UK’s Medical Statistics group, has commented that archiving of data should be mandatory and that a failure to retain data should be treated as research misconduct.252 The introduction of this kind of regime to climatology could have nothing but a salutary effect on its rather tarnished reputation. Other subject areas, however, have found simpler and less confrontational ways to deal with the problem. In areas such as econometrics, which have long suffered from politicisation and fraud, several journals have adopted clear and rigorous policies on archiving of data. At publications such as the American Economic Review, Econometrica and the Journal of Money, Credit and Banking, a manuscript that is submitted for publication will simply not be accepted unless data and fully functional code are available.
    In other words, if the data and code are not public then the journals will not even consider the article for publication, except in very rare circumstances. This is simple, fair and transparent and works without any dissent. It also avoids any rancorous disagreements between journal and author after the event. Physical science journals are, by and large, far behind the econometricians on this score. While most have adopted one pious policy or another, giving the appearance of transparency on data and code, as we have seen in the unfolding of this story, there has been a near-complete failure to enforce these rules. This failure simply stores up potential problems for the editors: if an author refuses to release his data, the journal is left with an enforcement problem from which it is very difficult to extricate themselves. Their sole potential sanction is to withdraw the paper, but this then merely opens them up to the possibility of expensive lawsuits. It is hardly surprising that in practice such drastic steps are never taken. The failure of climatology journals to enact strict policies or enforce weaker ones represents a serious failure in the system of assurance that taxpayer-funded science is rigorous and reliable. Funding bodies claim that they rely on journals to ensure data availability. Journals want a quiet life and will not face down the academics who are their lifeblood. Will Nature now go back to Mann and threaten to withdraw his paper if he doesn’t produce the code for his confidence interval calculations? It is unlikely in the extreme. Until politicians and journals enforce the sharing of data, the public can gain little assurance that there is any real need for the financial sacrifices they are being asked to accept. Taking steps to assist the process of replication will do much to improve the conduct of climatology and to ensure that its findings are solidly based, but in the case of papers of pivotal importance politicians must also go further. Where a paper like the Hockey Stick appears to be central to a set of policy demands or to the shaping of public opinion, it is not credible for policymakers to stand back and wait for the scientific community to test the veracity of the findings over the years following publication. Replication and falsification are of little use if they happen after policy decisions have been made. The next lesson of the Hockey Stick affair is that if governments are truly to have assurance that climate science is a sound basis for decision-making, they will have to set up a formal process for replicating key papers, one in which the oversight role is performed by scientists who are genuinely independent and who have no financial interest in the outcome.

Montford, Andrew (2011-06-06). The Hockey Stick Illusion (pp. 379-383). Stacey Arts. Kindle Edition.