[Part 1] Amplifying generalist research via forecasting – Models of impact and challenges

This post cov­ers our mod­els of im­pact and challenges with our ex­plo­ra­tion in am­plify­ing gen­er­al­ist re­search us­ing fore­cast­ing. It is ac­com­panied by a sec­ond post with a high-level de­scrip­tion of those mod­els, and more de­tailed de­scrip­tion of ex­per­i­ment set-up and re­sults.

Many of the world’s most press­ing prob­lems re­quire in­tel­lec­tual progress to solve [1]. Find­ing ways to in­crease the rate of in­tel­lec­tual progress might be a highly promis­ing way of solv­ing those prob­lems.

One component of this is generalist research: the ability to judge and synthesise claims across many different fields without detailed specialist knowledge of those fields, in order to, for example, prioritise potential new cause areas or allocate grant funding. Organisations represented at the EA Leaders Forum expect this to be among the most in-demand skills for their organisations over the coming 5 years (2018 survey, 2019 survey).

In light of this, we re­cently tested a method of in­creas­ing the scale and qual­ity of gen­er­al­ist re­search, ap­plied to re­search­ing the in­dus­trial rev­olu­tion [2], us­ing Fore­told.io (an on­line pre­dic­tion plat­form).

In particular, we found that, when faced with claims and questions like:

“Pre-In­dus­trial Bri­tain had a le­gal cli­mate more fa­vor­able to in­dus­tri­al­iza­tion than con­ti­nen­tal Europe”

And

“Pre-In­dus­trial Revolu­tion, av­er­age French wage was what per­cent of the Bri­tish wage?”

a small crowd of forecasters recruited from the EA and rationality communities very successfully predicted the judgements of a trusted generalist researcher, achieving a benefit-cost ratio of around 72% compared to the original researcher. They also outperformed a group of external online crowdworkers.

Moreover, we believe this method can be scaled to answer many more questions than a single researcher could, and that it has applications in domains other than research, such as grantmaking, hiring and reviewing content.

We pre­limi­nar­ily re­fer to this method as “am­plifi­ca­tion” given its similar­ity to ideas from Paul Chris­ti­ano’s work on Iter­ated Distil­la­tion and Am­plifi­ca­tion in AI al­ign­ment (see e.g. this).

This was an ex­plo­ra­tory pro­ject whose pur­pose was to build in­tu­ition for sev­eral pos­si­ble challenges. It cov­ered sev­eral ar­eas that could be well suited for more nar­row, tra­di­tional sci­en­tific stud­ies later on. As such, the sam­ple size was small and no sin­gle re­sult was highly ro­bust.

How­ever, it did lead to sev­eral medium-sized take­aways that we think should be use­ful for in­form­ing fu­ture re­search di­rec­tions and prac­ti­cal ap­pli­ca­tions.

This post be­gins with a brief overview of our re­sults. We then share some mod­els of why the cur­rent pro­ject might be im­pact­ful and ex­cit­ing, fol­lowed by some challenges this ap­proach faces.

Overview of the set-up and results

(This sec­tion gives a very cur­sory overview of the set-up and re­sults. A de­tailed re­port can be found in this post.)

The ba­sic set-up of the pro­ject is shown in the fol­low­ing di­a­gram, and de­scribed be­low.

A two-sen­tence ver­sion would be:

Forecasters predicted the conclusions that would be reached by Elizabeth Van Nostrand, a generalist researcher, before she conducted a study on the accuracy of various historical claims. We randomly sampled a subset of research claims for her to actually evaluate. And since we can set that sampling probability arbitrarily low, this method is not bottlenecked by her time.

The graph below shows the evolution of the accuracy of the crowd prediction over time, starting from Elizabeth Van Nostrand’s prior. Predictions were submitted separately by two groups of forecasters: one drawn from a mailing list of people interested in participating in forecasting experiments (recruited from effective altruism-adjacent events and other forecasting platforms), and one recruited from Positly, an online platform for crowdworkers.

The y-axis shows the ac­cu­racy score on a log­a­r­ith­mic scale, and the x-axis shows how far along the ex­per­i­ment is. For ex­am­ple, 14 out of 28 days would cor­re­spond to 50%. The thick lines show the av­er­age score of the ag­gre­gate pre­dic­tion, across all ques­tions, at each time-point. The shaded ar­eas show the stan­dard er­ror of the scores, so that the graph might be in­ter­preted as a guess of how the two com­mu­ni­ties would pre­dict a ran­dom new ques­tion.
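
For concreteness, the average and standard error shown in such a graph can be computed roughly as follows. This is a minimal sketch with made-up scores, not the experiment’s actual data:

```python
import math
import statistics

# Hypothetical per-question log scores of the aggregate prediction
# at a single time-point (not the actual experiment data).
scores = [-0.3, -0.8, -0.2, -1.5, -0.4, -0.6]

mean_score = statistics.mean(scores)
# Standard error of the mean: a rough guess of how the average would move
# if the forecasters were given a new, randomly drawn question.
std_error = statistics.stdev(scores) / math.sqrt(len(scores))

print(f"average log score: {mean_score:.2f} ± {std_error:.2f}")
```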

One of our key take­aways from the ex­per­i­ment is that our sim­ple al­gorithm for ag­gre­gat­ing pre­dic­tions performed sur­pris­ingly well in pre­dict­ing Eliz­a­beth’s re­search out­put—but only for the net­work-ad­ja­cent fore­cast­ers.

Another way to understand the performance of the aggregate is to note that the aggregate of network-adjacent forecasters had an average log score of −0.5. To get a rough sense of what that means, it’s the score you’d get by being 70% confident in a binary event and being correct (though note that this binary comparison merely serves to provide intuition; there are technical details making the comparison to a distributional setting a bit tricky).

By com­par­i­son, the crowd­work­ers and Eliz­a­beth’s pri­ors had a very poor log score of around −4. This is roughly similar to the score you’d get if you pre­dict an event to be ~5% likely, and it still hap­pens.
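
To make the correspondence between log scores and binary confidences concrete, here is a minimal sketch assuming base-2 log scores (the post does not specify the base, and the actual experiment scored full distributions, so this is only an analogy):

```python
import math

def binary_log_score(prob_assigned_to_outcome: float) -> float:
    """Log score (base 2) for a binary event that actually happens."""
    return math.log2(prob_assigned_to_outcome)

print(binary_log_score(0.7))   # ~ -0.51: close to the network-adjacent aggregate's -0.5
print(binary_log_score(0.05))  # ~ -4.32: close to the crowdworkers' and prior's roughly -4
```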

We also calculated a benefit/cost ratio, as follows:

Benefit/cost ratio = (% of value provided by forecasters relative to the evaluator) / (% cost of forecasters relative to the evaluator)

We mea­sured “value pro­vided” as the re­duc­tion in un­cer­tainty weighted by the im­por­tance of the ques­tions on which un­cer­tainty was re­duced.

Re­sults were as fol­lows.

In other words, each unit of resources invested in the network-adjacent forecasters provided 72% as much return as investing it in Elizabeth directly, whereas each unit invested in the crowdworkers provided negative returns, as they tended to be less accurate than Elizabeth’s prior.
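
As an illustration of the formula above, with hypothetical percentages rather than the measured values from the experiment:

```python
# Hypothetical fractions, both measured relative to the evaluator (Elizabeth).
value_fraction = 0.36  # forecasters provide 36% of the uncertainty reduction she would
cost_fraction = 0.50   # at 50% of her cost

benefit_cost_ratio = value_fraction / cost_fraction
print(f"benefit/cost ratio: {benefit_cost_ratio:.0%}")  # -> 72%
```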

Over­all, we ten­ta­tively view this as an ex­is­tence proof of the pos­si­bil­ity of am­plify­ing gen­er­al­ist re­search, and in the fu­ture are in­ter­ested in ob­tain­ing more rigor­ous re­sults and op­ti­mis­ing the benefit-cost ra­tio.

Models of impact

This sec­tion sum­marises some differ­ent per­spec­tives on what the cur­rent ex­per­i­ment is try­ing to ac­com­plish and why that might be ex­cit­ing.

There are sev­eral per­spec­tives here given that the ex­per­i­ment was de­signed to ex­plore mul­ti­ple rele­vant ideas, rather than test­ing a par­tic­u­lar, nar­row hy­poth­e­sis.

As a re­sult, the cur­rent de­sign is not op­ti­mis­ing very strongly for any of these pos­si­ble uses, and it is also plau­si­ble that its im­pact and effec­tive­ness will vary widely be­tween uses.

To sum­marise, the mod­els are as fol­lows.

  • Miti­gat­ing ca­pac­ity bot­tle­necks. The effec­tive al­tru­ism and ra­tio­nal­ity com­mu­ni­ties face rather large bot­tle­necks in many ar­eas, such as al­lo­cat­ing fund­ing, del­e­gat­ing re­search, vet­ting tal­ent and re­view­ing con­tent. The cur­rent setup might provide a means of miti­gat­ing some of those—a scal­able mechanism of out­sourc­ing in­tel­lec­tual la­bor.

  • A way for in­tel­lec­tual tal­ent to build and demon­strate their skills. Even if this set-up can’t make new in­tel­lec­tual progress, it might be use­ful to have a venue where ju­nior re­searchers can demon­strate their abil­ity to pre­dict the con­clu­sions of se­nior re­searchers. This might provide an ob­jec­tive sig­nal of epistemic abil­ities not de­pen­dent on de­tailed so­cial knowl­edge.

  • Ex­plor­ing new in­sti­tu­tions for col­lab­o­ra­tive in­tel­lec­tual progress. Academia has a vast back­log of promis­ing ideas for in­sti­tu­tions to help us think bet­ter in groups. Cur­rently we seem bot­tle­necked by prac­ti­cal im­ple­men­ta­tion and product de­vel­op­ment.

  • Get­ting more data on em­piri­cal claims made by the Iter­ated Am­plifi­ca­tion AI al­ign­ment agenda. Th­ese ideas in­spired the ex­per­i­ment. (How­ever, our aim was more prac­ti­cal and short-term, rather than look­ing for the­o­ret­i­cal in­sights use­ful in the long-term.)

  • Ex­plor­ing fore­cast­ing with dis­tri­bu­tions. Lit­tle is known about hu­mans do­ing fore­cast­ing with full dis­tri­bu­tions rather than point es­ti­mates (e.g. “79%”), partly be­cause there hasn’t been easy tool­ing for such ex­per­i­ments. This ex­per­i­ment gave us some cheap data on this ques­tion.

  • Forecasting fuzzy things. A major challenge with forecasting tournaments is the need to concretely specify questions, in order to clearly determine who was right and allocate payouts. The current experiment tries to get the best of both worlds: the incentive properties of forecasting tournaments and the flexibility of generalist research in tackling more nebulous questions.

  • Shoot­ing for un­known un­knowns. In ad­di­tion to be­ing an “ex­per­i­ment”, this pro­ject is also an “ex­plo­ra­tion”. We have an in­tu­ition that there are in­ter­est­ing things to be dis­cov­ered at the in­ter­sec­tion of fore­cast­ing, mechanism de­sign, and gen­er­al­ist re­search. But we don’t yet know what they are.

Miti­gat­ing ca­pac­ity bottlenecks

The effec­tive al­tru­ism and ra­tio­nal­ity com­mu­ni­ties face rather large bot­tle­necks in many ar­eas, such as al­lo­cat­ing fund­ing, del­e­gat­ing re­search, vet­ting tal­ent and re­view­ing con­tent.

Pre­dic­tion plat­forms (for ex­am­ple as used with the cur­rent “am­plifi­ca­tion” set-up) might be a promis­ing tool to tackle some of those prob­lems, for sev­eral rea­sons. In brief, they might act as a scal­able way to out­source in­tel­lec­tual la­bor.

First, we’re us­ing quan­ti­ta­tive pre­dic­tions and scor­ing rules. This al­lows sev­eral things.

  • We can directly measure how accurate each contribution was, and separately measure how useful it was in improving the aggregate. The actual calculations are quite simple and, with some engineering effort, can scale to allocating credit (in terms of money, points, reputation, etc.) to hundreds of users in an incentive-compatible way.

  • We can ag­gre­gate differ­ent con­tri­bu­tions in an au­to­matic and rigor­ous way [3].

  • We have a shared, pre­cise lan­guage for in­ter­pret­ing con­tri­bu­tions.

Con­trast re­ceiv­ing 100 pre­dic­tions and re­ceiv­ing 20 Google docs. The lat­ter would be pro­hibitively difficult to read through, does not have a straight­for­ward means of ag­gre­ga­tion, and might not even be analysable in an “ap­ples to ap­ples” com­par­i­son.
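
As a sketch of what automatic aggregation and scoring could look like, here is a minimal example that pools several forecasters’ distributions into an equal-weight mixture and scores it with a log score. The distributions, weights, and scoring details are illustrative assumptions; Foretold’s actual algorithm (e.g. track-record weighting, as in footnote [3]) may differ:

```python
import numpy as np

# Each forecaster submits a distribution; here we represent each one by samples
# from a hypothetical normal (the real platform accepts richer input formats).
rng = np.random.default_rng(0)
forecasts = [rng.normal(loc=mu, scale=sigma, size=10_000)
             for mu, sigma in [(60, 10), (75, 5), (70, 15)]]

# A simple equal-weight mixture: pool all samples.
aggregate = np.concatenate(forecasts)

# Score the aggregate against the evaluator's resolved value with a log score,
# using a histogram density estimate of the mixture.
density, edges = np.histogram(aggregate, bins=50, density=True)
resolved_value = 72.0
bin_index = np.clip(np.digitize(resolved_value, edges) - 1, 0, len(density) - 1)
log_score = np.log(density[bin_index] + 1e-9)
print(f"aggregate log score at resolution {resolved_value}: {log_score:.2f}")
```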

How­ever, the big cost we pay to en­able these benefits is that we are adding for­mal­ism, and con­strain­ing peo­ple to ex­press their be­liefs within the par­tic­u­lar for­mal­ism/​on­tol­ogy of prob­a­bil­ities and dis­tri­bu­tions. We dis­cuss this more in the sec­tion on challenges be­low.

Se­cond, we’re us­ing an in­ter­net plat­form. This makes it eas­ier for peo­ple from differ­ent places to col­lab­o­rate, and to or­ganise and analyse their con­tri­bu­tions. More­over, given the benefits of quan­tifi­ca­tion noted above, we can freely open the tour­na­ment to peo­ple with­out sub­stan­tial cre­den­tials, since we’re not con­strained in our ca­pac­ity to eval­u­ate their work.

Third, we’re using a mechanism specifically designed to overcome capacity bottlenecks. The key to scalability is that forecasters do not know which claims will be evaluated, and so are incentivised to make their honest, most accurate predictions on all of them. This remains true even as many more claims are added (as long as forecasters expect the rewards for participating to remain similar).
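
A minimal sketch of the randomisation step, with hypothetical numbers (the actual sampling probability and number of claims were set per experiment):

```python
import random

claims = [f"claim_{i}" for i in range(200)]  # far more claims than the evaluator could check
evaluation_probability = 0.05                # each claim has, say, a 5% chance of being evaluated

# Forecasters don't know in advance which claims will be drawn, so their expected
# reward on every claim depends on prediction quality, which incentivises honest
# effort everywhere.
sampled = [c for c in claims if random.random() < evaluation_probability]
print(f"{len(sampled)} of {len(claims)} claims receive a full evaluation")
```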

In effect, we’re shifting the bottleneck from access to a few researchers to access to prize money and competent forecasters. It seems highly implausible that all kinds of intellectual work could be cost-effectively outsourced this way. However, if some work could be outsourced and performed at, say, 10% of the quality but at only 1% of the cost, that could still be very worthwhile. For example, in trying to review hundreds of factual claims, the initial forecasting could be used as a wide-sweeping first-pass filter, grabbing the low-hanging fruit while also identifying which questions are more difficult and will need attention from more senior researchers.

Over­all, this is a model for how things might work, but it is as of yet highly un­cer­tain whether this tech­nique will ac­tu­ally be effec­tive in tack­ling bot­tle­necks of var­i­ous kinds. We provide some pre­limi­nary data from this ex­per­i­ment in the “Cost-effec­tive­ness” sec­tion be­low.

A way for in­tel­lec­tual tal­ent to build and demon­strate their skills

The fol­low­ing seems broadly true to some of us:

  • Some­one who can pre­dict my be­liefs likely has a good model of how I think. (E.g. “I ex­pect you to re­ject this pa­per’s val­idity based on the sec­ond ex­per­i­ment, but also think you’d change your mind if you thought they had pre-reg­istered that method­ol­ogy”.)

  • Someone who can predict my beliefs and yet disagrees with me is someone I should listen to carefully. They seem to understand my model and still reject it, which suggests they know something I don’t.

  • It seems pos­si­ble for per­son X to pre­dict a fair num­ber of a more epistem­i­cally com­pe­tent per­son Y’s be­liefs—even be­fore per­son X is as epistem­i­cally com­pe­tent as Y. And in that case, do­ing so is ev­i­dence that per­son X is mov­ing in the right di­rec­tion.

If these claims are true, we might use some novel ver­sions of fore­cast­ing tour­na­ments as a scal­able sys­tem to iden­tify and de­velop epistemic tal­ent. This po­ten­tial benefit looks quite differ­ent from us­ing fore­cast­ing tour­na­ments to help us solve novel prob­lems or gain bet­ter or cheaper in­for­ma­tion than we could oth­er­wise.

Currently there is no “driver’s license” for rationality or effective altruism. Demonstrating your abilities requires navigating a system of reading and writing certain blog posts, finding connections to more senior people, and going through work trials tailored to particular organisations. This system does not scale very well, and it often requires social knowledge and an ability to “be in the right place at the right time” which does not necessarily correlate strongly with pure epistemic ability.

It seems very im­plau­si­ble that open fore­cast­ing tour­na­ments could solve the en­tire prob­lem here. But it seems quite plau­si­ble that it could offer im­prove­ments on the mar­gin, and be­come a re­li­able cre­den­tial­ing mechanism for a limited class of non-triv­ial epistemic abil­ities.

For ex­am­ple, EA stu­dent groups with mem­bers con­sid­er­ing cause pri­ori­ti­sa­tion ca­reer paths might or­ganise tour­na­ments where their mem­bers fore­cast the con­clu­sions of OpenPhil write-ups, or main­tain and up­date their own dis­tri­bu­tions over key vari­ables in GiveWell’s cost-effec­tive­ness mod­els.

By run­ning this ex­per­i­ment, writ­ing up the re­sults, and im­prov­ing the Fore­told plat­form, we hope to provide in­fras­truc­ture that will al­low oth­ers in­ter­ested in this benefit to run their own ex­per­i­ments.

Ex­plor­ing new in­sti­tu­tions for col­lab­o­ra­tive in­tel­lec­tual progress

Many of our current most important institutions, like governments and universities, run on mechanisms designed hundreds of years ago, before fields like microeconomics and statistics were developed. They suffer from many predictable and well-understood incentive problems: poor replication rates of scientific findings, following from a need to optimise for publications; the election of dangerous leaders, due to the use of provably suboptimal voting systems; or the failure to adequately fund public goods like high-quality explanations of difficult concepts, due to free-rider problems, to name just a few.

The aca­demic liter­a­ture in eco­nomics and mechanism de­sign has a vast back­log of de­signs for new in­sti­tu­tions that could solve these and other prob­lems. One key bot­tle­neck now seems to be im­ple­men­ta­tion.

For example, Ethereum founder Vitalik Buterin has argued that the key skill required is product development: making novel mechanisms with better incentives work in practice (search for “product people” in the linked interview).

Similarly, Robin Han­son has ar­gued that there is a large, promis­ing liter­a­ture on more effec­tive in­sti­tu­tions, but “what we need most [… is lots of con­crete tri­als.] To get in­volved in the messy de­tails of an or­ga­ni­za­tion, and just try out differ­ent vari­a­tions un­til [we] see some­thing that ac­tu­ally works” [4], [5].

Part of the spirit of the current experiment is an attempt to do just this, and, in particular, to do so in the domain of research and intellectual progress.

Get­ting more data on em­piri­cal claims made by the Iter­ated Am­plifi­ca­tion AI al­ign­ment agenda

The key mechanism un­der­ly­ing this ex­per­i­ment, and its use of pre­dic­tion and ran­domi­sa­tion, is based on ideas from the Iter­ated Am­plifi­ca­tion ap­proach to AI al­ign­ment. Cur­rently groups at Ought, OpenAI and el­se­where are work­ing on test­ing the em­piri­cal as­sump­tions un­der­ly­ing that the­ory.

Com­pared to these groups, the cur­rent ex­per­i­ment had a more prac­ti­cal, short-term aim—to find a “shovel-ready” method of am­plify­ing gen­er­al­ist re­search, that could be ap­plied to make the EA/​ra­tio­nal­ity com­mu­ni­ties more effec­tive already over the com­ing years.

Nonethe­less, po­ten­tial fol­low-ups from this ex­per­i­ment might provide use­ful the­o­ret­i­cal in­sight in that di­rec­tion.

Ex­plor­ing fore­cast­ing with distributions

Lit­tle is known about do­ing fore­cast­ing with full dis­tri­bu­tions (e.g. “I think this is cap­tured by two nor­mals, with means 5 and 10 and var­i­ance 3”) rather than point es­ti­mates (e.g. “79%”). Be­fore the launch of Fore­told, there wasn’t any soft­ware available for eas­ily run­ning such ex­per­i­ments.

This was a quick way of get­ting data on many ques­tions in dis­tri­bu­tional fore­cast­ing:

  • How good are hu­mans at it?

  • What are the main us­abil­ity challenges?

    • In terms of in­tu­itive scor­ing rules?

    • In terms of in­tu­itive yet pow­er­ful in­put for­mats?

  • What are best practices? (For example, using beta rather than lognormal distributions when forecasting someone else’s prediction, or averaging distributions with a wide uniform to hedge against large losses; see the sketch below.)
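
To illustrate the hedging practice in the last bullet, here is a minimal sketch (with made-up numbers, and using scipy distributions as an assumption about tooling) of how mixing a wide uniform into a confident lognormal bounds the loss when the resolution lands far in the tail:

```python
import numpy as np
from scipy import stats

resolved_value = 40.0                      # suppose the evaluator's answer lands far in the tail

main = stats.lognorm(s=0.3, scale=10)      # a confident lognormal guess centred near 10
uniform = stats.uniform(loc=0, scale=100)  # a wide uniform over the plausible range [0, 100]

def log_score(density: float) -> float:
    return float(np.log(density + 1e-12))

pure_score = log_score(main.pdf(resolved_value))
# A 90/10 mixture with the wide uniform gives up a little sharpness, but the score
# can never fall much below log(0.1 * uniform density), however wrong the main guess is.
hedged_score = log_score(0.9 * main.pdf(resolved_value) + 0.1 * uniform.pdf(resolved_value))

print(f"pure lognormal: {pure_score:.1f}, hedged mixture: {hedged_score:.1f}")
```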

Fore­cast­ing fuzzy things

A major challenge with prediction markets and forecasting tournaments is the need to concretely specify questions, in order to clearly determine who was right and allocate payouts.

Often, this means that these mechanisms are limited to an­swer­ing ques­tions like:

> “What will the high­est perfor­mance of an al­gorith­mic bench­mark x be at time t?”

Even though what we of­ten care about is some­thing more neb­u­lous, like:

> “How close will we be to AGI at time t?”

The up­side of this pre­ci­sion is that it en­ables us to use quan­ti­ta­tive meth­ods to es­ti­mate perfor­mance, com­bine pre­dic­tions, and al­lo­cate re­wards, as de­scribed above.

The current experiment tries to get the best of both worlds: the incentive properties of forecasting tournaments and the flexibility of generalist research in tackling more nebulous questions. The proposed solution is simply to have one or more trusted evaluators who decide on the truth of a question, and then to predict their judgements as opposed to the underlying question [6].

(Pre­vi­ously some of the au­thors set up the AI Fore­cast­ing Re­s­olu­tion Coun­cil to en­able such flex­ible re­s­olu­tion to also be used on AI ques­tions.)

Shoot­ing for un­known unknowns

This is related to the mindset of “prospecting for gold”. To a certain extent, we think we have a potentially reliable inside view, a certain research taste that is worth following and paying attention to, because we are curious about what we might find.

A draw­back with this is that it en­ables prac­tices like p-hack­ing/​pub­li­ca­tion bias if re­sults are re­ported se­lec­tively. To miti­gate this, all data from this ex­per­i­ment is pub­li­cly available here [7].

Challenges

This sec­tion dis­cusses some challenges and limi­ta­tions of the cur­rent ex­plo­ra­tion, as well as our ideas for solv­ing some of them. In par­tic­u­lar, we con­sider:

  • Com­plex­ity and un­fa­mil­iar­ity of ex­per­i­ment. The cur­rent ex­per­i­ment had many tech­ni­cal mov­ing parts. This makes it challeng­ing to un­der­stand for both par­ti­ci­pants and po­ten­tial clients who want to use it in their own or­gani­sa­tions.

  • Trust in evaluations. The extent to which these results are meaningful depends on your trust in Elizabeth Van Nostrand’s ability to evaluate questions. We think this is partly an inescapable problem, but we also expect clever mechanisms and more transparency to enable large improvements.

  • Cor­re­la­tions be­tween pre­dic­tions and eval­u­a­tions. Eliz­a­beth had ac­cess to a filtered ver­sion of fore­caster com­ments when she made her eval­u­a­tions. This in­tro­duces a po­ten­tial source of bias and a “self-fulfilling prophecy” dy­namic in the ex­per­i­ments.

  • Difficulty of con­vert­ing men­tal mod­els into quan­ti­ta­tive dis­tri­bu­tions. It’s hard to turn nu­anced men­tal mod­els into num­bers. We think a solu­tion is to have a “di­vi­sion of la­bor”, where some peo­ple just build mod­els/​write com­ments and oth­ers fo­cus on quan­tify­ing them. We’re work­ing on in­cen­tive schemes that work in this con­text.

  • Anti-correlation between importance and “outsourceability”. The intellectual questions which are most important to answer might be different from the ones that are easiest to outsource, in a way which leaves very little value to be captured by outsourcing.

  • Over­head of ques­tion gen­er­a­tion. Creat­ing good fore­cast­ing ques­tions is hard and time-con­sum­ing, and bet­ter tool­ing is needed to sup­port this.

  • Overly com­pet­i­tive scor­ing rules. Pre­dic­tion mar­kets and tour­na­ments tend to be zero-sum games, with nega­tive in­cen­tives for helping other par­ti­ci­pants or shar­ing best prac­tices. To solve this we’re de­sign­ing and test­ing im­proved scor­ing rules which di­rectly in­cen­tivise col­lab­o­ra­tion.

Com­plex­ity and un­fa­mil­iar­ity of ex­per­i­ment.

The cur­rent ex­per­i­ment has many mov­ing parts and a large in­fer­en­tial dis­tance. For ex­am­ple, in or­der to par­ti­ci­pate, one would need to un­der­stand the math­e­mat­i­cal scor­ing rule, the ques­tion in­put for­mat, the ran­domi­sa­tion of re­solved ques­tions and how ques­tions would be re­solved as dis­tri­bu­tions.

This makes the set-up challenging to understand for both participants and potential clients who want to use similar amplification set-ups in their own organisations.

We don’t think these things are in­her­ently com­pli­cated, but have much work to do on ex­plain­ing the set-up and mak­ing the app gen­er­ally ac­cessible.

Trust in eval­u­a­tions.

The ex­tent to which the re­sults are mean­ingful de­pends on one’s trust in Eliz­a­beth Van Nos­trand’s abil­ity to eval­u­ate ques­tions. We chose Eliz­a­beth for the ex­per­i­ment as she has a rep­u­ta­tion for re­li­able gen­er­al­ist re­search (through her blog se­ries on “Epistemic Spot Checks”), and 10+ pub­lic blog posts with eval­u­a­tions of the ac­cu­racy of books and pa­pers.

How­ever, the challenge is that this trust of­ten re­lies on a long his­tory of in­ter­ac­tions with her ma­te­rial, in a way which might be hard to com­mu­ni­cate to third-par­ties.

For fu­ture ex­per­i­ments, we are con­sid­er­ing sev­eral im­prove­ments here.

First, as hinted at above, we can ask forecasters both for their predictions of Elizabeth’s judgements and for their own personal beliefs. We might then expect that those who can accurately predict Elizabeth and yet disagree with her know something she does not, and so should be weighted more highly in the evaluation of the true claim.

Se­cond, we might have set-ups with mul­ti­ple eval­u­a­tors; or more elab­o­rate ways of scor­ing the eval­u­a­tors them­selves (for ex­am­ple based on their abil­ity to pre­dict what they them­selves will say af­ter more re­search).

Third, we might work to have more trans­par­ent eval­u­a­tion pro­cesses, for ex­am­ple in­clud­ing sys­tem­atic rubrics or de­tailed write-ups of rea­son­ing. We must be care­ful here not to “throw out the baby with the bath­wa­ter”. The pur­pose of us­ing judges is af­ter all to ac­cess sub­jec­tive eval­u­a­tions which can’t be eas­ily cod­ified in con­crete re­s­olu­tion con­di­tions. How­ever, there seems to be room for more trans­parency on the mar­gin.

Cor­re­la­tion be­tween pre­dic­tions and eval­u­a­tions.

Elizabeth had access to a filtered version of forecaster comments when she made her evaluations. Hence the selection process on evidence affecting her judgements was not independent of the selection process on evidence affecting the aggregate. This introduces a potential source of bias and a “self-fulfilling prophecy” dynamic in the experiments.

For future experiments, we’re considering obtaining an objective data-set with clear ground truth and testing the same set-up without revealing the comments to Elizabeth, to get data on how serious this problem is (or is not).

Difficulty of con­vert­ing men­tal mod­els into quan­ti­ta­tive dis­tri­bu­tions.

In or­der to par­ti­ci­pate in the ex­per­i­ment, a fore­caster has to turn their men­tal mod­els (rep­re­sented in whichever way the hu­man brain rep­re­sents mod­els) into quan­ti­ta­tive dis­tri­bu­tions (which is a for­mat quite un­like that na­tive to our brains), as shown in the fol­low­ing di­a­gram:

Each step in this chain is quite challeng­ing, re­quires much prac­tice to mas­ter, and can re­sult in a loss of in­for­ma­tion.

Moreover, we are uncertain how the difficulty of this process differs across questions of varying importance. It might be that some of the most important considerations in a domain tend to be confusion-shaped (e.g. “What does it even mean to be aligned under self-improvement when you can’t reliably reason about systems smarter than yourself?”), or very open-ended (e.g. “What new ideas could reliably improve the long-term future?” rather than “How much will saving in index funds benefit future philanthropists?”). Hence filtering for questions that are more easily quantified might select against questions that are more important.

Consider some solutions. For the domains where quantification seems more promising, it seems at least plausible that some kind of “division of labor” between model-building and quantification should be possible.

For fu­ture ex­per­i­ments, we’re look­ing to bet­ter sep­a­rate “in­for­ma­tion con­tri­bu­tion” and “nu­mer­i­cal con­tri­bu­tion”, and find ways of re­ward­ing both. Some par­ti­ci­pants might spe­cial­ise in re­search or model-gen­er­a­tion, and oth­ers in turn­ing that re­search into dis­tri­bu­tions.

A challenge here is to ap­pro­pri­ately re­ward users who only sub­mit com­ments but do not sub­mit pre­dic­tions. Since one of the core ad­van­tages of fore­cast­ing tour­na­ments is that they al­low us to pre­cisely and quan­ti­ta­tively mea­sure perfor­mance, it seems plau­si­ble that any solu­tion should try to make use of this fact. (As op­posed to, say, us­ing an in­de­pen­dent up- and down­vot­ing scheme.) As ex­am­ple mechanisms, one might ran­domly show a com­ment to half the users, and re­ward a com­ment based on the perfor­mance of the ag­gre­gate for users who’ve seen it and users who haven’t. Or one might re­lease the com­ments to fore­cast­ers se­quen­tially, and see how much each im­proves the ag­gre­gate. Or one might sim­ply al­low users to vote, but weigh the votes of users with a bet­ter track-record higher.
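
As a sketch of the first mechanism (random exposure), with hypothetical scores:

```python
import statistics

# Hypothetical log scores on one question, split by whether each forecaster
# was randomly shown a particular comment before predicting.
scores_saw_comment = [-0.4, -0.6, -0.3, -0.5]
scores_did_not_see = [-1.1, -0.9, -1.4, -0.8]

# Credit the comment in proportion to how much it improved the average score
# of the forecasters who happened to see it.
comment_credit = statistics.mean(scores_saw_comment) - statistics.mean(scores_did_not_see)
print(f"estimated score improvement attributable to the comment: {comment_credit:.2f}")
```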

More­over, in fu­ture ex­per­i­ments with Eliz­a­beth we’ll want to pair her up with a “dis­tri­bu­tion buddy”, whose task is to in­ter­view her to figure out in de­tail what dis­tri­bu­tion best cap­tures her be­liefs, al­low­ing Eliz­a­beth to fo­cus sim­ply on build­ing con­cep­tual mod­els.

Anti-cor­re­la­tion be­tween im­por­tance and “out­source­abil­ity”

Above we mentioned that the questions easiest to quantify might be anti-correlated with the ones that are most important. It is also plausible that the questions which are easiest to outsource to forecasters are not the same as those on which it is most important to reduce uncertainty. Depending on the shape of these distributions, the experiment might not capture a lot of value. (For illustration, consider an overly extreme example: suppose a venture capitalist tries to amplify their startup investments. The crowd always predicts “no investment”, and turns out to be right in 99 out of 100 cases: the VC doesn’t invest. However, the returns from the one case where the crowd fails and the VC actually would have invested by far dominate the portfolio.)

Over­head of ques­tion gen­er­a­tion.

The act of creating good, forecastable questions is an art in and of itself. If the questions are written by the same person or small team that will eventually forecast them, one can rely on much shared context and intuition in interpreting them. However, scaling these systems to many participants requires additional work to specify the questions sufficiently clearly. This overhead might be very costly, especially since we think one of the key factors determining the usefulness of a forecasting question is the question itself: how well does it capture something we care about? From experience, writing these questions is hard, and in future we have much work to do to make this process easier.

A scor­ing rule that dis­cour­ages collaboration

Par­ti­ci­pants were scored based on how much they out­performed the ag­gre­gate pre­dic­tion. This scor­ing ap­proach is similar to the de­fault in pre­dic­tion mar­kets and ma­jor fore­cast­ing tour­na­ments. It has the prob­lem that shar­ing any in­for­ma­tion via com­ment­ing will harm your score (since it will make the perfor­mance of other users, and hence the ag­gre­gate, bet­ter). What’s more, all else re­main­ing the same, do­ing any­thing that helps other users will be worse for your score (such as shar­ing tips and tricks for mak­ing bet­ter pre­dic­tions, or point­ing out eas­ily fix­able mis­takes so they can learn from them).
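
For intuition, here is a minimal sketch of such a relative scoring rule, simplified to a binary question with base-2 log scores (the actual rule on Foretold scored full distributions, so treat this only as an illustration):

```python
import math

def relative_log_score(your_prob: float, aggregate_prob: float, outcome: bool) -> float:
    """Score relative to the aggregate: log2 p_you(outcome) - log2 p_aggregate(outcome)."""
    p_you = your_prob if outcome else 1 - your_prob
    p_agg = aggregate_prob if outcome else 1 - aggregate_prob
    return math.log2(p_you) - math.log2(p_agg)

# You are at 80%, the aggregate at 60%, and the event happens.
print(relative_log_score(0.8, 0.60, True))  # ~0.42: you beat the aggregate

# If sharing your reasoning moves the aggregate to 75%, the same belief earns less.
print(relative_log_score(0.8, 0.75, True))  # ~0.09: your edge has shrunk
```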

There are several problems with this approach and how it disincentivises collaboration.

First, it pro­vides an awk­ward change in in­cen­tives for groups who oth­er­wise have reg­u­lar friendly in­ter­ac­tions (such as a team at a com­pany, a uni­ver­sity fac­ulty, or mem­bers of the effec­tive al­tru­ism com­mu­nity).

Second, it causes effort to be wasted, as participants must derive the same key insights individually, utilising little division of labor (any sharing of information will just end up hurting their score on the margin). Having some amount of duplication of work and thinking can of course make the system robust against mistakes, but we think the optimal amount is far less than the equilibrium under the current scoring rule.

In spite of these theoretical incentives, it is interesting to note that several participants actually ended up writing detailed comments (though these were basically aimed only at explaining their own reasoning; we observed no collaboration or back-and-forth between participants). This might have been because they knew Elizabeth would see those comments, or for some other reason.

Nonethe­less, we are work­ing on mod­ify­ing our scor­ing rule in a way which di­rectly in­cen­tivises par­ti­ci­pants to col­lab­o­rate, and ac­tively re­wards helping other users im­prove their mod­els. We hope to re­lease de­tails of for­mal mod­els and prac­ti­cal ex­per­i­ments in the com­ing month.

Footnotes

[1] Ex­am­ples in­clude: AI al­ign­ment, global co­or­di­na­tion, macros­trat­egy and cause pri­ori­ti­sa­tion.

[2] We chose the in­dus­trial rev­olu­tion as a theme since it seems like a his­tor­i­cal pe­riod with many les­sons for im­prov­ing the world. It was a time of rad­i­cal change in pro­duc­tivity along with many so­cietal trans­for­ma­tions, and might hold les­sons for fu­ture trans­for­ma­tions and our abil­ity to in­fluence those.

[3] For ex­am­ple by av­er­ag­ing pre­dic­tions and then weigh­ing by past track-record and time un­til re­s­olu­tion, as done in the Good Judge­ment Pro­ject (among other things).

[4] Some ex­am­ples of nitty-gritty de­tails we no­ticed while do­ing this are:

  • Pay­offs were too small/​the scor­ing scheme too harsh

  • Copying the aggregate into your own distribution and then editing it a little was natural, so we added syntax support for writing =multimodal(AG, your prediction)

  • Aver­ag­ing with a uniform would have im­proved pre­dic­tions.

  • The marginal value of each ad­di­tional pre­dic­tion was low af­ter the be­gin­ning.

  • Fore­cast­ers were mostly mo­ti­vated by what ques­tions were in­ter­est­ing, fol­lowed by what would give them a higher pay­out, and less by what would be most valuable to the ex­per­i­menters.

[5] For a some­what tan­gen­tial, but po­ten­tially in­ter­est­ing, per­spec­tive, see Feyn­man on mak­ing ex­per­i­ments to figure out nitty-gritty de­tails in or­der to en­able other ex­per­i­ments to hap­pen (search for “rats” in the link).

[6] A fur­ther di­rec­tion we’re con­sid­er­ing is to al­low fore­cast­ers to both pre­dict the judge­ments of eval­u­a­tors and the un­der­ly­ing truth. We might then ex­pect that those pre­dic­tors who both ac­cu­rately fore­cast the judge­ment of the eval­u­a­tor and dis­agree in their own judge­ments, might provide valuable clues about the truth.

[7] For the record, before this experiment we ran two similar, smaller experiments (to catch easy mistakes and learn more about the set-up), with about an order of magnitude less total forecasting effort invested. The aggregate from these experiments was quite poor at predicting the evaluations. The data from those experiments can be found here, and more details in Elizabeth’s write-ups here and here.

Par­ti­ci­pate in fu­ture ex­per­i­ments or run your own

Fore­told.io was built as an open plat­form to en­able more ex­per­i­men­ta­tion with pre­dic­tion-re­lated ideas. We have also made data and anal­y­sis calcu­la­tions from this ex­per­i­ment pub­li­cly available.

If you’d like to:

  • Run your own ex­per­i­ments on other questions

  • Do ad­di­tional anal­y­sis on this ex­per­i­men­tal data

  • Use an am­plifi­ca­tion set-up within your organisation

We’d be happy to con­sider pro­vid­ing ad­vice, op­er­a­tional sup­port, and fund­ing for fore­cast­ers. Just com­ment here or reach out to this email.

If you’d like to participate as a forecaster in future prediction experiments, you can sign up here.

Acknowledgements

Fund­ing for this pro­ject was pro­vided by the Berkeley Ex­is­ten­tial Risk Ini­ti­a­tive and the EA Long-term Fu­ture Fund.

We thank Beth Barnes and Owain Evans for helpful dis­cus­sion.

We are also very thank­ful to all the par­ti­ci­pants.