AI Forecasting Dictionary (Forecasting infrastructure, part 1)

This post introduces the AI Forecasting Dictionary, an open-source set of standards and conventions for precisely interpreting AI and auxiliary terms. It is the first part in a series of blog posts which motivate and introduce pieces of infrastructure intended to improve our ability to forecast novel and uncertain domains like AI.

The Dictionary is currently in beta, and we’re launching early to get feedback from the community and quickly figure out how useful it is.

Background and motivation

1) Operationalisation is an unsolved problem in forecasting

A key challenge in (AI) forecasting is writing good questions. This is tricky because we want questions that both capture important uncertainties and are sufficiently concrete that we can resolve them and award points to forecasters in hindsight. For example:

Will there be a slow take-off?

is a question that’s important yet too vague.

Will there be a 4-year doubling of world output before the first 1-year doubling of world output?

is both important and concrete, yet sufficiently far out that it’s unclear whether standard forecasting practices will be helpful in resolving it.

Will there be a StarCraft II agent by the end of 2020 which is at least as powerful as AlphaStar, yet uses less than $10,000 of publicly available compute?

is more amenable to standard forecasting practices, but at the cost of being only tangentially related to the high-level uncertainty we initially cared about. And so on.

Currently, forecasting projects reinvent this wheel of operationalisation all the time. There are usually idiosyncratic and time-consuming processes of writing and testing questions (this might take many hours for a single question) [1], and best practices tend to evolve organically without being systematically recorded and built upon [2].

2) The future is big, and forecasting it might require answering a lot of questions

This is an empirical claim which we’ve become more confident in by working in this space over the last year.

One way of seeing this is by attempting to break down an important high-level question into pieces. Suppose we want to get a handle on AI progress by investigating key inputs. We might branch those into progress on hardware, software, and data (including simulations). We might then branch hardware into economics and algorithmic parallelizability. To understand the economics, we must branch it into the supply and demand side, and we must then branch each of those to understand how they interface with regulation and innovation. This involves thousands of actors across academia, industry and government, and hundreds of different metrics for tracking progress of various kinds. And we’ve only done a brief depth-first search on one of the branches of the hardware-software-data tree, which in turn is just one way of approaching the AI forecasting problem.

Another way of guesstimating this: the AI Impacts archive contains roughly 140 articles. Suppose this is 10% of the number of articles they’d need to accomplish their mission. If each article contains 1-30 uncertain claims that we’d ideally like to gather estimates on, that’s 1,400 to 42,000 uncertainties, each of which would admit many different ways of being sufficiently operationalised. For reference, over the 4 years of the Good Judgment Project, roughly 500 questions were answered.
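The arithmetic behind this guesstimate can be reproduced in a few lines (a back-of-the-envelope sketch using only the figures quoted above):

```python
# Rough reproduction of the guesstimate above.
articles = 140                    # approximate size of the AI Impacts archive
claims_low, claims_high = 1, 30   # uncertain claims per article

# The archive is assumed to be 10% of the articles needed, so scale by 10.
total_articles = articles * 10

low = total_articles * claims_low     # lower bound on uncertainties
high = total_articles * claims_high   # upper bound on uncertainties
print(low, high)  # → 1400 42000
```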

We’d of course be able to prune this space by focusing on the most important questions. Nonetheless, there is a plausible case that scaling our ability to answer many questions is important if we want our forecasting efforts to succeed.

We see some evidence of this from the SciCast project, a prediction tournament on science and technology that ran from 2013 to 2015. The tournament organizers note the importance of scaling question generation through templates and the creation of a style guide. (See the 2015 annual report, p. 86.)

3) So in order to forecast AI we must achieve economies of scale, making it cheap to write and answer the marginal question by efficiently reusing work across questions.

AI Forecasting Dictionary

As a piece of the puzzle to solve the above problems, we made the AI Forecasting Dictionary. It is an open-source set of standards and conventions for precisely interpreting AI and auxiliary terms.

Here’s an example entry:

See also: Job
A job is automatable at a time t if a machine can outperform the median-skilled employee, with 6 months of training or less, at 10,000x the cost of the median employee or less. Unless otherwise specified, the date of automation will be taken to be the first time this threshold is crossed.
  • As of 2019, Elevator Operator
  • As of 2019, Ambulance Driver
  • As of 2019, Epidemiologist

(This definition is based on Luke Muehlhauser’s here.)

There are several mechanisms by which building a dictionary helps solve the problems outlined above.

Less overhead for writing and forecasting questions

The dictionary reduces overhead in two ways: writers don’t have to reinvent the wheel whenever they operationalise a new thought, and forecasters avoid the drag of constantly interpreting and understanding new resolution conditions. This makes it cheaper both to generate and to answer the marginal question.

A platform for spreading high initial costs over many future use cases

There are a number of common pitfalls that can make a seemingly valid question ambiguous or misleading. For example, positively resolving the question:

Will an AI lab have been na­tion­al­ized by 2024?

by the US government nationalising GM in response to a financial crisis, with GM nonetheless happening to have a self-driving-car research division.

Or forecasting:

When will there be a superhuman Angry Birds agent using no hardcoded knowledge?

and realizing that there seems to be little active interest in the yearly benchmark competition (with performance even declining over the years). This means the probability depends entirely on whether anyone with enough money and competence decides to work on the problem, rather than on which key components make Angry Birds difficult (e.g. physics-based simulation and planning) and how fast progress is in those domains.

Carefully avoiding such pitfalls comes with a high initial cost when writing a question. We can make that cost worth it by ensuring it is amortized across many future questions, and broadly used and built upon. A Dictionary is a piece of infrastructure that provides a standardised way of doing this. If someone spends a lot of time figuring out how to deal with a tricky edge case or a “spurious resolution”, there is now a Schelling point where they can store that work and expect future users to read it (and where future users can expect them to have stored it).

Version management

When resolving and scoring quantitative forecasting questions, it’s important to know exactly what question the forecaster was answering. This need for precision often conflicts with the need to improve the resolution conditions of questions as we learn and stress-test them over time. For the Dictionary, we can borrow best practices from software version management to help solve this problem. As of this writing, the Dictionary is still in beta, with the latest release being v0.3.0.
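To sketch what borrowing from software versioning could look like: release tags such as v0.3.0 can be parsed into ordinary semantic-versioning tuples and compared, so a resolver always knows which release a forecast was made against. The helper below is illustrative, not part of the Dictionary itself:

```python
import re

def parse_version(tag: str) -> tuple:
    """Parse a dictionary release tag like 'v0.3.0' into a comparable tuple."""
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    if m is None:
        raise ValueError(f"not a valid version tag: {tag!r}")
    return tuple(int(part) for part in m.groups())

# Tuples compare lexicographically, so ordering releases is straightforward:
# a question resolved under v0.3.0 should not be silently rescored under
# definitions from a later (possibly incompatible) release.
print(parse_version("v0.3.0"))                            # → (0, 3, 0)
print(parse_version("v0.3.0") < parse_version("v1.0.0"))  # → True
```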

Open-source serendipity

The Dictionary might be useful not just for forecasting, but also in other contexts where precisely defined AI terms are important. We open-sourced it in order to allow people to experiment with such use cases. If you do so in a substantial way, please let us know.

How to use the dictionary

If you use the Dictionary for forecasting purposes, please reference it to help establish it as a standard of interpretation.

One way of doing this is by appending the tag [ai-dict-vX.Y.Z] at the end of the relevant string. For example:

I predict that image classification will be made robust against unrestricted adversarial examples by 2023. [ai-dict-v2]


Will there be a superhuman StarCraft agent trained using less than $10,000 of publicly available compute by 2025? [ai-dict-v0.4]

In some cases you might want to tweak or change the definition of a term to match a particular use case, thereby departing from the Dictionary convention. If so, you SHOULD mark the terms receiving a non-standard interpretation with the “^” symbol. For example:

I expect unsupervised language models to be human-level^ by 2024. [ai-dict-v1.3]
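A question resolver could then mechanically extract both the version tag and any “^”-marked terms from a question string. A minimal sketch (the function name and regular expressions here are illustrative assumptions, not part of the standard):

```python
import re

TAG_RE = re.compile(r"\[ai-dict-v([\w.]+)\]\s*$")  # trailing [ai-dict-vX.Y.Z] tag
CARET_RE = re.compile(r"(\w[\w-]*)\^")             # terms marked non-standard with ^

def parse_question(text: str):
    """Return (dictionary version or None, list of non-standard terms)."""
    tag = TAG_RE.search(text)
    version = tag.group(1) if tag else None
    return version, CARET_RE.findall(text)

q = "I expect unsupervised language models to be human-level^ by 2024. [ai-dict-v1.3]"
print(parse_question(q))  # → ('1.3', ['human-level'])
```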

You might also want to add the following notice:

“For purposes of resolution, these terms are interpreted in accordance with the Technical AI Forecasting Resolution Dictionary vX.Y.Z, available at [link]. Any term whose interpretation deliberately departs from this standard has been marked with a ^.”

How to contribute to the dictionary

The AI Forecasting Dictionary is open-source, and you can contribute by making pull requests on our GitHub or suggestions in the Google Doc version (more details here). We especially welcome:

  • Attempts to introduce novel definitions that capture important terms in AI (current examples include: “module”, “transformative AI” and “compute (training)”)

  • Examples of forecasting questions which you wrote and which ended up solving or making progress on some tricky piece of operationalisation, such that others can build on that progress


[1] Some people might be compelled by an analogy to mathematics here: most of the work often lies in setting up the right formalism and problem formulation rather than in the actual proof (for example, Nash’s original fixed-point theorems in game theory aren’t that difficult once the set-up is in place, but realising why and how this kind of set-up was applicable to a large class of important problems was highly non-trivial).

[2] English Common Law is a clear example of how definitions and policies evolve over time to crystallize judgements and wisdom.
