Forecasting Newsletter: September 2020.



  • Highlights

  • Prediction Markets & Forecasting Platforms

  • In The News

  • Hard To Categorize

  • Long Content

Sign up here or browse past newsletters here.

Prediction Markets & Forecasting Platforms

Metaculus updated their track record page. You can now look at accuracy across time, at the distribution of Brier scores, and at a calibration graph. They also have a new black swan question: When will US Metaculus users face an emigration crisis?
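For readers new to the metric: the Brier score of a binary forecast is the squared difference between the stated probability and the outcome (1 if the event happened, 0 if not), so lower is better and always answering 50% scores 0.25. A minimal sketch; the forecasts are made up for illustration and are not Metaculus data:

```python
def brier_score(probability, outcome):
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (probability - outcome) ** 2

def mean_brier(forecasts):
    """Average Brier score over (probability, outcome) pairs."""
    return sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)

# Hypothetical forecasts: (stated probability, what happened).
example = [(0.9, 1), (0.7, 1), (0.2, 0), (0.6, 0)]
print(round(mean_brier(example), 4))
```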

Good Judgment Open has a thread in which forecasters share and discuss tips, tricks and experiences. An account is needed to browse it.

Augur modifications in response to higher ETH prices. Some unfiltered comments on Reddit.

An overview of PlotX, a new decentralized prediction protocol/​marketplace. PlotX focuses on non-subjective markets that can be programmatically determined, like the exchange rate between currencies or tokens.

A Replication Markets participant wrote What’s Wrong with Social Science and How to Fix It: Reflections After Reading 2578 Papers. See also: An old long-form introduction to Replication Markets.

Georgetown’s CSET is attempting to use forecasting to influence policy. A seminar discussing their approach, Using Crowd Forecasting to Inform Policy with Jason Matheny, is scheduled for the 19th of October. But their current forecasting tournament, Foretell, isn’t yet very well populated, and the aggregate isn’t that good because participants don’t update all that often, which sometimes leaves the aggregates clearly outdated. Perhaps because of this relative lack of competition, my team is in 2nd place at the time of this writing (with myself at #6, Eli Lifland at #12 and Misha Yagudin at #21). You can join Foretell here.

There is a new contest on Hypermind, The Long Fork Project, which aims to predict the impact of a Trump or a Biden victory in November, with $20k in prize money. H/​t to user ChickCounterfly.

The University of Chicago’s Effective Altruism group is hosting a forecasting tournament between all interested EA college groups starting October 12th, 2020. More details here.

In the News

News media sensationalizes essentially random fluctuations on US election odds caused by big bettors entering prediction markets such as Betfair, where bets on the order of $50k can visibly alter the market price. Simultaneously, polls/​models and prediction market odds have diverged, because a substantial fraction of bettors lend credence to the thesis that polls will be biased as in the previous elections, even though polling firms seem to have improved their methods.

Red Cross and Red Crescent societies have been trying out forecast-based financing. The idea is to create forecasts and early warning indicators for some negative outcome, such as a flood, using weather forecasts, satellite imagery, climate models, etc., and then to release funds automatically if the forecast reaches a given threshold. This allows the funds to be put to work before the disaster happens, in a more automatic, fast and efficient manner. Goals and modus operandi might resonate with the Effective Altruism community:

“In the precious window of time between a forecast and a potential disaster, FbF releases resources to take early action. Ultimately, we hope this early action will be more effective at reducing suffering, compared to waiting until the disaster happens and then doing only disaster response. For example, in Bangladesh, people who received a forecast-based cash transfer were less malnourished during a flood in 2017.” (bold not mine)
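The release-on-threshold mechanism can be sketched in a few lines. This is only an illustration of the idea: the `Trigger` class, the 60% threshold, the amount and the daily forecasts are all hypothetical, not taken from any actual Red Cross protocol.

```python
from dataclasses import dataclass

@dataclass
class Trigger:
    """A pre-agreed rule: release `amount` once the forecast reaches `threshold`."""
    hazard: str
    threshold: float  # forecast probability at which funds are released
    amount: int       # funds released, in arbitrary currency units

def funds_to_release(trigger, forecast_probability):
    """Return the amount to release now, or 0 if the threshold is not met."""
    if forecast_probability >= trigger.threshold:
        return trigger.amount
    return 0

# Hypothetical rule and forecasts, for illustration only.
flood_trigger = Trigger(hazard="flood", threshold=0.6, amount=100_000)
for day, p in enumerate([0.2, 0.45, 0.7], start=1):
    print(day, p, funds_to_release(flood_trigger, p))
```

The point of fixing the rule in advance is that nobody has to deliberate once the forecast arrives; the money moves as soon as the number crosses the line.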

Prediction Markets’ Time Has Come, but They Aren’t Ready for It. Prediction markets could have been useful for predicting the spread of the pandemic (see:, or for informing presidential election consequences (see: Hypermind above), but their relatively small size makes them less informative. Blockchain-based prediction technologies, like Augur, Gnosis or Omen, could have helped bypass US regulatory hurdles (which ban many kinds of gambling), but the recent increase in transaction fees means that “everything below a $1,000 bet is basically economically unfeasible”.

Floods in India and Bangladesh:

The many tribes of 2020 election worriers: An ethnographic report by the Washington Post.

Electricity time series demand and supply forecasting startup raises $8 million. I keep seeing this kind of announcement; doing forecasting well in an underforecasted domain seems to be somewhat profitable right now, and it’s not like there is an absence of domains to which forecasting can be applied. This might be a good idea for an earning-to-give startup.

NSF and NASA partner to address space weather research and forecasting. Together, NSF and NASA are investing over $17 million into six, three-year awards, each of which contributes to key research that can expand the nation’s space weather prediction capabilities.

In its monthly report, OPEC said it expects the pandemic to reduce demand by 9.5 million barrels a day, a fall of 9.5% from last year, reports the Wall Street Journal.

Some criticism of Gnosis, a decentralized prediction markets startup, by early investors who want to cash out. Here is a blog post by said early investors; they claim that “Gnosis took out what was in effect a 3+ year interest-free loan from token holders and failed to deliver the products laid out in its fundraising whitepaper, quintupled the size of its balance sheet due simply to positive price fluctuations in ETH, and then launched products that accrue value only to Gnosis management.”

What a study of video games can tell us about being better decision makers ($), a frustratingly well-paywalled, yet exhaustive, complete and informative overview of the IARPA’s FOCUS tournament:

To study what makes someone good at thinking about counterfactuals, the intelligence community decided to study the ability to forecast the outcomes of simulations. A simulation is a computer program that can be run again and again, under different conditions: essentially, rerunning history. In a simulated world, the researchers could know the effect a particular decision or intervention would have. They would show teams of analysts the outcome of one run of the simulation and then ask them to predict what would have happened if some key variable had been changed.

Negative Examples

Why Donald Trump Isn’t A Real Candidate, In One Chart, wrote 538 in 2015.

For this reason alone, Trump has a better chance of cameoing in another “Home Alone” movie with Macaulay Culkin — or playing in the NBA Finals — than winning the Republican nomination.

Travel CFOs Hesitant on Forecasts as Pandemic Fogs Outlook, reports the Wall Street Journal.

“We’re basically prevented from saying the word ‘forecast’ right now because whatever we forecast…it’s wrong,” said Shannon Okinaka, chief financial officer at Hawaiian Airlines. “So we’ve started to use the word ‘planning scenarios’ or ‘planning assumptions.’”

Long Content

Andrew Gelman et al. release Information, incentives, and goals in election forecasts.

  • Neither The Economist’s model nor 538’s are fully Bayesian. In particular, they are not martingales, that is, their current probability is not the expected value of their future probability.

    campaign polls are more stable than ever before, and even the relatively small swings that do appear can largely be attributed to differential nonresponse

    Regarding predictions for 2020, the creator of the Fivethirtyeight forecast writes, “we think it’s appropriate to make fairly conservative choices especially when it comes to the tails of your distributions. Historically this has led 538 to well-calibrated forecasts (our 20%s really mean 20%)” (Silver, 2020b). But conservative prediction can produce a too-wide interval, one that plays it safe by including extra uncertainty. In other words, conservative forecasts should lead to underconfidence: intervals whose coverage is greater than advertised. And, indeed, according to the calibration plot shown by Boice and Wezerek (2019) of Fivethirtyeight’s political forecasts, in this domain 20% for them really means 14%, and 80% really means 88%.
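The martingale property says that today's probability should equal the expected value of tomorrow's probability. A toy sketch of what that looks like, assuming an "election" decided by the majority of 21 fair coin flips (a made-up model for illustration, not The Economist's or 538's): the exact win probability after any number of flips equals the average of the two probabilities it could move to next.

```python
from math import comb

def win_probability(heads_so_far, flips_so_far, total_flips):
    """P(heads end up a strict majority), given the flips seen so far."""
    remaining = total_flips - flips_so_far
    needed = total_flips // 2 + 1 - heads_so_far  # further heads required
    ways = sum(comb(remaining, k) for k in range(max(needed, 0), remaining + 1))
    return ways / 2 ** remaining

def expected_next_probability(heads_so_far, flips_so_far, total_flips):
    """Average of the next probability over the two equally likely next flips."""
    up = win_probability(heads_so_far + 1, flips_so_far + 1, total_flips)
    down = win_probability(heads_so_far, flips_so_far + 1, total_flips)
    return 0.5 * up + 0.5 * down

today = win_probability(6, 10, 21)
tomorrow = expected_next_probability(6, 10, 21)
print(today, tomorrow)  # the two values agree
```

A forecast that predictably drifts in one direction as election day approaches fails this check, since anyone could anticipate where the number is heading.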

The Literary Digest Poll of 1936. A poll so bad that it destroyed the magazine.

  • Compare the Literary Digest and Gallup polls of 1936 with The New York Times’s model of 2016 and 538’s 2016 forecast, respectively.

    In retrospect, the polling techniques employed by the magazine were to blame. Although it had polled ten million individuals (of whom 2.27 million responded, an astronomical total for any opinion poll), it had surveyed its own readers first, a group with disposable incomes well above the national average of the time (shown in part by their ability to afford a magazine subscription during the depths of the Great Depression), and then two other readily available lists, those of registered automobile owners and of telephone users, both of which were also wealthier than the average American at the time.

    Research published in 1972 and 1988 concluded that as expected this sampling bias was a factor, but non-response bias was the primary source of the error—that is, people who disliked Roosevelt had strong feelings and were more willing to take the time to mail back a response.

    George Gallup’s American Institute of Public Opinion achieved national recognition by correctly predicting the result of the 1936 election, while Gallup also correctly predicted the (quite different) results of the Literary Digest poll to within 1.1%, using a much smaller sample size of just 50,000. Gallup’s final poll before the election also predicted Roosevelt would receive 56% of the popular vote: the official tally gave Roosevelt 60.8%.

    This debacle led to a considerable refinement of public opinion polling techniques, and later came to be regarded as ushering in the era of modern scientific public opinion research.
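Why a huge sample did not save the Literary Digest can be seen in a small simulation. This is a sketch under invented assumptions: opponents of the candidate are taken to be twice as likely to mail back a reply (the 20% and 40% response rates are hypothetical), and only non-response bias is modelled, not the skewed mailing lists.

```python
import random

def run_poll(rng, n_contacted, true_support, response_rate_for, response_rate_against):
    """Contact n people at random; each replies with a preference-dependent rate.

    Returns the share of replies that favour the candidate.
    """
    replies_for = replies_against = 0
    for _ in range(n_contacted):
        supports = rng.random() < true_support
        rate = response_rate_for if supports else response_rate_against
        if rng.random() < rate:
            if supports:
                replies_for += 1
            else:
                replies_against += 1
    return replies_for / (replies_for + replies_against)

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SUPPORT = 0.61      # roughly Roosevelt's 1936 share of the vote

# Huge poll, but opponents are assumed twice as likely to mail back a reply.
big_biased = run_poll(rng, 200_000, TRUE_SUPPORT, 0.2, 0.4)
# Small poll in which everyone contacted answers.
small_unbiased = run_poll(rng, 2_000, TRUE_SUPPORT, 1.0, 1.0)

print(f"large biased poll:    {big_biased:.3f}")
print(f"small unbiased poll:  {small_unbiased:.3f}")
```

More replies shrink the random error but leave the bias untouched, so the large poll settles confidently on the wrong answer while the small one lands near the truth.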

Feynman in 1985, answering questions about whether machines will ever be more intelligent than humans.

Why Most Published Research Findings Are False, back from 2005. The abstract reads:

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.
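The core of the paper's framework, leaving bias and multiple teams aside, is a one-line formula for the probability that a statistically significant finding is true: PPV = (1 − β)R / (R − βR + α), where R is the ratio of true to no relationships among those tested, 1 − β is the study's power and α is the false-positive rate. A minimal sketch; the two example fields are hypothetical:

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Probability that a statistically significant finding is true.

    prior_odds: ratio R of true to no relationships among those tested.
    power:      1 - beta, the chance of detecting a true relationship.
    alpha:      false-positive rate for relationships that do not exist.
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)

# Hypothetical fields, for illustration only.
print(positive_predictive_value(prior_odds=1.0, power=0.8))  # well-powered, even odds
print(positive_predictive_value(prior_odds=0.1, power=0.2))  # underpowered, long-shot hypotheses
```

In this simplified form, a finding is more likely true than false only when power times R exceeds α, which is why small studies of unlikely hypotheses fare so badly.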

Reference class forecasting. Reference class forecasting or comparison class forecasting is a method of predicting the future by looking at similar past situations and their outcomes. The theories behind reference class forecasting were developed by Daniel Kahneman and Amos Tversky. The theoretical work helped Kahneman win the Nobel Prize in Economics. Reference class forecasting is so named as it predicts the outcome of a planned action based on actual outcomes in a reference class of similar actions to that being forecast.

Reference class problem

In statistics, the reference class problem is the problem of deciding what class to use when calculating the probability applicable to a particular case. For example, to estimate the probability of an aircraft crashing, we could refer to the frequency of crashes among various different sets of aircraft: all aircraft, this make of aircraft, aircraft flown by this company in the last ten years, etc. In this example, the aircraft for which we wish to calculate the probability of a crash is a member of many different classes, in which the frequency of crashes differs. It is not obvious which class we should refer to for this aircraft. In general, any case is a member of very many classes among which the frequency of the attribute of interest differs. The reference class problem discusses which class is the most appropriate to use.

  • See also some thoughts on this here.
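The aircraft example can be made concrete. In the sketch below the flight and crash tallies are invented purely for illustration; the point is only that the same aircraft gets a different base rate depending on which class it is placed in.

```python
# Hypothetical tallies: (flights, crashes) for three reference classes
# that all contain the aircraft we care about.
reference_classes = {
    "all aircraft": (10_000_000, 20),
    "this make of aircraft": (500_000, 3),
    "this company, last ten years": (40_000, 0),
}

def base_rate(flights, crashes):
    """Observed crash frequency within one reference class."""
    return crashes / flights

rates = {name: base_rate(*tally) for name, tally in reference_classes.items()}
for name, rate in rates.items():
    print(f"{name:30s} {rate:.1e} crashes per flight")
```

Narrower classes are more similar to the case at hand but contain less data, which is the trade-off that makes the choice of class hard.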

The Base Rate Book by Credit Suisse.

This book is the first comprehensive repository for base rates of corporate results. It examines sales growth, gross profitability, operating leverage, operating profit margin, earnings growth, and cash flow return on investment. It also examines stocks that have declined or risen sharply and their subsequent price performance. We show how to thoughtfully combine the inside and outside views. The analysis provides insight into the rate of regression toward the mean and the mean to which results regress.

Hard To Categorize

Improving decisions with market information: an experiment on corporate prediction markets (sci-hub; archive link)

We conduct a lab experiment to investigate an important corporate prediction market setting: A manager needs information about the state of a project, which workers have, in order to make a state-dependent decision. Workers can potentially reveal this information by trading in a corporate prediction market. We test two different market designs to determine which provides more information to the manager and leads to better decisions. We also investigate the effect of top-down advice from the market designer to participants on how the prediction market is intended to function. Our results show that the theoretically superior market design performs worse in the lab—in terms of manager decisions—without top-down advice. With advice, manager decisions improve and both market designs perform similarly well, although the theoretically superior market design features less mis-pricing. We provide a behavioral explanation for the failure of the theoretical predictions and discuss implications for corporate prediction markets in the field.

The nonprofit Ought organized a forecasting thread on existential risk, where participants display and discuss their probability distributions for existential risk, and outline some reflections on a previous forecasting thread on AI timelines.

A draft report on AI timelines, summarized in the comments.

Gregory Lewis has a series of posts related to forecasting and uncertainty:

Estimation of probabilities to get tenure track in academia: baseline and publications during the PhD.

How to think about an uncertain future: lessons from other sectors & mistakes of longtermist EAs. The central thesis is:

Expected value calculations, the favoured approach for EA decision making, are all well and good for comparing evidence backed global health charities, but they are often the wrong tool for dealing with situations of high uncertainty, the domain of EA longtermism.

Discussion by a PredictIt bettor on how he made money by following Nate Silver’s predictions, from r/​TheMotte.

Also on r/​TheMotte, on the promises and deficiencies of prediction markets:

Prediction markets will never be able to predict the unpredictable. Their promise is to be better than all of the available alternatives, by incorporating all available information sources, weighted by experts who are motivated by financial returns.

So, you’ll never have a perfect prediction of who will win the presidential election, but a good prediction market could provide the best possible guess of who will win the presidential election.

To reach that potential, you’d need to clear away the red tape. It would need to be legal to make bets on the market, fees for making transactions need to be low, participants would need faith in the bet adjudication process, and there can’t be limits to the amount you can bet. Signs that you’d succeeded would include sophisticated investors making large bets with a narrow bid/ask spread.

Unfortunately prediction markets are nowhere close to that ideal today; they’re at most “barely legal,” bet sizes are limited, transaction fees are high, getting money in or out is clumsy and sketchy, trading volumes are pretty low, and you don’t see any hedge funds with “prediction market” desks or strategies. As a result, I put very little stock in political prediction markets today. At best they’re populated by dumb money, and at worst they’re actively manipulated by campaigns or partisans who are not motivated by direct financial returns.

Nate Silver on a small Twitter thread on prediction markets: “Most of what makes political prediction markets dumb is that people assume they have expertise about election forecasting because they a) follow politics and b) understand ‘data’ and ‘markets’. Without more specific domain knowledge, though, that combo is a recipe for stupidity.”

  • Interestingly, I’ve recently found out that 538’s political predictions are probably underconfident, i.e., an 80% happens 88% of the time.
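A claim like this can be checked against a forecaster's record by grouping forecasts by stated probability and comparing each group with how often the events actually happened. A minimal sketch; the record below is made up to mirror the 14% and 88% figures quoted earlier in this newsletter, and is not 538's data:

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Map each stated probability (rounded to 10%) to (observed frequency, count)."""
    buckets = defaultdict(list)
    for probability, outcome in forecasts:
        buckets[round(probability, 1)].append(outcome)
    return {
        stated: (sum(outcomes) / len(outcomes), len(outcomes))
        for stated, outcomes in sorted(buckets.items())
    }

# Made-up record: 50 forecasts at 80% (44 came true) and 50 at 20% (7 came true).
record = [(0.8, 1)] * 44 + [(0.8, 0)] * 6 + [(0.2, 1)] * 7 + [(0.2, 0)] * 43

for stated, (observed, count) in calibration_table(record).items():
    print(f"stated {stated:.0%}  observed {observed:.0%}  n={count}")
```

Underconfidence shows up as observed frequencies that are more extreme than the stated ones: lower than stated below 50%, higher than stated above it.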

Deloitte forecasts US holiday season retail sales (but doesn’t provide confidence intervals).

Solar forecast. Sun to leave the quietest part of its cycle, but still remain relatively quiet and not produce world-ending coronal mass ejections, the New York Times reports.

The Foresight Institute organizes weekly talks; here is one with Samo Burja on long-lived institutions.

Some examples of failed technology predictions.

Last, but not least, Ozzie Gooen on Multivariate estimation & the Squiggly language:

Note to the future: All links are added automatically to the Internet Archive. In case of link rot, go there and input the dead link.

Littlewood’s law states that a person can expect to experience events with odds of one in a million (defined by the law as a “miracle”) at the rate of about one per month.
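The arithmetic behind the law is short. The sketch below uses the assumptions usually attributed to Littlewood, namely one "event" per second during roughly eight alert hours a day; treat both numbers as assumptions rather than measurements.

```python
EVENTS_PER_SECOND = 1          # assumed rate of events while alert
ALERT_HOURS_PER_DAY = 8        # assumed hours of alertness per day
MIRACLE_ODDS = 1_000_000       # a "miracle" is a one-in-a-million event

events_per_day = EVENTS_PER_SECOND * 3600 * ALERT_HOURS_PER_DAY
days_per_miracle = MIRACLE_ODDS / events_per_day

print(events_per_day)              # 28800
print(round(days_per_miracle, 1))  # 34.7
```

At that rate a million events go by in about 35 days, hence roughly one "miracle" per month.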