The Nudgerism section seems to be mushing together various psychology-related things which don’t have much to do with nudging.
Things like downplaying risks in order to prevent panic are at most very loosely related to nudging, and at least as ancient as the practice of placing objects at eye-level. Seems like an over-extension of focusing on “morale” and other Leaders of Men style attributes.
The main overlaps between the book Nudge and the awful The Cognitive Bias That Makes Us Panic About Coronavirus Bloomberg article are 1) they were both written by Cass Sunstein and 2) the one intervention that’s explicitly recommended in the Bloomberg article is publicizing accurate information about coronavirus risk probabilities.
One of the main themes of the nudge movement is that human behavior is an empirical field that can be studied, and one of the main flaws of the thing being called “nudgerism” is making up ungrounded (and often inaccurate) stories about how people will behave (such as what things will induce a “false sense of security”). These stories often are made by people without relevant expertise who don’t even seem to be trying very hard to make accurate predictions.
The British government has a Behavioural Insights Team which is colloquially known as the Nudge Unit; I’d guess that they didn’t have much to do with the screwups that are being called “nudgerism.”
I expect it will be easier to get Metaculus users to make forecasts on pundits’ questions than to get pundits to make forecasts on each other’s questions.
Suggested variant (with dates for concreteness):
Dec 1: deadline for pundits to submit their questions
Dec 10: metaculus announces the final version of all the questions they’re using, but does not open markets
Dec 20: deadline for pundits & anyone else to privately submit their forecasts (maybe hashed), and metaculus markets open
Dec 31: current metaculus consensus becomes the official metaculus forecast for the questions, and pundits (& anyone else) can publicize the forecasts that they made by Dec 20
Contestants (anyone who submitted forecasts by Dec 20) mainly get judged based on how they did relative to the Dec 31 metaculus forecast. I expect that they will mostly be pundits making forecasts on their own questions, plus forecasting aficionados.
(We want contestants & metaculus to make their forecasts simultaneously, with neither having access to the other’s forecasts, which is tricky since metaculus is a public platform. That’s why I have the separate deadlines on Dec 20 & Dec 31, with contestants’ forecasts initially private—hopefully that’s a short enough time period so that not much new information should arise, and long enough for people to have time to make forecasts.)
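The “maybe hashed” submission step could be an ordinary commit-reveal scheme: contestants publish a hash of their forecasts by Dec 20, then reveal the forecasts (plus a salt) after Dec 31. A minimal sketch (my elaboration, not part of the original proposal; the forecast string format is made up):

```python
import hashlib
import secrets

def commit(forecast):
    """Publish only the digest by Dec 20; keep the forecast & salt private."""
    salt = secrets.token_hex(16)  # random salt so small forecast spaces can't be brute-forced
    digest = hashlib.sha256((salt + forecast).encode()).hexdigest()
    return digest, salt

def verify(digest, forecast, salt):
    """After Dec 31, reveal forecast & salt so anyone can check the commitment."""
    return hashlib.sha256((salt + forecast).encode()).hexdigest() == digest

digest, salt = commit("Q1: 0.72, Q2: 0.15")
print(verify(digest, "Q1: 0.72, Q2: 0.15", salt))   # True: matches the commitment
print(verify(digest, "Q1: 0.80, Q2: 0.15", salt))   # False: forecast was altered
```

This keeps the Dec 20 forecasts binding without making them public before the Dec 31 metaculus snapshot.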
With only a small sample size of questions, it may be more meaningful to evaluate contestants based on how close they came to the official metaculus forecast rather than on how accurate they were (there’s a bias-variance tradeoff). As a contestant does more questions (this year or over multiple years), the comparison with what actually happened becomes more meaningful.
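To make the two kinds of judgment concrete, here’s a toy sketch of both (Brier scoring and all the numbers are my own illustration, not anything the contest would be committed to):

```python
def brier(forecast, outcome):
    """Brier score on a binary question: lower is better."""
    return (forecast - outcome) ** 2

# Hypothetical contestant vs. the official Dec 31 metaculus forecast on 3 questions
contestant = [0.7, 0.2, 0.9]
metaculus  = [0.6, 0.3, 0.8]
outcomes   = [1, 0, 1]   # how the questions resolved

# High-variance judgment: accuracy against what actually happened
accuracy_gap = sum(brier(c, o) - brier(m, o)
                   for c, m, o in zip(contestant, metaculus, outcomes))

# Low-variance judgment: closeness to the official metaculus forecast
closeness = sum(abs(c - m) for c, m in zip(contestant, metaculus))

print(accuracy_gap)   # negative: beat the official forecast on these questions
print(closeness)
```

With only a handful of questions, `accuracy_gap` is mostly luck, which is why the closeness comparison can be the more meaningful one early on.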
Maybe a nitpick, but the driver’s license posterior of 95% seems too high. (Or at least the claim isn’t stated precisely.) I’d have less than a 95% success rate at guessing the exact name string that appears on someone’s driver’s license. Maybe there’s a middle name between the “Mark” and the “Xu”, maybe the driver’s license says “Marc” or “Marcus”, etc.
I think you can get to 95% with a phone number or a wifi password or similar, so this is probably just a nitpick.
Although maybe not that disproportionate—one recent post was throwing off the search results. Without it, rationalish subreddits still show up a few times on the first couple pages of search results, but not overwhelmingly.
Searching for the phrase on Reddit does turn up a disproportionate number of hits from /r/slatestarcodex. So not LW-exclusive, but maybe unusually common around here. Possibly traceable to Weak Men Are Superweapons:
What is the problem with statements like this?
First, they are meant to re-center a category. Remember, people think in terms of categories with central and noncentral members – a sparrow is a central bird, an ostrich a noncentral one. But if you live on the Ostrich World, which is inhabited only by ostriches, emus, and cassowaries, then probably an ostrich seems like a pretty central example of ‘bird’ and the first sparrow you see will be fantastically strange.
Right now most people’s central examples of religion are probably things like your local neighborhood church. If you’re American, it’s probably a bland Protestant denomination like the Episcopalians or something.
The guy whose central examples of religion are Pope Francis and the Dalai Lama is probably going to have a different perception of religion than the guy whose central examples are Torquemada and Fred Phelps. If you convert someone from the first kind of person to the second kind of person, you’ve gone most of the way to making them an atheist.
It’s not a LW-distinctive phrase. Try searching Google News, for instance. It falls out of spatial models of concepts such as prototype theory, e.g. a robin is a central example of a bird while an ostrich is not.
The “all other money moved” bars on the first GiveWell graph (which I think represent donations from individual donors) do look a lot like exponential growth. Except 2015 was way above the trend line (and 2014 & 2016 a bit above too).
If you take the first and last data points (4.1 in 2011 & 83.3 in 2019), it’s a 46% annual growth rate.
If you break it down into four two-year periods (which conveniently matches the various little sub-trends), it’s:
2011-13: 46% annual growth (4.1 to 8.7)
2013-15: 123% annual growth (8.7 to 43.4)
2015-17: 3% annual growth (43.4 to 45.7)
2017-19: 35% annual growth (45.7 to 83.3)
2019 “all other money moved” is exactly where you’d project if you extrapolated the 2011-13 trend, although it does look like the trend has slowed a bit (even aside from the 2015 outlier) since 35% < 46%.
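Those growth rates come from the standard compound-growth formula; a quick check (dollar figures in millions, taken from the GiveWell numbers above):

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate, as a fraction (0.46 = 46%/yr)."""
    return (end / start) ** (1 / years) - 1

# GiveWell "all other money moved", in $ millions
money_moved = {2011: 4.1, 2013: 8.7, 2015: 43.4, 2017: 45.7, 2019: 83.3}

print(f"2011-19: {annual_growth_rate(4.1, 83.3, 8):.0%}/yr")   # full-period trend

# The four two-year sub-trends
years = sorted(money_moved)
for a, b in zip(years, years[1:]):
    rate = annual_growth_rate(money_moved[a], money_moved[b], b - a)
    print(f"{a}-{b}: {rate:.0%}/yr")
```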
If GiveWell shares the “number of donors” count for each year that trend might be smoother (less influenced by a few very large donations), and more relevant for this question of how much EA has been growing.
Funding from Open Phil / Good Ventures looks more like a step function, with massive ramping up in 2013-16 and then a plateau (with year-to-year noise). Which is what you might expect from a big foundation—they can ramp up spending much faster than what you’d see with organic growth, but that doesn’t represent a sustainable exponential trend (if Good Ventures had kept ramping up at the same rate then they would have run out of money by now).
The GWWC pledge data look like linear growth since 2014, rather than exponential growth or a plateau.
On the whole it looks like there has been growth over the past few years, though the growth rate is lower than it was in 2012-16 and the amount & shape of the growth differs between metrics.
It appears Operation Warp Speed had to be funded by raiding other sources because Congress couldn’t be bothered to fund it. As MR points out, this is a scandal because it was necessary, rather than because it was done. It’s scary, because it implies that under a different administration Operation Warp Speed could easily have not happened at all.
There are gaps in the reporting on Operation Warp Speed funding, because apparently a bunch of the money that Congress did allocate for vaccines hasn’t been spent yet. I don’t understand why the White House spent other money but not that money.
Voting is like donating thousands of dollars to charity
If you care about social impact, why is voting important?
There are advantages to this style of writing even when the general term isn’t contentious.
These kinds of concrete descriptions encourage readers to look at the world and see what’s there, rather than engaging primarily with you and your concepts.
This can be good for people who know less about the topic, since looking at the world has fewer prerequisites. And it can be good for people who know more about the topic, since they can gain texture and depth by looking at new examples.
Though with non-contentious topics it’s easier to add a general term at the end as a label to remember, or to tie the post into a larger conversation, without overshadowing the rest of the post.
Related: Insights from ‘The Strategy of Conflict’
The full-blown process of in-depth contract negotiations, etc., is presumably beyond the scope of the current competitive forecasting arena.
One of the main things that I get out of the sports comparison is that it points to a different way of using (and thinking of) metrics. The obvious default, with forecasting, is to think of metrics as possible scoring rules, where the person with the highest score wins the prize (or appears first on the leaderboard). In that case, it’s very important to pick a good metric, one which provides good incentives.

An alternative is to treat human judgment as primary, whether that means a committee using its judgment to pick which forecasters win prizes, forecasters voting on an all-star team, an employer trying to decide who to hire to do some forecasting for them, or informal street cred in the forecasting community. Metrics are then a way to help those people be better informed about forecasters’ abilities & performance, so that they’ll make better judgments. In that case, the standards for what makes a metric worth including are very different.

(There’s also a third use case for metrics, where the forecaster uses metrics about their own performance to try to get better at forecasting.)
Sports also provide an example of what this looks like in action: what sorts of stats exist, how they’re presented, who came up with them, what sort of work went into creating them, how analysts evaluate different stats and decide which ones to emphasize, etc. And it seems plausible that similar work could be done with forecasting, since much of the sports work was done by fans who are nerds rather than by the teams; forecasting has fewer fans but a higher nerd density. I did some brainstorming in another comment on some potential forecasting stats, which draws a lot of inspiration from this; not sure how much of it is retreading familiar ground.
Here’ s a brainstorm of some possible forecasting metrics which might go in those tables (probably I’m reinventing some wheels here; I know more about existing metrics for sports than for forecasting):
Leading Indicator: get credit for making predictions if the consensus then moves in the same direction over the next hours / days / n predictions (alternate version: only if that movement winds up being towards the true outcome)
Points Relative to Your Expectation: each forecast has an expected score according to that forecast (e.g., if the consensus is 60% and you say 80%, you think there’s a 0.8 chance you’ll gain points for doing better than the consensus and a 0.2 chance you’ll lose points for doing worse than consensus). Report expected score alongside actual score, or report the ratio actual/expected. If that ratio is > 1, that means you’ve been underconfident or (more likely) lucky. Also, expected score is similar to “total number of forecasts”, weighted by boldness of forecasts. You could also have a column for the consensus expected score (in the example: your expected score if there was only a 0.6 chance you’d gain points and a 0.4 chance you’d lose points).
Marginal Contribution to Collective Forecast: have some way of calculating the overall collective forecast on each question (which could be just a simple average, or could involve fancier stuff to try to make it more accurate including putting more weight on some people’s forecasts than others). Also calculate what the overall collective forecast would have been if you’d been absent from that question. You get credit for the size of the difference between those two numbers. (Alternative versions: you only get credit if you moved the collective forecast in the right direction, or you get negative credit if you moved it in the wrong direction.)
Trailblazer Score: use whatever forecasting accuracy metric you like (e.g. Brier score relative to consensus), but only include cases where a person’s forecast differed from the consensus at the time by at least X. Relevant in part because noticing that the consensus seems off and adjusting a bit in the right direction might be a different skillset from coming up with your own forecast and trusting it even when it’s far from consensus. (And the latter skillset might be the relevant one if you’re making forecasts on your own, without the benefit of a platform consensus to start from.)
Market Mover: find some way to track which comments lead to people changing their forecasts. Credit those commenters based on how much they moved the market. (alternative version: only if it moved towards truth)
Pseudoprofit: find some way to transform people’s predictions into hypothetical bets against each other (or against the house), and track each person’s total profit & total amount “bet”. (I’m not sure if this leads to different calculations or if it’s just a different gloss on the same ones.)
Splits: tag each question, and each forecast, with various features. Tags by topic (coronavirus, elections, technology, etc.), by what sort of event it’s about (e.g. will people accomplish a thing they’re trying to do), by amount of activity on the question, by time till event (short term vs. medium term vs. long term markets), by whether the question is binary or continuous, by whether the forecast was placed early vs. middle vs. late in the duration of the question, etc. Be able to show each scoring table only for the subset of forecasts that fit a particular tag.
Predicted Future Rating: On any metric, you can set up formulas to predict what people will score on that metric over the next (period of time / set of markets). A simple way to do that is to just predict future scores on that metric based on past scores on the same metric, with some regression towards the mean, using historical data to estimate the relationship. But there are also more complicated things using past performance on some metrics (especially less noisy ones) to help predict future performance on other metrics. And also analyses to check whether patterns in past data are mostly signal or noise (e.g. if a person appears to have improved over time, or if they have interesting splits). (Finding a way to predict future scores is a good way to come up with a comprehensive metric, since it involves finding an underlying skill from among the noise. And the analyses can also provide information about how important different metrics are, which ones to include in the big table, which ones to make more prominent.)
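A few of the metrics above are concrete enough to sketch as toy implementations (Brier-style scoring, a plain-average aggregator, and all the example numbers are my own assumptions, just to make the definitions precise):

```python
def expected_vs_actual(p, consensus, outcome):
    """Points Relative to Your Expectation, using relative Brier vs. the consensus.
    Returns (actual score, the score you expected under your own forecast p)."""
    def rel_score(happened):
        return (consensus - happened) ** 2 - (p - happened) ** 2
    actual = rel_score(outcome)
    expected = p * rel_score(1) + (1 - p) * rel_score(0)
    return actual, expected   # actual/expected > 1 suggests luck or underconfidence

def marginal_contribution(forecasts, i):
    """Marginal Contribution to Collective Forecast, with a plain average
    as the aggregator (fancier weighting is also possible)."""
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(forecasts) - mean(forecasts[:i] + forecasts[i + 1:]))

def trailblazer(records, threshold=0.15):
    """Trailblazer Score: relative Brier vs. consensus, counting only forecasts
    at least `threshold` from the consensus at the time.
    records: list of (forecast, consensus, outcome)."""
    bold = [r for r in records if abs(r[0] - r[1]) >= threshold]
    return sum((c - o) ** 2 - (p - o) ** 2 for p, c, o in bold) / len(bold)

# The 60%-consensus / 80%-forecast example from above, resolving yes:
print(expected_vs_actual(0.8, 0.6, 1))            # actual ≈ 0.12, expected ≈ 0.04
print(marginal_contribution([0.5, 0.6, 0.9], 2))  # how much forecaster 2 moved the average
print(trailblazer([(0.9, 0.6, 1), (0.55, 0.5, 0), (0.1, 0.4, 0)]))  # middle forecast too timid to count
```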
The thing that I was more surprised by, looking at the scoring system, is that Metaculus is set up as a platform for maintaining a forecast rather than as a place where you make a forecast at a particular time. (If I’m understanding the scoring correctly.)
Metaculus scores your current forecast at each moment, from the moment you first enter a forecast on the question until the moment the question closes. Where “your current forecast” at each moment is the most recent number that you entered, and the only thing that happens when you enter an updated prediction is that for the rest of the moments (until you update it again) “your current forecast” will be a different number. Every moment gets equal weight regardless of whether you last entered a number just now or three weeks ago (except that the very last moment when the question closes gets extra weight).
So it’s not like a literal betting market where you’re buying at the current market price at the moment that you make your forecast. If you don’t keep updating your forecast, then you-at-that-moment is going up against the future consensus forecast.
So the scoring system rewards the activity of entering more questions, and also the activity of updating your forecasts on each of those questions again and again to keep them up-to-date.
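As a toy illustration of that reading of the scoring (my sketch, not Metaculus’s actual formula; I’m using Brier scoring and ignoring the extra weight on the final moment):

```python
def time_averaged_brier(updates, close_time, outcome):
    """Score a question the way described above: your most recent entry is
    "your current forecast" at every moment until the question closes.

    updates: list of (time_entered, probability), sorted by time.
    """
    total = 0.0
    spans = zip(updates, updates[1:] + [(close_time, None)])
    for (t, p), (t_next, _) in spans:
        duration = t_next - t               # how long this entry stayed current
        total += duration * (p - outcome) ** 2
    return total / (close_time - updates[0][0])   # equal weight per moment

# Enter 0.5 on day 0; question closes day 10 and resolves yes.
never_updates = time_averaged_brier([(0, 0.5)], 10, 1)
early_updater = time_averaged_brier([(0, 0.5), (8, 0.9)], 10, 1)
print(never_updates, early_updater)   # updating to 0.9 on day 8 improves the score
```

This is what makes stale forecasts costly: the day-0 entry keeps getting scored at every later moment until you replace it.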
There was a lesswrong post about this a while back
I was also imagining the distinctions of
adaptation-executers vs. fitness-maximizers
selection + unconscious reinforcement vs. conscious strategizing
which are similar.
And neither of you voted for it!
Seems like a good thing to check in principle, but my guess is it won’t make much difference for this or other posts. AI posts got about as many nonzero votes as other posts, and the ranking of posts by avg vote is almost the same as the official ranking by total votes.
For the 2019 Review, I think it would’ve helped if you/Rob/others had posted something like this as reviews of the post. Then voters would at least see that you had this take, and maybe people who disagree would’ve replied there which could’ve led to some of this getting hashed out in the comments.