Raising the forecasting waterline (part 1)
Previously: Raising the waterline. See also: 1001 PredictionBook Nights (LW copy), Techniques for probability estimates.
Low waterlines imply that it’s relatively easy for a novice to outperform the competition. (In poker, as discussed in Nate Silver’s book, the “fish” are those who can’t master basic techniques such as folding when they have a poor hand, or calculating even roughly the expected value of a pot.) Does this apply to the domain of making predictions? It’s early days, but it looks as if a smallish set of tools—a conscious status quo bias, respecting probability axioms when considering alternatives, considering reference classes, leaving yourself a line of retreat, detaching from sunk costs, and a few more—can at least place you in a good position.
A bit of backstory
Like perhaps many LessWrongers, my first encounter with the notion of calibrated confidence was “A Technical Explanation of Technical Explanation”. My first serious stab at publicly expressing my own beliefs as quantified probabilities was the Amanda Knox case—an eye-opener, waking me up to how everyday opinions could correspond to degrees of certainty, and how these had consequences. By the following year, I was trying to improve my calibration for work-related purposes, and playing with various Web sites, like PredictionBook or Guessum (now defunct).
Then the Good Judgment Project was announced on Less Wrong. Like several of us, I applied, unexpectedly got in, and started taking forecasting more seriously. (I tend to apply myself somewhat better to learning when there is a competitive element—not an attitude I’m particularly proud of, but being aware of that is useful.)
The GJP is both a contest and an experimental study, in fact a group of related studies: several distinct groups of researchers (1,2,3,4) are being funded by IARPA to each run their own experimental program. Within each, smaller or larger numbers of participants have been recruited, allocated to different experimental conditions, and encouraged to compete with each other (or even, as far as I know, in some experimental conditions, to collaborate). The goal is to make predictions about “world events”—and if possible to get them more right, collectively, than we would individually.1
Tool 1: Favor the status quo
The first hint I got that my approach to forecasting needed more explicit thinking tools was a blog post by Paul Hewitt I came across late in the first season. My scores in that period (summer 2011 to spring 2012) had been decent but not fantastic; I ended up 5th on my team, which itself placed quite modestly in the contest.
Hewitt pointed out that in general, you could do better than most other forecasters by favoring the status quo outcome.2 This may not quite be on the same order of effectiveness as the poker advice to “err on the side of folding mediocre hands more often”, but it makes a lot of sense, at least for the Good Judgment Project (and possibly for many of the questions we might worry about). Many of the GJP questions refer to possibilities that loom large in the media at a given time, that are highly available—in the sense of the availability heuristic. This results in a tendency to favor forecasts of change from the status quo.
For instance, one of the Season 1 questions was “Will Marine LePen cease to be a candidate for President of France before 10 April 2012?” (also on PredictionBook). Just because the question is being asked doesn’t mean that you should assign “yes” and “no” equal probabilities of 50%, or even close to 50%, any more than you should assign 50% to the proposition “I will win the lottery”.
Rather, you might start from a relatively low prior probability that anyone who undertakes something as significant as a bid for national presidency would throw in the towel before the contest even starts. Then, try to find evidence that positively favors a change. In this particular case, there was such evidence: the National Front, of which she was the candidate, had consistently reported difficulties rounding up the endorsements required to register a candidate legally. However, only once in the past (1981) had this resulted in their candidate being barred (admittedly a very small sample). It would have been a mistake to weigh that evidence excessively. (I got a good score on that question, compared to the team, but definitely owing to a “home ground advantage” as a French citizen rather than to superior forecasting skill.)
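To make that concrete, here is a minimal sketch of the kind of update involved, in odds form. The prior (5%) and the likelihood ratio (2:1) are purely hypothetical illustration values, not the numbers I actually used.

```python
# Minimal odds-form Bayesian update. The numbers are hypothetical,
# chosen only to illustrate "start from a low prior, update moderately".

def update(prior, likelihood_ratio):
    """Posterior probability after applying a likelihood ratio to a prior."""
    prior_odds = prior / (1 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)

prior = 0.05     # hypothetical base rate for "a major candidate withdraws"
evidence_lr = 2  # hypothetical strength of the "trouble collecting endorsements" evidence

print(update(prior, evidence_lr))  # ~0.09 -- still firmly in "unlikely" territory
print(update(prior, 20))           # it would take far stronger evidence to push past 50%
```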
Tool 2: Flip the question around
The next technique I try to apply consistently is respecting the axioms of probability. If the probability of event A is 70%, then the probability of not-A is 30%.
This may strike everyone as obvious… it’s not. In Season 2, several of my team-mates are on record as assigning a 75% probability to the proposition “The number of registered Syrian conflict refugees reported by the UNHCR will exceed 250,000 at any point before 1 April 2013”.
That number was reached today, six months ahead of the deadline; that it would be was clear as early as August. The trend in the past few months has been an increase of 1,000 to 2,000 a day, and the UNHCR have recently estimated that this number will eventually reach 700,000. The kicker is that this number only counts people who are fully processed by the UNHCR administration and officially in their database; there are tens of thousands more in the camps who only have “appointments to be registered”.
I’ve been finding it hard to understand why my team-mates haven’t been updating to, if not 100%, then at least 99%, and why these aren’t seen as the only answers worth considering. At any point in the past few weeks, to state your probability as 85% or 91% (as some have quite recently) was to say, “There is still a one in ten chance that the Syrian conflict will suddenly stop and all these people will go home, maybe next week.”
This is kind of like saying “There is a one in ten chance Santa Claus will be the one distributing the presents this year.” It feels like a huge “clack”.
I can only speculate as to what’s going on there. Queried for a probability, people are translating something like “Sure, A is happening” into a biggish number, and reporting that. They are totally failing to flip the question around and explicitly consider what it would take for not-A to happen. (Perhaps, too, people have been so strongly cautioned by Tetlock and others against overconfidence that they reflexively shy away from the extreme numbers.)
Just because you’re expressing beliefs as percentages doesn’t mean that you are automatically applying the axioms of probability. Just because you use “75%” as a shorthand for “I’m pretty sure” doesn’t mean you are thinking probabilistically; you must train the skill of seeing that assigning 75% to an event commits you to assigning 25% to its complement, and that “25%” is itself a way of saying “I’m pretty sure” (that it won’t happen). The axioms are more important than the use of numbers—in fact for this sort of forecast “91%” strikes me as needlessly precise; increments of 5% are more than enough, away from the extremes.
Tool 3: Reference class forecasting
The order in which I’m discussing these “basics of forecasting” reflects not so much their importance, as the order in which I tend to run through them when encountering a new question. (This might not be the optimal order, or even very good—but that should matter little if the waterline is indeed low.)
Using reference classes was actually part of the “training package” of the GJP. From the linked post comes the warning that “deciding what’s the proper reference class is not straightforward”. And in fact, this tool only applies in some cases, not systematically. One of our recently closed questions was “Will any government force gain control of the Somali town of Kismayo before 1 November 2012?”. Clearly, you could spend quite a while trying to figure out an appropriate reference class here. (In fact, this question also stands as a counter-example to the “Favor status quo” tool, and flipping the question around might not have been too useful either. All these tools require some discrimination.)
On the other hand, it came in rather handy in assessing the short-term question we got in late September: “What change will occur in the FAO Food Price index during September 2012?”—with barely two weeks to go before the FAO was to post the updated index in early October. More generally, it’s a useful tool when you’re asked to make predictions regarding a numerical indicator for which you can observe past data.
The FAO price data can be retrieved as a spreadsheet (.xls download). Our forecast question divided the outcomes into five: A) an increase of 3% or more, B) an increase of less than 3%, C) a decrease of less than 3%, D) a decrease of more than 3%, E) “no change”—meaning a change too small to alter the value rounded to the nearest integer.
It’s not clear from the chart that there is any consistent seasonal variation. A change of 3% would have been about 6.4 points; since 8/2011 there had been four month-on-month changes of that magnitude, 3 decreases and 1 increase. Based on that reference class, the probability of a small change (B+C+E) came out to about 2⁄3, and the probability of “no change” (E) to about 1⁄12: the August price had been the same as the July price. The probability of an increase (A+B) came out roughly the same as that of a decrease (C+D). My first-cut forecast allocated the probability mass as follows: 15/30/30/15/10.
However, I figured I did need to apply a correction, based on reports of a drought in the US that could lead to some food shortages. I took 10% probability mass from the “decrease” outcomes and allocated it to the “increase” outcomes. My final forecast was 20/35/25/10/10. I didn’t mess around with it any more than that. As it turned out, the actual outcome was B! My score was bettered by only 3 forecasters, out of a total of 9.
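For the curious, here is a sketch of the reference-class step in Python. The index values below are made-up placeholders rather than the actual FAO figures; the point is just the mechanics of turning past month-on-month changes into a first-cut allocation over the five outcomes, which you then adjust by hand as described above.

```python
# Reference-class sketch: bucket past month-on-month changes of an index
# into the five forecast outcomes and use the frequencies as a first cut.
# The values below are placeholders, NOT the real FAO Food Price Index data.

monthly_index = [231, 228, 225, 216, 214, 211, 212, 215, 213, 204, 201, 200, 213, 213]

buckets = {"A: increase of 3% or more": 0,
           "B: increase of less than 3%": 0,
           "C: decrease of less than 3%": 0,
           "D: decrease of more than 3%": 0,
           "E: no change (after rounding)": 0}

for prev, cur in zip(monthly_index, monthly_index[1:]):
    change_pct = (cur - prev) / prev * 100
    if round(cur) == round(prev):
        buckets["E: no change (after rounding)"] += 1
    elif change_pct >= 3:
        buckets["A: increase of 3% or more"] += 1
    elif change_pct > 0:
        buckets["B: increase of less than 3%"] += 1
    elif change_pct > -3:
        buckets["C: decrease of less than 3%"] += 1
    else:
        buckets["D: decrease of more than 3%"] += 1

n = len(monthly_index) - 1
for outcome, count in buckets.items():
    print(f"{outcome}: {count}/{n} = {count / n:.0%}")
```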
Next up: lines of retreat, ditching sunk costs, loss functions
This post has grown long enough, and I still have 3+ tools I want to cover. Stay tuned for Part 2!
1 The GJP is being run by Phil Tetlock, known for his “hedgehog and fox” analysis of forecasting. At the time I wasn’t aware of the competing groups—one of them, DAGGRE, is run by Robin Hanson (of OB fame) among others, which might have made it an appealing alternative if I’d known about it.
2 Unfortunately, the experimental condition Paul belonged to used a prediction market where forecasters “bet” virtual money on predictions; this makes it hard to translate the numbers he provides into probabilities. The general point is still interesting.
GJP has pointed out already that forecasters are not updating as fast as they could. I assume a lot of forecasters are like me in rarely updating their predictions.
(In season 2, out of distaste for the new UI, I’ve barely been participating at all.)
So, when do you predict you’ll post Part 2?
I really, really want to answer “shortly” and leave it at that.
But since you ask, 45% chance I’ll do it tomorrow, 25% over the weekend, 10% Monday, and 20% later than Monday.
Predictions involving what I’m gonna do are trickier for me, because there’s a feedback loop between the act of making a prediction, and my likelihood of taking the corresponding actions once the prediction has turned them into a public commitment; it’s a complicated one which sometimes triggers procrastination, sometimes increased motivation.
Thanks. Your prediction is now recorded on PredictionBook.
I hope you don’t take it personally, but my estimate that you’ll have the essay ready by tomorrow is lower than yours. Even those who, like Kahneman, know that the inside view yields overoptimistic estimates in cases of this sort tend to rely on it more than they should.
Of course, the fact that I’m making this prediction might also enter into the feedback loop you describe. I suspect the overall effect is that your prediction is now more likely to be true as a consequence of my having publicly given a lower estimate than you did.
Congrats!
By the way, when you say “I really, really want to answer ‘shortly’”, is this just because you sometimes dislike giving precise estimates, or do you think there is sometimes a rational justification for this reluctance? Without having thought about the matter carefully, it seems to me that the only valid reason for abstaining from giving precise estimates is that one’s audience might make assumptions about the reliability of the estimate from the fact that it is expressed in precise language (more precision suggests higher reliability). But provided one gives independent reliability measures (by e.g. being explicit about one’s confidence intervals), can this reluctance still be justified?
It’s because it’s now 3am and I’ve stuck a knife in the back of my tomorrow-self, who will wake up sleep deprived, so that my present-self (with a 1500 word first draft completed) can enjoy the certainty of hitting an estimate which was only that, not a commitment. Hyperbolic discounting is a royal pain.
It’s because I’m a sucker for this kind of thing, as are many of my colleagues working in software development. :-/
Talking about increments of 5% runs counter to my intuitions regarding good thinking about probability estimates. For most purposes, the difference between 90% and 95% is significantly larger than the difference between 50% and 55%. Think in logs.
Yes, near the extremes it makes a difference—but we’re using a Brier scoring rule, averaged over all days a forecast is open. That makes thinking in logs less important: 99% isn’t much worse than 100% on errors. I’ll discuss that in part 2 under “loss function”.
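A quick sketch (mine, not anything from the GJP tooling) of the two penalties for a forecast that turns out wrong, using the single-number squared-error convention discussed elsewhere in this thread:

```python
import math

# Penalty when you assign probability p to an event that does NOT happen.
# Brier (as used here): squared error on the "yes" probability, i.e. p**2.
# Log score: -ln(probability assigned to what actually happened), i.e. -ln(1 - p).

for p in (0.50, 0.55, 0.90, 0.95, 0.99, 0.999):
    print(f"p={p:.3f}  Brier penalty={p ** 2:.4f}  log penalty={-math.log(1 - p):.2f}")

# Under Brier, 99% scarcely differs from 100% (0.9801 vs 1.0) when you're wrong;
# under a log rule, 99% costs ~4.6 and 100% costs infinitely much.
```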
Hooray!
It depends on whether you’re using probabilities epistemically or instrumentally. Changing the probability of A from 90% to 95% doesn’t affect your expected utility any more than changing it from 50% to 55%.
The change in expected utility given constant decisions is the same for any 5% change in probability regardless of where the baseline is for the change. However, that “given constant decisions” criterion may be less likely to hold for a change from 90-95% than it is for a change from 50-55%. If you have to choose whether to risk a negative consequence of not-A in exchange for some benefit, for example, then it matters whether the expected negative utility of not-A just fell by a tenth or by half.
Yeah, that’s why I said “For most purposes”.
I vaguely recall some academic work showing this to be true, or more generally if you’re predicting the variable X_t over time, the previous period’s value tends to be a better predictor than more complicated models. Can anyone confirm/deny my memory? And maybe provide a citation?
This is a theme of multiple papers in the 2001 anthology Principles of Forecasting (a PDF of which is findable online), to give a specific citation.
Thanks! That’s exactly the sort of thing I was looking for, and maybe remembering.
These get called AR(1) models, for “autoregressive of order 1”.
Most complicated models that I’m familiar with include both the previous value and other factors (since there is generally more going on than a random walk).
Would expressing these things in terms of odds rather than probability make it easier to avoid this error?
Dunno. I have trouble with odds, for some reason, and rarely if ever think in terms of odds.
[comment deleted]
Well, yes. But ought I believe that a yes/no question I have no idea about is as likely as its negation to have been asked? (Especially if it’s being asked implicitly by a situation, rather than explicitly by a human?)
[comment deleted]
Ratio of true statements to false ones: low. Probability TraderJoe wants to make TheOtherDave look foolish: moderate, slightly on the higher end. Ratio of the probability that giving an obviously false statement an answer of relatively high probability would make TheOtherDave look foolish to the probability that giving an obviously true statement a relatively low probability would make TheOtherDave look foolish: moderately high. Probability that the statement is neither true nor false: low.
Conclusion: أنا من (أمريك is most likely false.
That’s interesting.
I considered a proposition like this, decided the ratio was roughly even, concluded that TraderJoe might therefore attempt to predict my answer (and choose their question so I’d be wrong), decided they’d have no reliable basis on which to do so and would know that, and ultimately discarded the whole line of reasoning.
I figured that it would be more embarrassing to say something like “It is true that I am a sparkly unicorn” than to say “It is false that an apple is a fruit”. Falsehoods are much more malleable, largely because there are so many more of them than truths, and also because they don’t have to be consistent. Since falsehoods are more malleable, it seems that they’d be more likely to be the ones used in an attempt to insult someone.
My heuristic in situations with recursive mutual modeling is to assume that everyone else will discard whatever line of reasoning is recursive. I then go one layer deeper into the recursion than whatever the default assumption is. It works well.
Sadly, I appear to lack your dizzying intellect.
I used to play a lot of Rock, Paper, Scissors; I’m pretty much a pro.
It is possible that you may have missed TheOtherDave’s allusion there.
The phrase sounded familiar, but I don’t recognize where it’s from and a Google search for “lack your dizzying intellect” yielded no results.
Wait. Found it. Princess Bride? Is it in the book too, or just the movie?
Read the book years ago, but can’t recall if that phrase is in there. In any case, yes, that’s what I was referring to… it’s my favorite fictional portrayal of recursive mutual modeling.
The one I always think of is Poe’s “The Purloined Letter”:
I wonder if there is an older appearance of this trope or if this is the Ur Example? (*checks TvTropes). The only older one listed is from the Romance of the Three Kingdoms, so Poe’s might be the Ur Example in Western culture.
I’m not sure what this phrase means.
It means making an accurate mental simulation of your opponent’s mental process to predict to which level they will iterate.
Here it is—the classic “battle of wits” scene from The Princess Bride. (This clip cuts off before the explanation of the trick used by the victor.)
Both. [EDITED: oops, no, misread you. Definitely in the movie; haven’t read the book.]
Preempt: None of you have any way of knowing whether this is a lie.
The parent of this comment (yes, this one) is a lie.
The parent of this comment (yes, this one) is a lie.
The parent of this comment is true. On my honor as a rationalist.
I would like people to try to solve the puzzle.
This comment (yes, this one) is true.
I think the solution is that you have no honor as a rationalist.
The solution I had in mind is:
“None of you have any way of knowing whether this is a lie” is false because although you can’t definitively prove what my process is or isn’t you’ll still have access to information that allows you to assess and evaluate whether I was probably telling the truth.
Although “none of you have any way of knowing whether this is a lie” is false and thus my first instance of “the parent of this comment is a lie” seems justified, in reality the first instance of that statement is not true. The first instance of that statement is a lie because although “none of you have any way of knowing whether or not this is true” is false, it does not follow that it was a lie. In actuality, I thought that it was true at the time that I posted it, and only realized afterwards that it was false. There was no intent to deceive.
Therefore the grandparent of this comment is true, the great-grandparent is true, the great-great-grandparent is false, and the great-great-great-grandparent is inaccurate.
This whole line of riddling occurred because:
I wanted to confuse people, so they failed to properly evaluate the way I model people.
I wanted to distract people, so they chose not to bother properly evaluating the way I model people.
I wanted to amuse myself by pretending that I was the kind of person who cared about the above two.
I was wondering whether anyone would call me out on any of those.
I’m severely tempted to just continue making replies to myself and see how far down the rabbit hole I can get.
I laughed. The solution involves the relativity of wrong, if that helps.
PBEERPG.
I assume you mean without looking it up.
My answer is roughly the same as TimS’s… it mostly depends on “Would TraderJoe pick a true statement in this context or a false one?” Which in turn mostly depends on “Would a randomly selected LWer pick a true statement in this context or a false one?” since I don’t know much about you as a distinct individual.
I seem to have a prior probability somewhat above 50% for “true”, though thinking about it I’m not sure why exactly that is.
Looking it up, it amuses me to discover that I’m still not sure if it’s true.
This is a perfect situation for a poll.
How probable is it that TraderJoe’s statement, in the parent comment, is true?
[pollid:116]
I voted with what I thought my previous estimate was before I’d checked via rot13.
[comment deleted]
It seems like my guess should be based on how likely I think it is that you are trying to trick me in some sense. I assume you didn’t pick a sentence at random.
[comment deleted]
[comment deleted]
The transliteration does, but the actual Arabic means “V’z Sebz Nzrevpn”.
So in fact TraderJoe’s prediction of 0.5 was a simple average over the two statements given, and everyone else giving a prediction failed to take into account that the answer could be neither “true” nor “false”.
Not according to Google Translate. Incidentally, that string is particularly easy to decipher by inspection.
[comment deleted]
Yeah, that’s an interesting discrepancy.
All questions that you encounter will be asked by a human. I get what you mean though, if other humans are asking a human a question then distortions are probably magnified.
Some questions are implicitly raised by a situation. “Is this coffee cup capable of holding coffee without spilling it?”, for example. When I pour coffee into the cup, I am implicitly expressing more than 50% confidence that the answer is “yes”.
What I’m saying is that what’s implicit is a fact about you, not the situation, and the way the question is formed is partially determined by you. I was vague in saying so, however.
I agree that the way the question is formed is partially determined by me. I agree that there’s a relevant implicit fact about me. I disagree that there’s no relevant implicit fact about the situation.
Nothing can be implicit without interpretation; sometimes the apparent implications of a situation are just misguided notions that we have inside our heads. You’re going to have a natural tendency to form your questions in certain ways, and some of these ways will lead you to ask nonsensical questions, such as questions with contradictory expectations.
I agree that the apparent implications of a situation are notions in our heads, and that sometimes those notions are nonsensical and/or contradictory and/or misguided.
That’s only reasonable if some agent is trying to maximize the information content of your answer. The vast majority of possible statements of a given length are false.
Sure, but how often do you see each of the following sentences in some kind of logic discussion: 2+2=3, 2+2=4, 2+2=5, 2+2=6, 2+2=7?
I have seen the first and third from time to time, the second more frequently than any other, and virtually never see 2+2=n for n > 5. Not all statements are shown with equal frequency. My guess is that when “2+2 = x” is written as a true/false proposition rather than as an equation to solve, x = 4 is more common than all other values put together.
That’s surely an artifact of human languages, and even so it would depend on whether the statement is mostly structured using “or” or using “and”.
There’s a 1-to-1 mapping between true and false statements (just add ‘the following is false:’ in front of each statement to get the opposite). In a language where ‘the following is false’ is assumed, the reverse would hold.
I’m not sure your statement is true.
Consider:
The sky is blue.
The sky is red.
The sky is yellow.
The sky is pink.
The sky is not blue. The sky is not red. The sky is not yellow. The sky is not pink.
Anyway, it depends on what you mean by “statement”. The vast majority of all possible strings are ungrammatical, the vast majority of all grammatical sentences are meaningless, and most of the rest refer to different propositions if uttered in different contexts (“the sky is ochre” refers to a true proposition if uttered on Mars, or when talking about a picture taken on Mars).
The typical mode of communication is an attempt to convey information by making true statements. One only brings up false statements in much rarer circumstances, such as when one entity’s information contradicts another entity’s information. Thus, an optimized language is one where true statements are high in information.
Otherwise, to communicate efficiently, you’d have to go around making a bunch of statements with an extraneous “not” above the default for the language, which is weird.
This has the potential to be trans-human, I think.
But whether a statement is true or false depends on things other than the language itself. (The sentence “there were no aces or kings in the flop” is the same length whether or not there were any aces or kings in the flop.) The typical mode of communication is an attempt to convey information by making true but non-tautological statements (for certain values of “typical”—actually implicatures are often at least as important as truth conditions). So, how would such a mechanism work?
But, on the other hand:
The sky is not blue. The sky is not red. The sky is not yellow. The sky is not pink.
You need to be more specific about what exactly it is I said that you’re disputing—I am not sure what it is that I must ‘consider’ about these statements.
On further consideration, I take it back. I was trying to make the point that “Sky not blue” != “Sky is pink”. Which is true, but does not counter your point that (P or !P) must be true by definition.
It is the case that the vast majority of grammatical statements of a given length are false. But until we have a formal way of saying that statements like “The Sky is Blue” or “The Sky is Pink” are more fundamental than statements like “The Sky is Not Blue” or “The Sky is Not Pink,” you must be correct that this is an artifact of the language used to express the ideas. For example, a language where negation was the default and additional length was needed to assert truth would have a different proportion of true and false statements for any given sentence length.
Also, lots of downvotes in this comment path (on both sides of the discussion). Any sense of why?
It’s true of any language optimized for conveying information. The information content of a statement is reciprocal to its prior probability, and therefore more or less proportional to how many other statements of the same form would be false.
In your counterexample, the information content of a statement in the basic form decreases with length.
Yup. Similarly you don’t assign 50% to the proposition “X will change”, where X is a relatively long-lasting feature of the world around you—long-lasting enough to have been noticed as such in the first place and given rise to the hypothesis that it will change. (In the Le Pen prediction, the important word is “cease”, not “Le Pen” or “election”.)
ETA: what I’m getting at is that nobody gives a damn about the class of question “yes/no question which I have no idea about”. The subthread about these questions is a red herring. When a question comes up about “world events”, you have some idea of the odds for change vs status quo based on the general category of things that the question is about. For instance many GJP questions are of the form “Will Prime Minister of Country X resign or otherwise vacate that position within the next six months?”. Even if you are not familiar with the politics of Country X, you have some grounds for thinking that the “No” side of the question is more likely than the “Yes” side—for having an overall status quo bias on this type of question.
That reminds me of a question about judging predictions: Is there any established method to say “x made n predictions, was underconfident / calibrated properly / overconfident and the quality of the predictions was z”? Assuming the predictions are given as “x will happen (y% confidence)”.
It is easy to make 1000 unbiased predictions about lottery drawings, but this does not mean you are good in making predictions.
Yes: use a scoring rule to rate your predictions, giving you an overall evaluation of their quality. If you use, say, the Brier score, that admits decompositions into separate components, for instance “calibration” and “refinement”; if your “refinement” score was high on the lottery drawings, meaning that you’d assigned higher probabilities of winning to the people who did in fact win (as opposed to correctly calling the probabilities of winning overall), you’d be a suspect for game-rigging or psi powers. ;)
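For concreteness, here is a rough sketch of that decomposition (the Murphy decomposition of the Brier score for binary events, with forecasts grouped by their stated probability). This is my own illustration, not GJP’s scoring code; the toy data at the bottom are invented.

```python
from collections import defaultdict

def brier_decomposition(forecasts, outcomes):
    """Murphy decomposition for binary events:
    Brier = reliability (miscalibration) - resolution (refinement) + uncertainty."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)          # group outcomes by the stated probability
    for f, o in zip(forecasts, outcomes):
        bins[f].append(o)

    reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2 for f, os in bins.items()) / n
    resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return brier, reliability, resolution, uncertainty

# Toy example: always forecasting the base rate (10%) on ten lottery-like events.
# Perfectly calibrated (reliability 0) but zero resolution -- "correct but not interesting".
forecasts = [0.1] * 10
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(brier_decomposition(forecasts, outcomes))  # (0.09, 0.0, 0.0, 0.09)
```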
Interesting, thanks, but not exactly what I looked for. As an example, take a simplified lottery: 1 number is drawn out of 10. I can predict “number X will have a probability of 10%” 100 times in a row—this is correct, and will give a good score in all scoring rules. However, those predictions are not interesting.
If I make 100 predictions “a meteorite will hit position X tomorrow (10% confidence)” and 10% of them are correct, those predictions are very interesting—you would expect that I have some additional knowledge (for example, observed an approaching asteroid).
The difference between the examples is the quality of the predictions: Everybody can get correct (unbiased) 10%-predictions for the lottery, but getting enough evidence to make correct 10%-probabilities for asteroid impacts is hard—most predictions for those positions will be way lower.
Help me understand what you’re describing? Below is a stab at working out the math (I’m horrible at math, I have to laboriously work things out with a bc-like program, but I’m more confident in my grasp of the concepts).
The salient feature of your meteorite predictions is location. We can score these forecasts exactly as GJP scores multiple-choice forecasts, as long as they’re well-specified. Let’s refine “hit position X” to “within 10 miles of X”. That translates to roughly a one in a million chance of calling the location correctly (the surface area of the Earth divided by the area of a 10-mile-radius circle is about 10 to the 6). We can make a similar calculation for how often a meteorite hits at all; it comes out to roughly one per day on average, so we can simplify and assume exactly one hits every day.
So a forecast that “a meteorite will hit location X tomorrow at 10% confidence” is equivalent to dividing Earth into one million cells, each cell being one possible outcome in a multiple-outcome forecast, and putting 10% probability mass into one cell. Let’s say you distribute the remaining probability evenly among the 999,999 remaining cells. We can now compute your Brier loss function, the sum of squared errors.
Either the meteorite hits X, and your score is .81 (the penalty for predicting an event at 10% confidence that turns out to happen), plus a negligible epsilon-squared contribution from each of the other cells. Or the meteorite hits a different cell, and your Brier score is roughly 1.01: close to 1 for hitting a cell that you had predicted would be hit at a probability close to 0, plus .01 for failing to hit X, plus the negligible contributions from the remaining cells.
So, over 100 such events, the expected value of your score ranges from 81 if you have laser-like accuracy, to 101 if you’re just guessing at random. Intermediate values reflect intermediate accuracies. The range of scores is fairly narrow, because your probability mass isn’t very concentrated—only a 10% bump on the “jackpot” cell, the rest spread around the surface of the earth.
If any of the above is wrong (math-wise) or stupid, or misrepresents your model, I’d appreciate knowing. :)
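Here is the arithmetic above as a small Python sketch, so the numbers can be double-checked; the million-cell grid and the uniform spreading of the remaining 90% are the simplifying assumptions stated above.

```python
# Brier score for a million-cell forecast: 10% on cell X, the remaining
# 90% spread uniformly over the other 999,999 cells; exactly one cell is hit.

cells = 10 ** 6
p_x = 0.10
eps = (1 - p_x) / (cells - 1)   # probability assigned to each other cell

# Case 1: the meteorite hits X.
score_hit_x = (p_x - 1) ** 2 + (cells - 1) * eps ** 2
# Case 2: it hits some other cell.
score_hit_other = (p_x - 0) ** 2 + (eps - 1) ** 2 + (cells - 2) * eps ** 2

print(round(score_hit_x, 4), round(score_hit_other, 4))   # ~0.81 and ~1.01
print(100 * score_hit_x, 100 * score_hit_other)           # ~81 vs ~101 over 100 events
```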
To calculate the Brier score, you used *your* assumption that meteorites have a 1 in a million chance to hit a specific area. What about events without a natural way to get those assumptions?
Let’s use another example:
Assume that I predict that neither Obama nor Romney will be elected with 95% confidence. If that prediction becomes true, it is amazing and indicates a high predictive power (especially if I make multiple similar predictions and most of them become true).
Assume that I predict that either Obama or Romney will be elected with 95% confidence. If that prediction becomes true, it is not surprising.
Where is the difference? The second event is expected by others. How can we quantify “difference to expectations of others” and include it in the score? Maybe with an additional weight—weight each prediction with the difference from the expectations of others (as mean of the log ratio or something like that).
If the objective is to get better scores than others, then that helps, though it’s not clear to me that it does so in any consistent way (in particular, the strategy to maximize your score and the strategy to get the best score with the highest probability may well be different, and one of them might involve mis-reporting your own degree of belief).
You’re getting this from the “refinement” part of the calibration/refinement decomposition of the Brier score. Over time, your score will end up much higher than others’ if you have better refinement (e.g. from “inside information”, or from a superior methodology), even if everyone is identically (perfectly) calibrated.
This is the difference between a weather forecast derived from looking at a climate model, e.g. I assign 68% probability to the proposition that the temperature today in your city is within one standard deviation of its average October temperature, and one derived from looking out the window.
ETA: what you say about my using an assumption is not correct—I’ve only been making the forecast well-specified, such that the way you said you allocated your probability mass would give us a proper loss function, and simplifying the calculation by using a uniform distribution for the rest of your 90%. You can compute the loss function for any allocation of probability among outcomes that you care to name—the math might become more complicated, is all. I’m not making any assumptions as to the probability distribution of the actual events. The math doesn’t, either. It’s quite general.
I can still make 100000 lottery predictions, and get a good score. I look for a system which you cannot trick in that way. Ok, for each prediction, you can subtract the average score from your score. That should work. Assuming that all other predictions are rational, too, you get an expectation of 0 difference in the lottery predictions.
I think “impact here (10% confidence), no impact at that place (90% confidence)” is quite specific. It is a binary event.
[comment deleted]
On 16 questions currently scored, I’ve done better than the team average on 15. Two of the questions where I outperformed the team by a large margin were the Syrian refugee question, basically a matter of extrapolating a trend and predicting the status quo with respect to the conflict, and the Kismayo question, basically a matter of knowing my loss function. I had zero home ground advantage on either question.
Some of my wins resulted purely from general knowledge rather than from having any idea of the specifics of the situation: for instance, in mid-August I answered 40% to “Will Kuwait commence parliamentary elections before 1 October 2012?”, reflecting only status quo bias in that a date for the election had not yet been announced. However, early in September I downgraded this to 10%, because I know that as a rule of thumb it takes at least one month to convene an election. The week before, I went to 5% (and even that was quite a generous margin), while several of my teammates made predictions, after I published mine, of 15%, 19%, 33% and even 51% (!).
This felt like entering a poker tournament where people routinely raise pre-flop with a “beer hand” (seven and two—when you play this, either you’ve had too many beers, or it’s time you have one). Elections aren’t a mysterious thing; we participate in one every so often. You need to print ballots, set up voting booths, audit voter registration records, give people time to campaign on national media, all very mundane stuff. Even dictatorships make at least a half-hearted attempt at this, and it’s not as though anyone in Kuwait had any particular interest in meeting an October deadline; that was strictly an internal-to-GJP deadline.
So while this question had to do, ostensibly, with something happening in Kuwait, all you needed to make a call at least as good as mine was background knowledge about extremely mundane, practical matters. If I had any hint that you wouldn’t factor those in when making a close-to-home prediction, I wouldn’t trust you to organize so much as a PTA president election. Maybe a birthday party.
I wouldn’t go so far as to claim that “skill at forecasting macro trends transfers to microeconomic moves”.
But I’d take a stand on “demonstrated incompetence at the most elementary moves of forecasting, in a macro domain, is a strong indicator of likely incompetence at forecasting in any micro domain, other than the few narrow ones you might happen to be good at”.
How does GJP score predictions that change over time?
They compute your Brier score for each day that the question is open, according to what your forecast is on that day, and average over all days.
Suppose you start at 80%, six days pass, you switch to 40% three days before the deadline, and the event doesn’t happen, your score is (6*(0.8)^2+3*(0.4)^2)/9 = .48, which is a so-so score—but an improvement over the .64 that you’d get if you didn’t change your mind.
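A minimal sketch of that averaging (my illustration of the scheme as described, using the squared error on the “yes” probability; not GJP’s actual code):

```python
def time_averaged_brier(daily_forecasts, outcome):
    """Average the daily squared error over every day the question was open.
    daily_forecasts: probability assigned to 'yes' on each open day.
    outcome: 1 if the event happened, 0 if it didn't."""
    return sum((p - outcome) ** 2 for p in daily_forecasts) / len(daily_forecasts)

# Six days at 80%, then three days at 40%, and the event doesn't happen:
print(time_averaged_brier([0.8] * 6 + [0.4] * 3, outcome=0))  # 0.48
# Never updating from 80% over the same nine days:
print(time_averaged_brier([0.8] * 9, outcome=0))              # 0.64
```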
Yeah. Answering “1%” that “there will be a major earthquake in California during $time_period” a month before the end of $time_period kind-of felt like cheating to me.
In a nutshell, no.
Consider some practicalities. An advantage of forecasting world events is that it permits participation by a much broader population. I could run a forecasting contest on when the city of Paris will complete a construction project on the banks of the Seine, which is “my backyard” compared to Syria. Nobody would bother.
The point is to find out something about how you think, and comparing yourself to other people will yield information that you can’t get by sitting on your own, minding your own business. (On the other hand, there’s nothing preventing you from doing both.)
Finally, I’m not aware that people routinely make explicit, quantified forecasts even about their own business. Rather, it seems plain that most of the time, we think “probable” the things we would like to happen, and as a result fail to plan for contingencies we don’t like to think about.
To go from not forecasting at all to making forecasts in any domain is progress. It would certainly be useful to many to make forecasts about their daily lives (which I now do, a little bit). But let’s imagine this were taught in schools as a life skill: I suspect you would have people practicing precisely on events that they have no control over and that allow interpersonal comparison.
[comment deleted]
Thanks for inspiring the following bit of staircase wit, which might make it into some further version of the post: Tool 0 of forecasting is “forecast”. If you don’t do it, you can’t become better at it.
Gwern prefers PredictionBook—where you can, if you want, record private predictions—to GJP. For my part I prefer GJP, precisely because they ask me questions that might not occur to me otherwise, and the competitive aspect suits me. You could also do just fine by recording your own forecasts in a spreadsheet or a notepad, on whatever topics you like.
Is accuracy what you’re after? Which component of accuracy? I can get perfect calibration by throwing a thousand coin flips and predicting 50% all the time. What I seek is debiasing, making the most of whatever information is available without overweighting any part of it (including my own hunches, feelings and fears); and I’m most vulnerable to bias when there are many moving parts, many of which are hidden from me or unknown to me.
No, tool 0 is more like ‘mind your base rates’ or ‘don’t predict what you would like, predict what you really think would happen’. I dunno where you’re getting Tool 0 as ‘Mind your own business’ from; certainly I or Morendil didn’t write it.
I dunno, did you look into any research?
Per the huge amount of material on Outside View vs Inside View and performance of SPRs already discussed on LW, I would guess quite the opposite.
Do you know that, or are you just guessing, as you said you were before?
Or was your entire comment just an excuse to do an awful lot of rhetorical questions?
I think he’s saying it’s a waste of effort to predict what will happen in the world if you can’t exert any control over it. That sort of makes sense, because at first it seems useless to worry about those sorts of things. But it’s important to understand the consequences of the actions of other people so that you can react to them, and he didn’t take that into account. So, for example, a French citizen might be interested in knowing who the next US president will be because they’re curious about the implications that has for their business contacts in America.
Buying insurance is a decision that relates to things that may or may not happen, that you have little or no control over: illness, accidents, burglaries, etc. Being able to make informed predictions as to the likelihood of these things is a valuable life skill.
[comment deleted]
Are they different in kind? I’m uncertain.
The distinction seems arbitrary at first glance both because what’s personal for one person is impersonal for another and because causality is causality no matter where it occurs. However, if you meant that they’re different in kind in a more epistemic sense, that they’re different in kind from any particular perspective because of the way that they go through your reasoning process, then that seems plausible.
The question is then what types of data work best and why. You’re likely to have less total data in Near Mode, but you’ll be working with things that are important to you personally, which it seems like evolution would favor (individual selection).
On the other hand, evolution seems to make biases more frequent and more intense when they’re about personal matters. But evolution wouldn’t do this if it hadn’t worked often in the past, so perhaps those biases are good? I think that this is fairly plausible, but I also think that these biases would only be “good” in a reproductive sense and not in the sense of epistemic accuracy. They would move you towards maximizing your social status, not the quality of your predictions. It’s unlikely those would overlap.
How likely is it that people are good at evaluating the credibility of the ideas of specific people? I would say that most people are probably bad at this when seeing others face to face because of things like the halo effect and because credibility is rather easy to fake. I would also say that people are rather good at this otherwise. Are these evaluations still accurate when they interact with social motivations, like rivalry? I would say that they probably end up even worse under those circumstances.
So, I believe that personal events and impersonal events should be considered differently because I believe trying to evaluate the accuracy of the views of specific experts would improve the accuracy of your predictions if and only if you avoided personal familiarity or intimacy with those experts, and that otherwise it would damage your accuracy.
I failed to consider the implications of social motivation for professional accuracy, and a bunch of other stuff.
[comment deleted]
I’m sorry, either I’m misunderstanding you or you misunderstood my comment. I don’t understand what you mean by the phrase “choosing types of data”. I think that although we’re better at dealing with some types of data, that doesn’t mean we should focus exclusively on that type of data. I think that becoming a skilled general forecaster is a very useful thing and something that should be pursued.
What sort of questions did you have in mind?
[comment deleted]
Well, I can give you an argument, though you’ll have to evaluate the strength of it yourself.
Forecasting, in a Bayesian sense, is a matter of repeated application of Bayes’ theorem. In short, I make an observation (B) and then ask—what are the chances of prediction (A), given observation (B)? (‘Prediction’ may be the wrong word, given that I may be predicting something unseen that has already happened). Bayes’ theorem states that this is equal to the following:
The chances of observation B, given prediction A, multiplied by the prior probability of prediction A, divided by the prior probability of observation B
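In symbols, that sentence is just Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$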
Now, the result of the equation is only as good as the figures you feed into it. In your example of the freelancer, the new freelancer (just starting out) has poor estimates of the probabilities involved, though he can improve these estimates by asking a more experienced freelancer for help. The experienced freelancer, on the other hand, has got a better grasp of the input probabilities, and thus gets a more accurate output probability. The equation works for both large-scale, macro events and small-scale, personal events—the difference is, once again, a matter of the input numbers. For a macro event, you’ll have more people looking at, commenting on, discussing the situation; reading the words of others will improve your estimates of the probabilities involved, and putting better numbers in will get you better numbers out. Also, with macro events, you’re more likely to have more time to sit down with pencil and paper and work it out.
However, predicting macro events will help you to better practice the equation, and thus learn how to apply it more quickly and easily to micro events. Sufficient practice will also help you to more quickly and accurately estimate the result for a given set of inputs. So while it is true that the skill of guessing the input probabilities for macro events may have little to do with the skill of guessing the input probabilities for micro events (though there is some correlation there—the skill of accurately putting figures to a probability may transfer to some degree), the skill of practicing the application of the equation is transferable between the two realms.
To continue his line of argument, evolution has gifted us with social instincts superior to our best attempts at rationality. Allowing bias to have its way with us will make us better off socially than we could be otherwise, provided that certain other conditions are met. Forcing flawed attempts at rationality into our behavior may well just corrupt the success of our instincts.
I think I would sort of believe that, with some caveats. For individuals who are good-looking and good conversationalists and who value social success over anything else, it probably makes sense to avoid rationality training, as there’s only a chance it can hurt you. So I agree with him in cases like that. But for other individuals, such as those who are unattractive or who are bad conversationalists or who value things other than social success, rationality might be the best strategy, because there’s only a chance it can help you. Learning about biases can hurt you; similarly, making your ability to predict things more rigorous can do the same.
I’m uncertain as to how much I believe that, but I believe the general idea is at least non-obviously false, and that the idea is ultimately more true than false. I believe most people would not do well if they suddenly started working on improving their rationality and predictive accuracy.
Well, to start with: what evidence do you have at the moment about how well calibrated you are?
The methods that Morendil is discussing here are pretty general forecasting techniques, not limited to a particular domain. Some skills are worth developing, even if you’re practicing them in domains you don’t care about.
Personal example: I was a bio major in college, and I found it very difficult to care about organic chemistry, because we were mostly learning about chemicals that had no biological relevance. Consequently, I didn’t learn it very well, which came back to bite me pretty hard when I took biochemistry.
Are there self consistent ways for people to believe that trickle-down economic policies should be encouraged but also to believe that small businesses are the primary drivers of growth? Many people seem to believe both and I do not understand why.
What’s the connection with the OP? I’m not seeing it...
Oh. Err.
I meant to comment with this in open thread. My mistake.
Sure. Just to pick an obvious example, I might believe that trickle-down economic policies will benefit a subset of the population, and also believe that small businesses are primary growth drivers for the population as a whole, and believe that trickle-down economic policies should be encouraged because I consider benefits to the former subset more valuable than growth.