So the thing I’m wondering here is what makes this “amplification” in more than a trivial sense. Let me think out loud for a bit. Warning: very rambly.
Let’s say you’re a competent researcher and you want to find out the answers to 100 questions, which you don’t have time to investigate yourself. The obvious strategy here is to hire 10 people, get them to investigate 10 questions each, and then pay them based on how valuable you think their research was. Or, perhaps you don’t even need to assign them questions—perhaps they can pick their own questions, and you can factor in how neglected each question was as part of the value-of-research calculation.
This is the standard, “freeform” approach; it’s “amplification” in the same sense that having employees is always amplification. What does the forecasting approach change?
It gives one specific mechanism for how you (the boss) evaluate the quality of research (by comparison with your own deep dive), and rules out all the others. This has the advantage of simplicity and transparency, but has the disadvantage that you can’t directly give rewards for other criteria like “how well is this explained”. You also can’t reward research on topics that you don’t do deep dives on.
This mainly seems valuable if you don’t trust your own ability to evaluate research in an unbiased way. But evaluating research is usually much easier than doing research! In particular, doing research involves evaluating a whole bunch of previous literature.
Further, if one of your subordinates thinks you’re systematically biased, then the forecasting approach doesn’t give them a mechanism to get rewarded for telling you that. Whereas in the freeform approach to evaluating the quality of research, you can take that into account in your value calculation.
It gives one specific mechanism for how you aggregate all the research you receive. But that doesn’t matter very much, since you’re not bound to that—you can do whatever you like with the research after you’ve received it. And in the freeform approach, you’re also able to ask people to produce probability distributions if you think that’ll be useful for you to aggregate their research.
It might save you time? But I don’t think that’s true in general. Sure, if you use the strategy of reading everyone’s research then grading it, that might take a long time. But since the forecasting approach is highly stochastic (people only get rewards for questions you randomly choose to do a deep dive on) you can be a little bit stochastic in other ways to save time. And presumably there are lots of other grading strategies you could use if you wanted.
Okay, let’s take another tack. What makes prediction markets work?
1. Anyone with relevant information can use that information to make money, if the market is wrong.
2. People can see the current market value.
3. They don’t have to reveal their information to make money.
4. They know that there’s no bias in the evaluation—if their information is good, it’s graded by reality, not by some gatekeeper.
5. They don’t actually have to get the whole question right—they can just predict a short-term market movement (“this stock is currently undervalued”) and then make money off that.
This forecasting setup also features 1 and 2. Whether or not it features 3 depends on whether you (the boss) manage to find that information by yourself in the deep dive. And 4 also depends on that. I don’t know whether 5 holds, but I also don’t know whether it’s important.
So, for the sort of questions we want to ask, is there significant private or hard-to-communicate information?
If yes, then people will worry that you won’t find it during your deep dive.
If no, then you likely don’t have any advantage over others who are betting.
If it’s in the sweet spot where it’s private but the investigator would find it during their deep dive, then people with that private information have the right incentives.
If either of the first two options holds, then the forecasting approach might still have an advantage over a freeform approach, because people can see the current best guess when they make their own predictions. Is that visibility important, for the wisdom of crowds to work—or does it work even if everyone submits their probability distributions independently? I don’t know—that seems like a crucial question.
Anyway, to summarise, I think it’s worth comparing this more explicitly to the most straightforward alternative, which is “ask people to send you information and probability distributions, then use your intuition or expertise or whatever other criteria you like to calculate how valuable their submission is, then send them a proportional amount of money.”
IMO the term “amplification” fits if the scheme results in 1) a clear efficiency gain and 2) scalability. This looks like (delivering equivalent results at a lower cost OR providing better results for an equivalent cost (cost == $$ & time)), AND (~O(n) scaling costs).
For example, if there were a group of people who could emulate [Researcher’s] fact checking of 100 claims but do it at 10x speed, then that’s an efficiency gain, since we’re doing the same work in less time. If we pump the number up to 1000 claims and the fact checkers could still do it at 10x speed without additional overhead complexity, then it’s also scalable. Contrast that with the standard method of hiring additional junior researchers to do the fact checking—I expect it to not be as scalable (“huh, we’ve got all these employees, now I guess we need an HR department and perf reviews and...”).
It does seem like a fuzzy distinction to me, and I am mildly concerned about overloading a term that already has an association w/ IDA.
Good points! This covers a lot of ground that we’ve been thinking about.
So the thing I’m wondering here is what makes this “amplification” in more than a trivial sense.
To be honest, I’m really not sure what word is best here. “Amplification” is the word we used for this post. I’ve also thought about calling this sort of thing “Proliferation” after “Instillation” here and have previously referred to this method as Prediction-Augmented Evaluation Systems. I agree that the employee case could also be considered a kind of amplification according to this terminology. If you have preferences or other ideas for names for this, I’d be eager to hear!
but has the disadvantage that you can’t directly give rewards for other criteria like “how well is this explained”. You also can’t reward research on topics that you don’t do deep dives on.
Very true, at least at this stage of development of Foretold. I’ve written some more thinking on this here. Traditional prediction markets don’t do a good job incentivizing participants to share descriptions and research, but ideally future systems would. There are ways we are working on to improve this with Foretold. A very simple setup would be one that gives people points/money for writing comments that are upvoted by important predictors.
I think it’s worth comparing this more explicitly to the most straightforward alternative, which is “ask people to send you information and probability distributions, then use your intuition or expertise or whatever other criteria you like to calculate how valuable their submission is, then send them a proportional amount of money.”
This isn’t incredibly far from what we’re going for, but I think the additional presence of a visible aggregate and the ability for forecasters to learn / compete with each other are going to be useful in expectation. I also would want this to be a very systematized process, because then there is a lot of optimization that could arguably be done. The big downside of forecasting systems is that they are less flexible than free-form solutions, but one big upside is that it may be possible to optimize them in different ways. For instance, eventually there could be significant data science pipelines, and lots of statistics for accuracy and calibration, that would be difficult to attain in free form setups. I think in the short term online forecasting setups will be relatively expensive, but it’s possible that with some work they could become significantly more effective for certain types of problems.
I’d definitely agree that good crowdsourced forecasting questions need to be in some sort of sweet spot of “difficult enough to make external-forecasting useful, but open/transparent enough to make external-forecasting possible.”
Actually, it seems the key difference between this and prediction markets is that this has no downside risk: you can’t lose money for bad predictions. So you could exploit it by only making extreme predictions, which would make a lot of money sometimes, without losing money in the other cases. Or by making fake accounts to drag the average down.
It might interest you that there’s quite a nice isomorphism between prediction markets and ordinary forecasting tournaments.
Suppose you have some proper scoring rule $S(p_i)$ for a prediction $p$ on outcome $i$. For example, in our experiment we used $S(p_i) = \ln(p_i)$. Now suppose the $t$-th prediction is paid the difference between its score and the score of the previous participant: $S(p_{i,t}) - S(p_{i,t-1})$. Then you basically have a prediction market!
To make this isomorphism work, the prediction market must be run by an automated market maker which buys and sells at certain prices which are predetermined by a particular formula.
To see that, let $C(x_i)$ be the total cost of buying $x_i$ shares in some possibility $i$ (e.g. Yes or No). If the event happens, your payoff will be $x_i - C(x_i)$ (we’re assuming that the shares just pay $1 if the event happens and $0 otherwise). It follows that the cost of buying further shares—the market price—is $C'(x_i)$.
We require that the market prices can be interpreted as probabilities. This means that the prices for all MECE outcomes must sum to 1, i.e. $\sum_{i \in \Omega} C'(x_i) = 1$.
Now we set your profit from buying $x_i$ shares in the prediction market to be equal to your payout in the forecasting tournament, $x_i - C(x_i) = S(p_i)$. Finally, we solve for $C$, which specifies how the automated market maker must make its trades. Different scoring rules will give you different $C$. For example, a logarithmic scoring rule will give: $C(\vec{x}) = b \ln\left(\sum_{i \in \Omega} e^{x_i / b}\right)$.
For more details, see page 54 in this paper, Section 5.3, “Cost functions and Market Scoring Rules”.
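To make the isomorphism concrete, here is a minimal numerical sketch (my own illustration, not Foretold’s implementation; the function names and the choice of $b$ are arbitrary). It checks that, for every outcome, the profit from moving an LMSR market maker with cost function $C(\vec{x}) = b \ln\left(\sum_{i \in \Omega} e^{x_i / b}\right)$ from distribution $q$ to distribution $p$ equals the scaled log-score difference $b(\ln p_i - \ln q_i)$:

```python
import numpy as np

def lmsr_cost(x, b=1.0):
    """LMSR cost function C(x) = b * ln(sum_i exp(x_i / b))."""
    return b * np.log(np.sum(np.exp(np.asarray(x) / b)))

def lmsr_prices(x, b=1.0):
    """Instantaneous prices C'(x_i); they sum to 1, so they read as probabilities."""
    e = np.exp(np.asarray(x) / b)
    return e / e.sum()

b = 1.0
q = np.array([0.5, 0.3, 0.2])   # previous participant's distribution (current market prices)
p = np.array([0.7, 0.2, 0.1])   # the new trader's distribution

# Share holdings that make the market quote q and p respectively
# (any x with x_i = b*ln(p_i) + const gives prices p).
x_old = b * np.log(q)
x_new = b * np.log(p)
assert np.allclose(lmsr_prices(x_new, b), p)        # the new holdings move the prices to p

dx = x_new - x_old                                  # shares the trader buys in each outcome
cost = lmsr_cost(x_new, b) - lmsr_cost(x_old, b)    # what the market maker charges for them

for i in range(len(p)):
    market_profit = dx[i] * 1.0 - cost                # each share of outcome i pays $1 if i occurs
    score_payout = b * (np.log(p[i]) - np.log(q[i]))  # S(p_i) - S(q_i) with log scoring
    assert np.isclose(market_profit, score_payout)
    print(f"outcome {i}: market profit {market_profit:+.3f} == log-score payout {score_payout:+.3f}")
```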
This is why proper scoring rules are important. As long as you are using proper scoring rules, and proper combinations of those scoring rules, people will be incentivized to predict according to their own beliefs. If we assume that users can’t make fake accounts, and are paid in proportion to their performance according to proper scoring rules, then they shouldn’t be able to gain expected earnings by providing overconfident answers.
The log-scoring function we use is a proper scoring rule. The potential winnings if you do a great job are very capped due to this scoring rule.
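A toy calculation of that claim (illustrative numbers only, nothing from the experiment itself): under the log rule, the report that maximizes your expected score is your actual belief, so an extreme overconfident report loses in expectation.

```python
import numpy as np

def expected_log_score(report, belief):
    """Expected log score sum_i belief_i * ln(report_i): what you expect to earn
    (up to an affine payment transformation) if `belief` is your true distribution."""
    return float(np.sum(np.asarray(belief) * np.log(np.asarray(report))))

belief = np.array([0.6, 0.3, 0.1])             # what the forecaster actually thinks
overconfident = np.array([0.98, 0.01, 0.01])   # extreme report hoping for a big payout

print("honest report :", round(expected_log_score(belief, belief), 3))         # ~ -0.898
print("overconfident :", round(expected_log_score(overconfident, belief), 3))  # ~ -1.854
# The honest report has the higher expected score; exaggeration only pays off
# if you can dodge the penalty in the cases where you turn out to be wrong.
```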
In this specific experiment we had some trust in the participants and no obviously fake accounts. If we scaled this, fake accounts would be an issue, but there are ways around it. I’d also imagine that a more robust system would have users begin with little “trust”, which they would then build up by providing good forecasts. They would only begin being paid once they reached some threshold of trust; but within that level the proper scoring rules should generally create reasonable incentives.
I have four concerns even given that you’re using a proper scoring rule, which relate to the link between that scoring rule and actually giving people money. I’m not particularly well-informed on this though, so could be totally wrong.
1. To implement some proper scoring rules, you need the ability to confiscate money from people who predict badly. Even when the score always has the same sign (as with log-scoring, or a quadratic rule shifted by a constant), if you never confiscate money for bad predictions, then you’re basically just giving people money for signing up, which makes having an open platform tricky.
2. Even if you restrict signups, you get an analogous problem within the fixed population that’s already signed up: the incentives will be skewed when it comes to choosing which questions to answer. In particular, if people expect to get positive amounts of money for answering randomly, they’ll do so even when they have no relevant information, adding a lot of noise (see the sketch after this list).
3. If a scoring rule is “very capped”, as the log-scoring function is, then the expected reward from answering randomly may be very close to the expected reward from putting in a lot of effort, and so people would be incentivised to answer randomly and spend their time on other things.
4. Relatedly, people’s utilities aren’t linear in money, so the score function might not remain a proper one taking that into account. But I don’t think this would be a big effect on the scales this is likely to operate on.
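To put rough numbers on points 1-3, here is a toy simulation; the clipping floor, the flat shift, and the Dirichlet question distribution are all assumptions of mine rather than the experiment’s actual payment scheme. When the clipped log score is simply shifted so nobody ever pays money back, random guessing earns nearly as much per question as informed forecasting:

```python
import numpy as np

rng = np.random.default_rng(0)

def payout(report, outcome, floor=-5.0, shift=5.0):
    """Hypothetical payment: log score clipped at `floor`, shifted by `shift`
    so that every prediction earns a non-negative amount (no confiscation)."""
    return max(np.log(report[outcome]), floor) + shift

n_outcomes, n_questions = 5, 2000
question_dists = rng.dirichlet(np.ones(n_outcomes), size=n_questions)

random_total = informed_total = 0.0
uniform = np.full(n_outcomes, 1 / n_outcomes)
for q in question_dists:
    outcome = rng.choice(n_outcomes, p=q)
    random_total += payout(uniform, outcome)   # no information, guess uniformly
    informed_total += payout(q, outcome)       # reports the true distribution

print(f"random guessing : {random_total / n_questions:.2f} per question")
print(f"informed report : {informed_total / n_questions:.2f} per question")
# Both are comfortably positive and not far apart, so answering everything at
# random is an attractive strategy unless bad predictions can cost something.
```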
The fact that we use a “proper scoring rule” definitely doesn’t mean that the entire system, including the participants’ true utility functions, is really “proper”. There is really a fair bit of impropriety. For instance, people may also care about their online reputation, and that won’t be captured in the proper scoring rule. The proper scoring rule helps make sure that one specific aspect of the system is “proper” according to a simplified model. This is definitely subideal, but I think it’s still good enough for a lot of things. I’m not sure what type of system would be “perfectly proper”.
Prediction markets have their own disadvantages, as participants don’t behave as perfectly rational agents there either. So I won’t claim that the system is “perfectly aligned”, but I will suggest that it seems “decently aligned” compared to other alternatives, with the ability to improve as we (or others with other systems) add further complexity.
If you don’t confiscate money for bad predictions, then you’re basically just giving money to people for signing up, which makes having an open platform tricky.
What was done in this case was that participants were paid a fixed fee for participating, plus a larger “bonus” that was paid in proportion to how they did on said rule. This works in experimental settings where we can filter the participants. It would definitely be more work to make the system totally openly available, especially as the prizes increase in value, much for the reason you describe. We’re working to try to figure out solutions that could hold up (somewhat) in these circumstances, but it is tricky, for reasons you suggest and for others.
I’d also point out that having a nice scoring system is one challenge out of many challenges. Having nice probability distribution viewers and editors is difficult. Writing good questions and organizing them, and having software that does this well, is also difficult. This is something that @jacobjacob has been spending a decent amount of time thinking about after this experiment, but I’ve personally been focusing on other aspects.
At least in this experiment, the scoring system didn’t seem like a big bottleneck. The participants who won the most money were generally those who seemed to have given thoughtful and useful probability distributions. Things are much easier when you have an audience that is generally acting in good faith and that can be excluded from future rounds if it seems appropriate.
Cool, thanks for those clarifications :) In case it didn’t come through from the previous comments, I wanted to make clear that this seems like exciting work and I’m looking forward to hearing how follow-ups go.
Thanks! I really do appreciate the thoughts & feedback in general, and am quite happy to answer questions. There’s a whole lot we haven’t written up yet, and it’s much easier for me to reply to things than lay everything out.
Another point: prediction markets allow you to bet more if you’re more confident the market is off. This doesn’t, except by betting that the market is further off. Which is different. But idk if that matters very much; you could probably recreate that dynamic by letting people weight their own predictions.
This is definitely a feature we’re considering adding in some form (likely something like weight/leverage). The current scoring system we are using is quite simple; I expect it to get more sophisticated.
However, one big downside is that sophistication would come with complexity, which could be a lot for some users.
I’ll try to paraphrase you (as well as extrapolating a bit) to see if I get what you’re saying:
Say you want some research done. The most straightforward way to do so is to just hire a researcher. This “freeform” approach affords a lot of flexibility in how you delegate, evaluate, communicate, reward and aggregate the research. You can build up subtle, shared intuitions with your researchers, and invest a lot of effort in your ability to communicate nuanced and difficult instructions. You can also pick highly independent researchers who are able to make many decisions for themselves in terms of what to research, and how to research it.
By using “amplification” schemes and other mechanisms, you’re placing significant restrictions on your ability to do all of those things. Hence you better get some great returns to compensate.
But looking through various ways you might get these benefits, they all seem at best… fine.
Hence the worry is that despite all the bells-and-whistles, there’s actually no magic happening. This is just like hiring a researcher, but a bit worse. This is only “amplification” in a trivial sense.
As a corollary, if your research needs seem to be met by a handful of in-house researchers, this method wouldn’t be very helpful to you.
1) Does this capture your views?
2) I’m curious what you think of the sections: “Mitigating capacity bottlenecks” and “A way for intellectual talent to build and demonstrate their skills”?
In particular, I didn’t feel like your comment engaged with A) the scalability of the approach, compared to the freeform approach, and B) that it might be used as a “game” for young researchers to build skills and reputation, which seems way harder to do with the freeform approach.