The problem is that Metaculus points reward some non-obvious combination of making good predictions and being active on the platform. I only care about the first of those, so the current points system doesn’t help me much.
I can’t look at a user’s points score and figure out how much I should trust their predictions. Or possibly I could, but only by diving into the small print of how scoring works.
I say that as somebody who uses Metaculus and believes it has potential. The points system is definitely a weak point.
There’s no single metric or score that is going to capture everything. Metaculus points, as the central platform metric, were devised to reward (as danohu says) both participation and accuracy. Both are quite important. It’s easy to get a terrific Brier score by cherry-picking questions. (Pick 100 questions that you think have a 1% or 99% probability. You’ll get a few wrong, but your mean Brier score will still only be about (number of misses) × 0.01. The log score is less susceptible to this.) You can also get a fair number of points for just predicting the community prediction, but you won’t get that many, because as a question’s point value increases (which it does with the number of predictions), more and more of the score is relative rather than absolute.
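To make the cherry-picking arithmetic concrete, here is a minimal Python sketch (the 100-question count and the three misses are illustrative assumptions, not Metaculus data) comparing the mean Brier and mean log scores of such a cherry-picker:

```python
import math

def brier(p, outcome):
    """Brier score for probability p assigned to the event, given outcome 0 or 1 (lower is better)."""
    return (p - outcome) ** 2

def log_score(p, outcome):
    """Log score: log of the probability assigned to what actually happened (higher is better)."""
    return math.log(p if outcome == 1 else 1 - p)

# Illustrative assumption: 100 cherry-picked questions all predicted at 99%,
# of which 3 ("a few") resolve the other way.
n, p, misses = 100, 0.99, 3
outcomes = [1] * (n - misses) + [0] * misses

mean_brier = sum(brier(p, o) for o in outcomes) / n
mean_log = sum(log_score(p, o) for o in outcomes) / n

print(f"mean Brier score: {mean_brier:.4f}")  # ~0.03, i.e. roughly misses * 0.01
print(f"mean log score:   {mean_log:.4f}")    # each miss alone costs log(0.01) ~= -4.6
```

A rare miss barely moves the Brier average, but it costs about -4.6 on the log score, which is why the log score is much harder to game by only taking near-certainties.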
If you want to know how good a predictor is, points are actually pretty useful IMO, because someone who is near the top of the leaderboard is both accurate and highly experienced. Nonetheless, more ways of comparing people to each other would be useful. You can look at someone’s track record in detail, but we’re also planning to roll out more ways to compare people with each other. None of these will be perfect; there’s simply no single number that will tell you everything you might want. Why would there be?
Someone who is near the top of the leaderboard is both accurate and highly experienced
I think this unfortunately isn’t true right now, and just copying the community prediction would place very highly (I’m guessing that, if made as soon as the community prediction appeared and updated every day, it would easily reach the top 3 (edit: top 10)). See my comment below for more details.
You can look at someone’s track record in detail, but we’re also planning to roll out more ways to compare people with each other.
I’m very glad to hear this. I really enjoy Metaculus, but my main gripe with it has always been (as others have pointed out) the lack of a way to distinguish between quality and quantity. I’m looking forward to a more comprehensive selection of metrics to help with this!
I actually think it’s worth tracking: ConsensusBot should be a user that continuously updates to the public consensus prediction as computed in its absence, and its entries shouldn’t be counted as predictions, so we can see what it looks like and how it scores.
And there should be a contest to see if anyone can use a rule that looks only at predictions, and does better than ConsensusBot (e.g. by deciding whose predictions to care about more vs. less, or accounting for systematic bias, etc).
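To be concrete about what I have in mind, here is a minimal sketch (the bot, its update loop, and the community-prediction feed are all hypothetical; none of the names correspond to an actual Metaculus API):

```python
from dataclasses import dataclass, field

@dataclass
class ConsensusBot:
    """Hypothetical bot that mirrors the community prediction without being counted in it."""
    history: dict = field(default_factory=dict)  # question_id -> list of copied predictions

    def update(self, question_id: int, community_prediction: float) -> None:
        # Copy the current community prediction, as computed *without* the bot itself.
        self.history.setdefault(question_id, []).append(community_prediction)

    def final_prediction(self, question_id: int) -> float:
        return self.history[question_id][-1]

def brier(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

# Toy run: the bot copies a drifting community prediction each "day", then the question resolves Yes.
bot = ConsensusBot()
for community in [0.40, 0.55, 0.70, 0.80]:  # illustrative daily community values
    bot.update(question_id=1, community_prediction=community)

print("final copied prediction:", bot.final_prediction(1))
print("ConsensusBot Brier score:", brier(bot.final_prediction(1), outcome=1))
```

Its running track record would then give the proposed contest a clear baseline to beat.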
You can also get a fair number of points for just predicting the community prediction, but you won’t get that many, because as a question’s point value increases (which it does with the number of predictions), more and more of the score is relative rather than absolute.
I think this is actually backwards (the value of copying the community prediction goes up as the question’s point value increases), because the relative score is the component responsible for the “positive regardless of resolution” payoffs. Explanation and worked example here: https://blog.rossry.net/metaculus/
You don’t care, but if the goal is to motivate better communal predictions, giving people the incentive to do more predicting seems to make far more sense than having it normed to sum to zero, which would mean that in expectation you only gain points when you outperform the community.
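For concreteness, one simple way to norm scores to sum to zero is to score each forecaster relative to the question’s average log score; this is just an illustration of the idea being discussed, not Metaculus’s actual rule:

```python
import math

def relative_log_scores(probs, outcome):
    """Each forecaster's log score minus the question's average log score.
    By construction these sum to zero, so you gain only by beating the crowd average."""
    logs = [math.log(p if outcome == 1 else 1 - p) for p in probs]
    avg = sum(logs) / len(logs)
    return [s - avg for s in logs]

# Toy question that resolves Yes, with three forecasters.
scores = relative_log_scores([0.9, 0.7, 0.5], outcome=1)
print([round(s, 3) for s in scores])  # positive only for forecasters above the average log score
print(round(sum(scores), 6))          # 0.0 -- zero-sum by construction
```

Under a rule like this, someone who merely matches the crowd expects zero points, which is exactly the participation incentive at issue here.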
This seems to me to be very non-obvious. Do we want more low-quality, low-effort predictions, or fewer high-quality, high-effort predictions? Do we want people to go for the exact correct probability as they see it, or to give a shove in the direction they feel strongly about? Do we want people to go around making the actual community prediction to bank free points? Who will free points motivate versus demotivate? What about the question of whom to trust, and whether others would update their models based on the predictions of those who are doing well? Etc.
If I have time, a post on the subject would be interesting. I’m curious whether there are writings detailing how it works and the reasoning behind it, or whether you’d like to talk about it in a video call or LW meetup, or both.
The scoring system incentivizes predicting your true credence (gory details here).
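Here is a quick sketch of the general property (the proper-scoring-rule idea, not the specific Metaculus point formula): with a rule like the log score, if your true credence is 70%, reporting 0.7 maximizes your expected score.

```python
import math

def expected_log_score(reported, true_credence):
    """Expected log score when the event actually happens with probability true_credence."""
    return true_credence * math.log(reported) + (1 - true_credence) * math.log(1 - reported)

true_credence = 0.7
candidates = [i / 100 for i in range(1, 100)]  # reports from 0.01 to 0.99
best = max(candidates, key=lambda r: expected_log_score(r, true_credence))
print("best report:", best)  # 0.7 -- reporting your true credence maximizes expected score
```

Nudging your report away from 0.7 in either direction only lowers the expectation.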
I think Metaculus rewarding participation is one of the reasons it has participation. Metaculus can discriminate good predictors from bad predictors because it has their track records (I agree this is not the same as discriminating good predictions from bad ones). This info is incorporated in the Metaculus prediction, which is hidden by default but can be unlocked with on-site fake currency.
I think Metaculus rewarding participation is one of the reasons it has participation.
PredictionBook also had participation while being public about people’s Brier scores. I think the main reason Metaculus has more activity is that it has good curated questions.
There’s also no reason to only have a single public metric. Being able to achieve something like Superforecaster status on the Good Judgment Project would be valuable for motivating some people.
There was a LessWrong post about this a while back that I can’t find right now, and I wrote a Twitter thread on a related topic. I’m not involved with the reasoning behind the structure of either GJP or Metaculus, so for both it’s an outside perspective. However, I was recently told there is a significant amount of ongoing internal Metaculus discussion about the scoring rule, which, I think, isn’t nearly as bad as it seemed. (But even if there is a better solution, changing the rule now would have really weird impacts on the motivation of current users, which is critical to overall forecast accuracy, and I’m not sure it would be worthwhile for them.)
Given all of that, I’d be happy to chat, or even do a meetup on metric incentives and related issues generally, but I’m not sure I have time to put my thoughts together more clearly in the next month. But I’d think Ozzie Gooen has even more useful things to say on the topic. (Thinking about it, I’d be really interested in being on, or watching, a panel discussion of the topic, which would probably make an interesting event.)
Having a meetup on this seems interesting. Will PM people.
https://www.lesswrong.com/posts/tyNrj2wwHSnb4tiMk/incentive-problems-with-current-forecasting-competitions ?
So one should interpret the points as a measure of how useful you’ve been to the platform’s overall predictions, and not of how good you should be expected to be on a specific question, right?
Not really. Overall usefulness is really about something like your covariance with the overall prediction, i.e. whether you are contributing different ideas and models. That would be very hard to measure, while making the points incentive-compatible is not nearly as hard to do.
And how well an individual predictor will do, based on historical evidence, is found by comparing their Brier score to the Metaculus prediction’s on the same set of questions. This is information that users can see on their own page. But it’s not a useful figure unless you’re asking about relative performance, which, as an outsider interpreting predictions, you shouldn’t care about, because you want the aggregated prediction.
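As a sketch of that comparison, supposing you had a user’s predictions, the Metaculus prediction, and the outcomes on the same set of questions (all the numbers below are made up for illustration):

```python
def mean_brier(preds, outcomes):
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(preds)

# Hypothetical resolved questions that both the user and the Metaculus prediction covered.
user_preds      = [0.80, 0.30, 0.60, 0.10]
metaculus_preds = [0.70, 0.20, 0.75, 0.15]
outcomes        = [1,    0,    1,    0]

user = mean_brier(user_preds, outcomes)
aggregate = mean_brier(metaculus_preds, outcomes)
print(f"user Brier:      {user:.3f}")
print(f"Metaculus Brier: {aggregate:.3f}")
print("user beats the aggregate" if user < aggregate else "aggregate beats the user")
```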
You could also check their track record. It has a calibration curve and much more.
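For anyone curious, the calibration curve on a track record page is conceptually just a binning of resolved predictions; here is a minimal sketch with made-up data and an illustrative binning scheme:

```python
from collections import defaultdict

def calibration_curve(preds, outcomes, n_bins=10):
    """Group predictions into probability bins and compare the average prediction in each
    bin with the fraction of those questions that actually resolved Yes."""
    bins = defaultdict(list)
    for p, o in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    curve = []
    for idx in sorted(bins):
        items = bins[idx]
        avg_pred = sum(p for p, _ in items) / len(items)
        freq = sum(o for _, o in items) / len(items)
        curve.append((avg_pred, freq, len(items)))
    return curve  # (average prediction, observed frequency, count) per bin

# Toy resolved predictions: a well-calibrated forecaster's curve lies near the diagonal.
preds    = [0.1, 0.15, 0.3, 0.35, 0.6, 0.65, 0.9, 0.95]
outcomes = [0,   0,    0,   1,    1,   0,    1,   1]
for avg_pred, freq, count in calibration_curve(preds, outcomes):
    print(f"predicted ~{avg_pred:.2f} -> resolved Yes {freq:.0%} of the time (n={count})")
```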