Great questions! I’ll try to respond to the points in order.
Question 1
The distinction between having forecasters/Elizabeth predict her initial distributions versus the final mean was indeed rather confusing. I later wrote some internal notes to think through the implications in more detail; you can see them here.
I have a lot of uncertainty about how best to structure these setups. For cost-effectiveness, though, I think Elizabeth's initial distributions should be seen as estimates of the correct value, which is what she occasionally gave later. As such, for cost-effectiveness we are interested in how well the forecasters did at estimating this correct value versus how well she did at estimating it.
Separately, it's of course apparent that the correct value is itself an estimate. There's further theoretical work to be done on what it should have been estimating, and empirical work to be done to get a sense of how well it holds up against even more trustworthy estimates.
I personally don't regard the cost-effectiveness here as that crucial; I'd instead treat much of this experiment as a set of structures that could apply to more important things in other cases. Elizabeth's time was rather inexpensive compared to that of other people/procedures we may want to use in the future, and we could also invest fixed costs to improve the marginal costs of such a setup.
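The cost-effectiveness comparison described above can be sketched in code. This is a hypothetical illustration, not part of the actual experiment: all the numbers are made up, and mean absolute error stands in for whatever scoring rule one would actually use.

```python
# Hypothetical sketch: compare how well the forecasters' aggregate and
# Elizabeth's initial estimates each tracked her later "correct" values.
# All numbers below are made up for illustration.

def mean_abs_error(estimates, resolved):
    """Mean absolute error of a list of estimates against resolved values."""
    return sum(abs(e - r) for e, r in zip(estimates, resolved)) / len(resolved)

resolved_values      = [10.0, 4.0, 25.0]   # Elizabeth's later "correct" values
elizabeth_initial    = [12.0, 3.0, 30.0]   # her initial point estimates
forecaster_aggregate = [11.0, 4.5, 26.0]   # pooled forecaster estimates

print(mean_abs_error(elizabeth_initial, resolved_values))     # her initial error
print(mean_abs_error(forecaster_aggregate, resolved_values))  # forecasters' error
```

If the forecasters' error is reliably lower than the initial-estimate error, and the cost of obtaining it is low enough, the setup looks cost-effective under this framing.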
Question 2
We haven't talked about this specific thing, but I could definitely imagine it. The general hope is that even without such a split, many splits would happen automatically. One big challenge is to get the splits right. One might initially think that forecaster work should be split by partitions of questions, but this may be pretty suboptimal. Some forecasters may have significant comparative advantages in techniques that span questions; for instance, some people are great at making mathematical models, and others are great at adjusting the tails of distributions to account for common biases. I think of this more as dividing cognitive work by trading strategies than by questions.
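The strategy-based split above can be sketched as a pipeline where each contributor's technique applies across every question. This is a toy illustration under assumed parameters (the `1.5` tail-widening factor and the per-question model numbers are invented), not a description of any real system.

```python
# Hypothetical sketch of splitting work by strategy rather than by question:
# one contributor produces model-based distributions per question, and another
# applies a single tail-widening adjustment across every question.
# All parameters are illustrative assumptions.

def model_based(question):
    """'Modeler' strategy: returns a (mean, stdev) pair for one question."""
    return question["model_mean"], question["model_sd"]

def widen_tails(mean, sd, factor=1.5):
    """'Tail adjuster' strategy: inflate the spread to correct overconfidence."""
    return mean, sd * factor

questions = [
    {"name": "Q1", "model_mean": 10.0, "model_sd": 2.0},
    {"name": "Q2", "model_mean": 4.0,  "model_sd": 0.5},
]

# The tail adjuster's work composes with the modeler's work on every question.
for q in questions:
    mean, sd = widen_tails(*model_based(q))
    print(q["name"], mean, sd)
```

The point of the sketch is that the second contributor never needs question-specific knowledge; their comparative advantage is a transformation applied to everyone's distributions.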
There are a whole ton of possible experiments to be done here, because there are many degrees of freedom. Pursuing these in an effective way is one of our main questions. Of course, if we could have forecasters help forecast which experiments would be effective, then that could help bootstrap a process.
Question 3
We’ve come up with a few “rubrics” to evaluate how effective a given question or question set will be. The main factors are things like:
Tractability (How much progress can be made for how many resources? What if all the participants are outside the relevant organizations/work?)
Importance (How likely is this information to be valuable for changing important decisions?)
Risk (How likely is it that this work will really anger someone or lead to significant downsides?)
I think it's really easy to spend a lot of money predicting ineffective things if you are not careful. Finding opportunities that are EV-positive is a pretty significant challenge here. My general intended strategy is a mix of "try a bunch of things" and "try to set up a system so the predictors themselves can predict the rubric elements (or similar) for a bunch of candidate questions."
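One could imagine turning the rubric above into a rough priority score. The weights, the multiplicative form, and the candidate names below are illustrative assumptions of mine, not a method proposed in the post.

```python
# Hypothetical sketch: combine the rubric factors into a rough priority score.
# The multiplicative form and all input values are illustrative assumptions.

def priority(tractability, importance, risk):
    """All inputs in [0, 1]; higher tractability/importance and lower risk
    yield a higher score."""
    return tractability * importance * (1.0 - risk)

candidates = {
    "question_set_A": priority(tractability=0.8, importance=0.6, risk=0.1),
    "question_set_B": priority(tractability=0.3, importance=0.9, risk=0.4),
}
print(max(candidates, key=candidates.get))  # the higher-scoring candidate
```

In the bootstrapped version described above, forecasters would supply the rubric inputs themselves, so the scoring function becomes another thing to experiment with.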
Question 4
Agreed! That said, there are many possible dimensions for “complexity”, so there’s a lot of theoretical and practical work to be done here.
It seems like Ozzie is answering Question 3 on a more abstract level than it was asked. There's a difference between "How valuable will it be to answer question X?" (what Ozzie answered) and "How outsourceable is question X?" (what Lawrence's question was about).
I think that outsourceability would be a sub-property of Tractability.
In more detail, some properties I imagine affect outsourceability are whether the question:
1) Requires in-depth domain knowledge/experience
2) Requires substantial back-and-forth between question asker and question answerer to get the intention right
3) Relies on hard-to-communicate intuitions
4) Cannot easily be converted into a quantitative distribution
5) Has independent subcomponents that can be answered separately and don't rely on each other (related to Lawrence's point about tractability)