Systematizing Epistemics: Principles for Resolving Forecasts

Davidmanheim 29 Mar 2021 20:46 UTC

33 points

Forecasting & Prediction Rationality World Modeling

In a previous post, I discussed many methods for resolving predictions. I want to argue that there is a systematic distinction between rules and principles which I think is valuable.

In short, when making rules, one can front-load intentions by writing details upfront, or back-load work by stating high-level principles and having procedures to decide on details on an as-needed basis*. American accounting systems rely on the former, and international accounting systems (and most law systems) focus more on the latter. I think that the question shouldn’t be implicitly decided by front loading assumptions, which is often the current default. More than that, I think the balance should be better and more explicitly be addressed.

Reframing the Problem

Ozzie Gooen’s new organization, QURI (pronounced the same as “query”) is interested in what he’s started to call “systematizing epistemics,” and he offered an analogy that I found very insightful—accounting. Just like keeping track of money is possible without accounting, keeping track of reality is possible without any systematic approach to epistemics—but it’s harder to communicate or agree about money without standardized accounting systems that talk about the same things the same way.

In the aforementioned post, I discussed a variety of ways to resolve predictions. Here, I want to present a more systematic argument about how to think about prediction systems and resolutions. To make this point, I plan to take a detour into accounting—but don’t worry, the post really is about predictions. I want to lay out the analogy between systematizing epistemics and systematizing accounting (even) more in a different post, but for now I’ll jump to the key point for writing prediction questions and resolving predictions.

Accounting Principles versus Accounting Rules

In financial accounting, which is only half as boring as it sounds, there is a conceptual disagreement between rule-based and principle-based methods.

A rule based accounting system has (millions of) rules that get updated and adapted to deal with all of the new ways that clever accountants devise to lie with accounting. That is, a rule based system tries to cover every eventuality. This creates complexity, but still makes sure everything is legible to those who have the necessary expertise to decipher accounting statements. On the other hand, every time someone thinks of a clever new interpretation or hack, it is exploited until new rules, which corporations will lobby against, are developed. Worse, even if there are no loopholes, every time a new law is passed or new financial instrument is created, new loopholes suddenly appear.

A principles based system has essentially the same goals, but instead of trying to account for every scenario and clever trick, it has perhaps a dozen guidelines for what accounting is supposed to do. These are things like consistency, full disclosure, good faith and honesty, dividing entries across appropriate periods of time, and accurate representation of a company’s financial position. The extra flexibility probably makes it harder to stop slight deviations from the best way to do things, and makes comparing financial statements a bit harder. But it is also far easier to do correctly, without a million rules you might have accidentally broken, and it makes it easier to deal with companies finding new and clever ways to cheat, because a new trick that violates the principles is already covered, while a rule-based system has to wait for a rule update.

In practice, all systems are a combination of the two, but the ideal of each system is different. In a fully principles based system, accountants have far more flexibility, but they will get in trouble if they aren’t doing what they are supposed to do, with the boundaries somewhat vague. If they switch from FIFO to LIFO accounting one year to make the profits look better, they clearly broke the rules. Same thing if they end this financial year on December 18th, so they don’t need to include the big loss they took on December 23rd. Those things aren’t allowed in a rule-based system either, but only because the rules explicitly list things like you need to use LIFO, and all financial years must end on a specific date. The costs of compliance for the rule based system, and the complexity of interpreting most financial statements, are probably higher. On the other hand, the risks of ending up in a gray area are likely lower.

Where do we use principles, and where do we use rules?

There seem to be two reasons for rule-based systems: trust and predictability. Trust, because rules are useful when we can’t or don’t want to trust the people making decisions. Predictability, because we can’t translate principles into certainty about outcomes, or into computer code. And trust, with the attendant flexibility, must exist somewhere in any system. The question is whether it is all front-loaded.

Trust

Who do we trust? If your financial system trusts accountants to be honest, you can give them general guidelines and set them loose to accurately reflect the real financial situation at a company. That allows flexibility to exist towards the end of the process. But companies exert pressure on accountants, which means that there are pressures to cheat to raise stock prices or to pay less tax. Giving the accountants more freedom to make decisions, when we can’t trust them, is going to be a bad strategy. So the trust is pushed to the earliest stages of systematizing accounting, in the rule development stages. It can still be subverted, since accountants developing standards also have conflicts of interest, but it makes failures systemic rather than individual (which has its own downsides).

Somewhat similarly, in a detour I won’t expand upon in detail, we can look at legal systems. Because the legal system trusts judges to some extent, it can give them more latitude. To the extent Congress trusts judges, it can leave laws ambiguous. And to the extent that they are incompetent at writing laws, the same is true. But intentional or inadvertent, this is back-loading the flexibility. The laws can be unclear, which makes people uncertain what is allowed, and it is up to judges to clarify them post-hoc.

Legislators have a different option for front-loading flexibility. That is, to the extent that they trust regulators, they can pass along responsibility for creating detailed rules.

Finally, to the extent that the rulers of a society trust the public, they can just articulate what they think would be nice, and let the public decide. Social norms often operate this way—they change and are not spelled out, and people need to learn them implicitly. And as should be clear for both norms and laws, ambiguity doesn’t work when the group is large and heterogeneous—predictability is limited when you don’t know, or trust, the other people in the group. This leads us to the next point.

Predictability

Beyond questions of trust, we have a question of predictability. If your financial system is principle-based, accounting software is tricky. Not every firm needs to do things the same way, and there will be an unlimited number of customizations needed to manage systems. Even worse is trying to automate any type of review or fraud detection.

Similarly, it is harder to make policing decisions without clear rules. Speed limits are clear rules; banning “unsafe” driving is not. Speed cameras are easy to use because a camera can check a single number. Maximum BAC is a clear rule; “impaired” driving is not. You can guess which of these are more often used by police who don’t want to be called on to defend their subjective judgements in court.

But a tradeoff mentioned earlier applies here in spades—explicit rules are fragile, and if they are supposed to conform to the intent, need to be updated more often. And frequent updates push against predictability, since the predictions need to account for the fact that the rules can change. And in fact, it can be worse—they can give a false illusion of predictability.

Principles versus Rules for Predictions

There are (at least) two parties involved in predictions; the predictors, and the readers. Predictors usually want clear rules and no ambiguity. The readers of the prediction—including the writers, the sponsors of prediction questions, the general public, and in the case of a futarchy, the system being controlled—want fidelity with intent, not strict adherence to the letter of the law.

There are often places where the spirit and the letter conflict. When that happens, the clash is unfortunate and unintended. For example, a question may intend to forecast the number of cases of COVID-19 which occurred in the first half of 2020, but end up forecasting the speed of creating and deploying tests. (Or the insanity of the FDA in stopping people from doing so, as the case may be.)

The death of the author approach to forecasts is great for predictors. In that scenario, we have a presumption that the spirit of a question is irrelevant once it’s written down. But for prediction markets to be useful, there should be a balance between principles and rules.

But as happened in accounting in the 1800s, most of the effort for forecasting resolution so far has gone into making rules that work, with the principles being implicit. That’s fine, but better understanding of the role of principles and rules would be valuable.

Predictions Cannot Live by Principles Alone

The past, of course, was akin to a purely principle based system, where we trust informal resolutions and evaluations. Pundits might predict something like “there will be increased Chinese aggression this year,” and grade themselves highly, but they do so no matter what occurs. Prediction markets operationalize this into a rule-based resolution; “there will be a fatality in the South China Sea before the end of the year in a confrontation between different countries,” and resolving that is straightforward, relying on nothing but an object level event. Prediction markets fix the problem.

So we have a thesis, punditry, and an antithesis, predictions. I claim that we are waiting for a synthesis. In my view, that synthesis is creating a clearer principle-based approach for creating, understanding, interpreting, and applying the rules.

The open question is what a set of principles that guide the rules for writing and resolving questions, and guide the interpretation in cases of ambiguity, should look like. More clarity about these principles is needed.

Forecasting Principles—Why, and Which Ones?

In forecasting, the implicit use of principles in place of rules means that interpretation is harder, and predictions are worse.

I think there is broad agreement about many of the principles, but they haven’t been formulated. For example, when writing a prediction question, we care about minimizing ambiguity, having a concrete outcome, relating the question to the actual uncertainty or outcome, consistency with other predictions, and so on. When resolving a question, we care about things like fidelity to the intent and the language of the prediction.

Below, I want to lay out both some of the principles, and the best practices and implications for how they apply.

Some Plausible Principles for Forecasting

1. Predictions should be resolved.
   1. This requires that they be resolvable.
   2. Both the prediction period and the resolution time should be specified.
   3. The resolution method should be known.
2. Predictions should be clear.
   1. Predictors should be able to represent their actual beliefs.
   2. Predictions should be concrete when possible, rather than verbal.
3. Scoring should be clear.
   1. As simple as practical
   2. Known to forecasters
   3. Incentive-compatible
4. Questions should attempt to be useful.
   1. Parallel other similar questions.
   2. Match language and criteria from other sources.
   3. Have standard formats where possible.

Further Thoughts on Applying the Principles

Below, I have expanded on the principles and written commentary. I will be happy to update this section with comments from readers, which will (by default) be attributed to your username.

Predictions are not punditry, and without resolution the incentives for accuracy, and the feedback needed to improve, are hampered. Most of the criteria here are technical, but there are trade-offs between resolution and other valuable principles.

Predictions should be resolved.

This requires that they be resolvable.
- The resolution criteria should be well-specified.
- If relevant or possible, the intent of the question, or guidance for how resolution will occur, should be clear.
  - Ambiguity should be avoided, but by default, when (inevitable) ambiguity arises, intent should guide the resolution. Any guidance about the motive or intent of the question can therefore be an asset. This is especially true when resolutions are based on expert opinion.
- By default, forecasts should be assumed to be about object-level issues.

In cases where ambiguity arises, tortuously interpreting the text in unintended ways is unhelpful. (Example; it says ‘reported/estimated’ not ‘estimated/reported’, so we can infer that if a reported number is available, that should be used instead of an estimate.)

And there are sometimes questions where the technical criteria are not fulfilled, but for reasons unrelated to the intent of the question. (Source X is discontinued and lists an old value, but recommends using alternate source Y in the future—but the resolution says it will use source X.) In such cases, the goal of predicting object level reality, rather than predicting meta-level reporting, should be the dominant concern. This is both useful for question designers, and (per Principle 4,) helps ensure that those using the forecast of the question are getting useful information.

Both the prediction period and the resolution time should be specified.
- These times may not be the same.
- It can be better to leave questions open past when the resolution is known.

Knowing when a question will close is valuable for planning, and for understanding the scoring. Especially in situations where the final prediction is counted for a significant portion of the overall score, if a question closes seconds after an event occurs, there is a windfall gain for anyone who happens to be furiously checking the news and updating just at the right moment. (Side note: there are problems with closing predictions early.)
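As a rough illustration of that windfall, here is a minimal Python sketch. The scoring rule is hypothetical: it blends a time-averaged log score with the log score of the final forecast, and `final_weight`, `blended_score`, and the example numbers are all assumptions made for illustration, not the rule of any particular platform.

```python
import math


def log_score(p: float, outcome: bool) -> float:
    """Log score of a binary forecast p, given whether the event happened."""
    return math.log(p if outcome else 1.0 - p)


def blended_score(segments, outcome: bool, final_weight: float) -> float:
    """Hypothetical rule: mix the time-averaged score with the final forecast's score.

    segments: list of (duration_in_days, probability) pairs in chronological order.
    """
    total = sum(duration for duration, _ in segments)
    time_avg = sum(d * log_score(p, outcome) for d, p in segments) / total
    final = log_score(segments[-1][1], outcome)
    return (1.0 - final_weight) * time_avg + final_weight * final


# Two forecasters hold 60% on a 30-day question that resolves True.
steady = [(30.0, 0.6)]
# The second one sees the news and jumps to 99% for the last 0.001 days
# (under a minute and a half) before the question closes.
last_minute = [(29.999, 0.6), (0.001, 0.99)]

steady_score = blended_score(steady, True, final_weight=0.5)
last_minute_score = blended_score(last_minute, True, final_weight=0.5)
print(round(steady_score, 3), round(last_minute_score, 3))
```

Under these assumed numbers, the last-minute updater takes roughly half the penalty of the steady forecaster despite holding the same belief for essentially the whole period. Lowering the final weight, or fixing the closing time before the event, shrinks that gap.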

The resolution method should be known.
- A system or individual should be chosen as the final decision-maker beforehand.
  - Planning beforehand reduces the burden of resolving questions. There is also a practical issue with timely resolution and the burden on the market.
- Automated resolution and scoring is usually better than manual or subjective decisions.

Making choices or automating resolutions in advance can be easier than needing to revisit the questions. And unambiguous, automated criteria can minimize dispute, but an unambiguous criterion does not always match the object level issue to be tracked, so a tradeoff exists. “The best published estimate” is somewhat ambiguous, and may require arbitration. Despite that, it may be a better and more robust criterion than “the number published by source X,” since sometimes the source is discontinued, or better options for the resolution are discovered or created.
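One way to picture stating these details upfront is a small record attached to each question. The Python sketch below is only an assumption about how that might look. `QuestionSpec`, `Resolution`, `resolve`, and every field name are hypothetical, and the fallback shown (use the named source if it reports, otherwise defer to the pre-chosen decision-maker) is just one possible policy.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class QuestionSpec:
    """Hypothetical record of what a question states before forecasting starts."""
    text: str              # the concrete claim being forecast
    intent: str            # what the question is meant to track
    closes_at: datetime    # last moment forecasts are accepted
    resolves_at: datetime  # when the outcome is decided (may differ from closes_at)
    primary_source: str    # the rule: where the number is expected to come from
    resolver: str          # who decides if the rule runs out


@dataclass(frozen=True)
class Resolution:
    value: Optional[float]
    decided_by: str


def resolve(spec: QuestionSpec, source_value: Optional[float],
            arbitrated_value: Optional[float] = None) -> Resolution:
    """Use the named source when it reports; otherwise fall back to the resolver."""
    if source_value is not None:
        return Resolution(source_value, spec.primary_source)
    if arbitrated_value is not None:
        # Source discontinued or unusable: the pre-chosen decision-maker
        # resolves according to the stated intent.
        return Resolution(arbitrated_value, spec.resolver)
    return Resolution(None, "unresolved")


spec = QuestionSpec(
    text="Value of indicator Z on 31 Dec, as published by source X",
    intent="Track the real-world level of Z, not reporting quirks",
    closes_at=datetime(2021, 12, 1),
    resolves_at=datetime(2022, 1, 15),
    primary_source="source X",
    resolver="question author",
)
print(resolve(spec, None, arbitrated_value=41.5))
```

The point of the design is only that the fallback decision-maker is named before anyone forecasts, so a discontinued source leads to a known arbitration step rather than a dispute.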

Predictions should be clear

Predictors should be able to represent their actual beliefs
- Specific values are better than choosing ranges, and specifying prediction intervals and probabilities is better than binary triggers.
- Fidelity to the true claimed distributions is valuable, rather than, for example, using pre-specified distributions.
  - This desideratum is often difficult to reconcile with clear scoring, since complexity in forecasts generally requires complexity in scoring.
Predictions should be concrete when possible, rather than verbal.
- “Higher” is less clear than “above the current value as of [date], which is [Value].”
- Punditry is both less resolvable, and less valuable, than clear predictions.

Scoring should be clear

As simple as practical
- As noted above, there is a tradeoff between allowing more expressive predictions and simple scoring rules.
Known to forecasters
- Even when resolution criteria are allowed to be ambiguous, scoring should not be.
Incentive-compatible
- Gaming of the system should be minimized.
  - Forecasters should not need to spend time gaming the system to have correct incentives. (Side note: Incentive compatibility is tricky when people are not looking only to maximize their score, or when there are non-trivial costs to predicting.)
  - Also, between being known, and reducing gaming, there are some critical issues with actually building incentive-compatible systems, since incentives differ among forecasters in ways that may make incentive-compatibility incompatible with uniform scoring.
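For the simplest case, a single binary question and a forecaster who only cares about expected score, incentive compatibility can be checked numerically. The sketch below compares the log score, which is a proper scoring rule, with a linear score, which is not. The function names and the grid search are assumptions made for illustration, and the setup deliberately ignores the complications raised in the side note above.

```python
import math


def log_score(p: float, outcome: bool) -> float:
    """Proper: expected score is maximized by reporting the true belief."""
    return math.log(p if outcome else 1.0 - p)


def linear_score(p: float, outcome: bool) -> float:
    """Improper: rewards pushing the report toward 0 or 1."""
    return p if outcome else 1.0 - p


def expected_score(score, report: float, belief: float) -> float:
    """Expected score of `report` if the event truly has probability `belief`."""
    return belief * score(report, True) + (1.0 - belief) * score(report, False)


def best_report(score, belief: float) -> float:
    """Report (on a 1%..99% grid) that maximizes expected score."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda r: expected_score(score, r, belief))


belief = 0.7
print("log score, best report:   ", best_report(log_score, belief))
print("linear score, best report:", best_report(linear_score, belief))
```

With a 70% belief, the log score is maximized by reporting 70%, while the linear score is maximized by the most extreme report the grid allows. That is the sense in which a scoring rule can force forecasters to game the system just to do well.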

Questions should attempt to be useful.

Parallel other similar questions.
- There is a tension between precision and uniformity.
  - For purposes of this principle, in a US presidential election, “who will win the presidential election” is more uniform with similar questions than “who will hold the office of President on Jan 21st.” Similarly, “which candidate will win states with the largest number of electoral college votes” is potentially difficult to compare with “Which candidate will win the electoral college,” for example if there are faithless electors.
  - If the tradeoff is significant, it may be better to have a question explicitly about the difference.
Match language and criteria from other sources
- Using identical language and criteria makes both aggregation and resolution easier.
  - As with survey questions, there is a significant amount of variation about how a question is asked that can have important implications for comparing and aggregating predictions. Consistency across platforms and between questions is valuable for promoting this. That means it can and should be promoted unless other overriding concerns exist.
- Clearly highlight when this is not the case.
  - Unless the differences in the edge cases are particularly relevant, a standard format and phrasing should be preferred. And as above, it may be better to have a question explicitly about the difference.
Have standard formats where possible
- Clever new input methods are harder for forecasters to understand and use
- Complex phrasing makes mistakes easier.
  - “Will not X happen” versus “Will X happen.”

These are not a final word, but I think they may be a useful basis for continued discussion. And thanks go to Nuño Sempere and Ozzie Gooen for helpful suggestions and additions.

*) I am grateful to Ozzie Gooen for helping me frame this more clearly.

Ericf 1 Apr 2021 0:29 UTC
8 points
Sorry if this is an ignorant question, but wouldn’t it be better to cut off making predictions before the actual resolution?

Predictions made just before an event resolves do not add much value to the world. Having your prediction market “shoot up to 100” in the days or hours before an event is a restatement of other widespread information, in contrast to a 60% chance a month out, which is a better aggregate of diffuse knowledge.
- Davidmanheim 1 Apr 2021 5:41 UTC
  2 points
  Parent
  In many cases, yes. But for some events, the “obvious” answers are not fully clear until well after the event in question takes place—elections, for example.
SimonM 30 Mar 2021 8:00 UTC
8 points
This seems pretty reasonable. In particular I think you’ve very clearly articulated a set of principles which are valuable for prediction. However, I feel like your principles for forecasting seem to explicitly lead to most forecasting needing to be rules based. For example, to begin with you say:
In short, when making rules, one can front-load intentions by writing details upfront, or back-load work by stating high-level principles and having procedures to decide on details on an as-needed basis*. [...] I think that the question shouldn’t be implicitly decided by front loading assumptions, which is often the current default. More than that, I think the balance should be better and more explicitly be addressed.
But then later on you say:
This requires that they be resolvable.
- The resolution criteria should be well-specified.
- If relevant or possible, the intent of the question, or guidance for how resolution will occur, should be clear.
  Ambiguity should be avoided, but by default, when (inevitable) ambiguity arises, intent should guide the resolution. Any guidance about the motive or intent of the question can therefore be an asset. This is especially true when resolutions are based on expert opinion.
- By default, forecasts should be assumed to be about object-level issues.
My first take reading “decide on details on an as-needed basis” was that we could somehow avoid most of the pain which goes alongside writing forecasting questions, but that doesn’t seem to be the case.
Another example would be the need to have explicit rules for scoring.
To give a concrete example, I think Metaculus is pretty far to the “Rules” end of the spectrum in your frame. Despite this, I can’t really see what you’re suggesting would change if they shifted to a more “Principles” based approach. If I had to speculate, I would guess your only “real” change would be something like:
“Where the spirit and letter of a question conflict, the question should be resolved based on the spirit”
This seems pretty reasonable to me (although I don’t feel that you’ve made a compelling case for it).
----
But for prediction markets to be useful, there should be a balance between principles and rules.
So I don’t fully agree with the logic in this section. I agree that often the spirit and letter of a question can disagree, but I don’t necessarily think that prediction markets will be “more useful” if they have a more principles based approach. (At least, I think the downsides from potentially uncertain resolution outweigh the upside from potentially resolving “more correctly”). [To give a concrete example, my comfort level using Polymarket decreased substantially when they (in my opinion) changed their resolution criteria on their US Election market. (When the on-ramp and off-ramp started to creak that was the nail in the coffin as far as I was concerned)]
- Davidmanheim 30 Mar 2021 18:37 UTC
  2 points
  Parent
  I haven’t said, and I don’t think, that the majority of markets and prediction sites get this wrong. I think they navigate this without a clear framework, which I think the post begins providing. And I strongly agree that there isn’t a slam-dunk-no-questions case for principles overriding rules, which the intro might have implied too strongly. I also agree with your point about downsides of ambiguity potentially overriding the benefits of greater fidelity to the intent of a question, and brought it up in the post. Still, excessive focus on making rules on the front end, especially for longer-term questions and ones where the contours are unclear, rather than explicitly being adaptive, is not universally helpful.
  
  And clarifications that need to change the resolution criteria mid-way are due to either bad questions, or badly handled resolutions. At the same time, while there are times that avoiding ambiguity is beneficial, there are also times when explicitly addressing corner cases to make them unambiguous (“if the data is discontinued or the method is changed, the final value posted using the current method will be used”) makes the question worse, rather than better.
  
  Lastly, I agree that one general point I didn’t say, but agree with, was that “where the spirit and letter of a question conflict, the question should be resolved based on the spirit.” I mostly didn’t make an explicit case for this because I think it’s under-specified as a claim. Instead, the three more specific claims I would make are:
  1) When the wording of a question seems ambiguous, the intent should be an overriding reason to choose an interpretation.
  2) When the wording of a question is clear, the intent shouldn’t change the resolution.
  - SimonM 30 Mar 2021 19:19 UTC
    3 points
    Parent
    (I realise everything I’m commenting seems like a nitpick and I do think that what you’ve written is interesting and useful, I just don’t have anything constructive to add on that side of things)
    I don’t like litigating via quotes, but:
    I haven’t said, and I don’t think, that the majority of markets and prediction sites get this wrong.
    and
    More than that, I think the balance should be better and more explicitly be addressed.
    I read the bit I’ve emphasised as saying “prediction sites have got this balance wrong” contradicting your comment saying you think they have it right.
    Still, excessive focus on making rules on the front end, especially for longer-term questions and ones where the contours are unclear, rather than explicitly being adaptive, is not universally helpful.
    I think it’s really hard for this adaptive approach to work when there’s more than a small group of like minded people involved in a forecast. (This is related to my final point):
    1) When the wording of a question seems ambiguous, the intent should be an overriding reason to choose an interpretation.
    2) When the wording of a question is clear, the intent shouldn’t change the resolution.
    The problem for me (with this) is what is “clear” for some people is not clear for others. To give one example of this, the language in this question was completely unambiguous to me (and its author) but another predictor found it unclear. (I don’t think this is a particularly good example, but it’s just one which I thought of when trying to think of an example of where some people thought something was ambiguous and some people didn’t).
    - Davidmanheim 31 Mar 2021 14:10 UTC
      2 points
      Parent
      re: “Get this wrong” versus “the balance should be better,” there are two different things that are being discussed. The first is about defining individual questions via clear resolution criteria, which I think is done well, and the second is about defining clear principles that provide context and inform what types of questions and resolution criteria are considered good form.
      
      A question like “will Democrats pass H.R.2280 and receive 51 votes in the Senate” is very well defined, but super-narrow, and easily resolved “incorrectly” if the bill is incorporated into another bill, or if an adapted bill is proposed by a moderate Republican and passes instead, or passed via some other method, or if it passes but gets vetoed by Biden. But it isn’t an unclear question, and given the current way that Metaculus is run, would probably be the best way of phrasing the question. Still, it’s a sub-par question, given the principles I mentioned. A better one would be “Will a bill such as H.R.2280 limiting or banning straw purchases of firearms be passed by the current Congress and enacted?” It’s much less well defined, but the boundaries are very different. It also uses “passed” and “enacted”, which have gray areas. At the same time, the failure modes are closer to the ones that we care about near the boundary of the question. However, given the current system, this question is obviously worse—it’s harder to resolve, it’s more likely to be ambiguous because a bill that does only some of the thing we care about is passed, etc.
      Still, I agree that the boundaries here are tricky, and I’d love to think more about how to do this better.
axioman 2 Apr 2021 19:09 UTC
3 points
“This desiderata is often difficult to reconcile with clear scoring, since complexity in forecasts generally requires complexity in scoring.”
Can you elaborate on this? In some sense, log-scoring is simple and can be applied to very complex distributions. Are you saying that this would still be “complex scoring” because the complex forecast needs to be evaluated, or is your point about something different?
- Davidmanheim 3 Apr 2021 19:23 UTC
  2 points
  Parent
  Yeah, that’s true. I don’t recall exactly what I was thinking.
  
  Perhaps it was regarding time-weighting, and the difficulty of seeing what your score will be based on what you predict. But the Metaculus interface handles this well, modulo early closings, which screw lots of things up. Also, log-scoring is tricky when you have both continuous and binary outcomes, since they don’t give similar measures: being well calibrated for binary events isn’t “worth” as much, which seems perverse in many ways.