My PhD thesis: Algorithmic Bayesian Epistemology

In January, I defended my PhD thesis, which I called Algorithmic Bayesian Epistemology. From the preface:

For me as for most students, college was a time of exploration. I took many classes, read many academic and non-academic works, and tried my hand at a few research projects. Early in graduate school, I noticed a strong commonality among the questions that I had found particularly fascinating: most of them involved reasoning about knowledge, information, or uncertainty under constraints. I decided that this cluster of problems would be my primary academic focus. I settled on calling the cluster algorithmic Bayesian epistemology: all of the questions I was thinking about involved applying the “algorithmic lens” of theoretical computer science to problems of Bayesian epistemology.

Although my interest in mathematical reasoning about uncertainty dates back to before I had heard of the rationalist community, the community has no doubt influenced and strengthened this interest.

The most striking example of this influence is Scott Aaronson’s blog post Common Knowledge and Aumann’s Agreement Theorem, which I ran into during my freshman year of college.[1] The post made things click together for me in a way that made me more intellectually honest and humble, and generally a better person. I also found the post incredibly intellectually interesting—and indeed, Chapter 8 of my thesis is a follow-up to Scott Aaronson’s academic paper on Aumann’s agreement theorem.

My interest in forecast elicitation and aggregation, while pre-existing, was no doubt influenced by the EA/​rationalist-adjacent forecasting community.

And Chapter 9 of the thesis (work I did at the Alignment Research Center) is no doubt causally downstream of the rationalist community.

Which is all to say: thank you! Y’all have had a substantial positive impact on my intellectual journey.

Chapter descriptions

The thesis contains two background chapters followed by seven technical chapters (Chapters 3-9).

In Chapter 1 (Introduction), I try to convey what exactly I mean by “algorithmic Bayesian epistemology” and why I’m excited about it.

In Chapter 2 (Preliminaries), I give some technical background that’s necessary for understanding the subsequent technical chapters. It’s intended to be accessible to readers with a general college-level math background. While the nominal purpose of Chapter 2 is to introduce the mathematical tools used in later chapters, the topics covered there are interesting in their own right.

Different readers will of course have different opinions about which technical chapters are the most interesting. Naturally, I have my own opinions: I think the most interesting chapters are Chapters 5, 7, and 9, so if you are looking for direction, you may want to tiebreak toward reading those. Here are some brief summaries:

Chapter 3: Incentivizing precise forecasts. You might be familiar with proper scoring rules, which are mechanisms for paying experts for forecasts in a way that incentivizes the experts to report their true beliefs. But there are many proper scoring rules (most famously, the quadratic score and the log score), so which one should you use? There are many perspectives on this question, but the one I take in this chapter is: which proper scoring rule gives experts the strongest incentive to do research before reporting their forecast? (See also this blog post I wrote explaining the research.)
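
For readers who haven't seen proper scoring rules in code, here is a minimal sketch (my own illustration, not from the thesis) of the quadratic and log scores for a binary event, with a numerical check that reporting your true belief maximizes your expected score under both:

```python
import numpy as np

def quadratic_score(p, outcome):
    """Quadratic (Brier-style) score of forecast p for a binary outcome in {0, 1}."""
    return 1 - (outcome - p) ** 2

def log_score(p, outcome):
    """Logarithmic score: the log of the probability assigned to the realized outcome."""
    return np.log(p if outcome == 1 else 1 - p)

def expected_score(rule, belief, report):
    """Expected payment for reporting `report` when the expert's true belief is `belief`."""
    return belief * rule(report, 1) + (1 - belief) * rule(report, 0)

# Properness: under either rule, expected score is maximized by reporting the true belief.
belief = 0.7
grid = np.linspace(0.01, 0.99, 99)
for rule in (quadratic_score, log_score):
    best_report = max(grid, key=lambda r: expected_score(rule, belief, r))
    print(rule.__name__, "-> best report:", round(best_report, 2))  # ~0.7 for both
```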

Chapter 4: Arbitrage-free contract functions. Now, what if you’re trying to elicit forecasts from multiple experts? If you’re worried about the experts colluding, your problem is now harder. It turns out that if you use the same proper scoring rule to pay every expert, then the experts can collude to all report the same forecast—and then redistribute their rewards—in a way that leaves every expert better off, no matter the outcome, than if they hadn’t colluded. (The term for this sort of collusion is “arbitrage”.) On the other hand, you now have more flexibility, because you can pay each expert in a way that depends on the other experts’ reports. In this chapter, I resolve an open problem by showing how to pay experts in a way that makes such arbitrage impossible.
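
To see the problem concretely, here is a toy illustration (my numbers, not the chapter's) of arbitrage under the quadratic score: two experts with beliefs 20% and 80% both report the mean instead, and the total payout is strictly higher under both outcomes, leaving a surplus they can split however they like:

```python
def quadratic_score(p, outcome):
    return 1 - (outcome - p) ** 2

beliefs = [0.2, 0.8]
collusive_report = sum(beliefs) / len(beliefs)  # everyone reports the mean, 0.5

for outcome in (0, 1):
    honest = sum(quadratic_score(p, outcome) for p in beliefs)
    colluded = sum(quadratic_score(collusive_report, outcome) for _ in beliefs)
    print(f"outcome={outcome}: honest total {honest:.2f}, colluded total {colluded:.2f}")
    # The colluded total (1.50) beats the honest total (1.32) for BOTH outcomes,
    # so the experts can redistribute the surplus and each profit no matter what.
```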

Chapter 5: Quasi-arithmetic pooling. (One of my favorites.) Let’s say that you’ve used a proper scoring rule to elicit forecasts from multiple experts, and now you want to aggregate them. This chapter’s basic take is that your method of aggregation ought to depend on the scoring rule you used. For instance, the log scoring rule incentivizes experts to “be careful” around extreme probabilities: an expert incentivized with the log scoring rule cares substantially about the difference between a 1% chance and a 0.01% chance, while an expert incentivized with the quadratic scoring rule basically doesn’t care. So if you’re using the log scoring rule for eliciting forecasts, it makes sense to “take low probabilities more seriously” when aggregating the forecasts.

In the chapter, I define quasi-arithmetic (QA) pooling with respect to a proper scoring rule to be a particular method of forecast aggregation that depends on the scoring rule. For example, QA pooling with respect to the quadratic score means averaging the forecasts, while QA pooling with respect to the log score means averaging the log odds. I show that QA pooling has a bunch of nice properties and argue that the connection it establishes between forecast elicitation and forecast aggregation is pretty natural and fundamental.
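
Here is a minimal sketch of those two instances (the function names are mine); note how the log-score version "takes low probabilities more seriously" when one forecast is extreme:

```python
import numpy as np

def qa_pool_quadratic(forecasts, weights=None):
    """QA pooling w.r.t. the quadratic score: (weighted) arithmetic mean of forecasts."""
    return np.average(forecasts, weights=weights)

def qa_pool_log(forecasts, weights=None):
    """QA pooling w.r.t. the log score: (weighted) average in log-odds space."""
    forecasts = np.asarray(forecasts, dtype=float)
    log_odds = np.log(forecasts / (1 - forecasts))
    return 1 / (1 + np.exp(-np.average(log_odds, weights=weights)))

print(qa_pool_quadratic([0.5, 0.001]))  # 0.2505: the mean barely reacts to the 0.001
print(qa_pool_log([0.5, 0.001]))        # ~0.031: log-odds pooling takes it seriously
```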

Chapter 6: Learning weights for logarithmic pooling. QA pooling allows you to put weights on experts, depending on their past performance/​how much you trust them. In Chapter 5, I showed that if the proper scoring rule is bounded, then you can efficiently learn weights for experts, by updating the weights based on the experts’ performance, in a way that lets your aggregates be almost as good as if you had known the right set of weights from the beginning. But the log scoring rule isn’t bounded! This chapter extends this result to the log scoring rule, under the assumption that the experts’ forecasts are calibrated.
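
For flavor, here is a generic multiplicative-weights-style sketch of the setup, with a hypothetical forecast stream and an illustrative learning rate. To be clear, this naive update is not the chapter's algorithm, which has to handle the unboundedness of the log score much more carefully:

```python
import numpy as np

def logit(p): return np.log(p / (1 - p))
def sigmoid(x): return 1 / (1 + np.exp(-x))

def log_pool(forecasts, weights):
    """Logarithmic pooling: weighted average of the experts' log odds."""
    return sigmoid(np.dot(weights, logit(forecasts)))

# Hypothetical stream of (expert forecasts, realized outcome) pairs.
rounds = [(np.array([0.9, 0.6]), 1), (np.array([0.2, 0.4]), 0), (np.array([0.8, 0.5]), 1)]

eta = 0.1                       # learning rate (illustrative choice)
weights = np.array([0.5, 0.5])  # start with uniform weights

for forecasts, outcome in rounds:
    print("aggregate:", round(float(log_pool(forecasts, weights)), 3))
    # Upweight experts in proportion to their (exponentiated) log-score performance.
    log_scores = np.log(forecasts if outcome == 1 else 1 - forecasts)
    weights = weights * np.exp(eta * log_scores)
    weights = weights / weights.sum()
```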

Chapter 7: Robust aggregation of substitutable signals. (One of my favorites.) Let’s say that Alice says there’s a 60% chance that it’ll rain tomorrow and Bob says there’s a 70% chance. How should you aggregate these forecasts into a single, all-things-considered number? The obvious answer is that it depends: if Alice knows strictly more information than Bob, you should say 60%. If Bob knows strictly more than Alice, you should say 70%. If their forecasts are based on disjoint pieces of evidence, your answer should be different from if their forecasts are based on heavily overlapping evidence.

But suppose you just don’t know. After all, in practice, you may not have this information. This chapter explores a solution concept called robust aggregation: finding an aggregation strategy that works as well as possible in the worst case over a broad class of possible information structures. The broad class I study here is, roughly speaking, information structures that satisfy informational substitutes, meaning that an additional expert’s information becomes less valuable the more information you already have (i.e., diminishing marginal returns). One highlight of this chapter is a theoretical justification for the practice of extremization: pushing the aggregate forecast away from the prior (see Jaime Sevilla’s writeup here).
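
As a concrete illustration of extremization, here is one common form from the forecasting literature (not necessarily the chapter's exact rule): push the aggregate away from the prior in log-odds space by a factor α > 1.

```python
import numpy as np

def logit(p): return np.log(p / (1 - p))
def sigmoid(x): return 1 / (1 + np.exp(-x))

def extremize(aggregate, prior, alpha):
    """Push `aggregate` away from `prior` in log-odds space by a factor alpha > 1."""
    return sigmoid(logit(prior) + alpha * (logit(aggregate) - logit(prior)))

# If experts with largely non-overlapping evidence each nudge a 50% prior up to 60%,
# their combined evidence plausibly justifies a forecast beyond any individual's 60%.
print(extremize(0.6, prior=0.5, alpha=2.0))  # ~0.69
```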

Chapter 8: When does agreement imply accuracy? Suppose that Alice and Bob disagree about the value of some quantity, because they have different private information. How can they reach agreement? They can of course do so by exchanging all of their information, but that can take a really large amount of communication. Scott Aaronson’s 2005 paper shows that Alice and Bob can quickly reach agreement simply by repeatedly exchanging their estimates (Alice tells Bob her estimate; Bob tells Alice his estimate after updating on what Alice just said; Alice tells Bob her estimate after updating on what Bob just said; and so on). However, this agreement could be very superficial: once Alice and Bob agree, the number they agree on might differ substantially from the number they would have agreed on, had they actually exchanged all of their information. This chapter establishes a sufficient condition under which, if Alice and Bob agree, their agreed-upon estimate does reflect the estimate they would reach if they exchanged all of their information.
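
Here is a toy simulation of the estimate-exchange protocol in a small partition model (my construction, purely illustrative). The key mechanic: each announcement rules out the worlds in which the announcer would have said something else.

```python
# Toy model: 8 equally likely worlds; X is the quantity Alice and Bob estimate.
states = range(8)
X = {s: float(s) for s in states}

# Each party observes only which cell of their partition contains the true world.
alice_partition = [{0, 1, 2, 3}, {4, 5, 6, 7}]
bob_partition = [{0, 2, 4, 6}, {1, 3, 5, 7}]

def estimate(partition, state, public):
    """E[X | own partition cell, and the worlds consistent with all announcements]."""
    cell = next(c for c in partition if state in c)
    support = cell & public
    return sum(X[s] for s in support) / len(support)

true_state = 5
public = set(states)  # worlds consistent with everything announced so far

for _ in range(10):  # converges quickly in this toy example
    alice = estimate(alice_partition, true_state, public)
    # Alice's announcement rules out worlds where she would have said something else.
    public = {s for s in public if estimate(alice_partition, s, public) == alice}
    bob = estimate(bob_partition, true_state, public)
    public = {s for s in public if estimate(bob_partition, s, public) == bob}
    print("Alice:", alice, "Bob:", bob)
    if alice == bob:
        break  # agreement; here it happens to match the full-information answer, 6.0
```

In this particular example the agreed-upon estimate equals what full information exchange would give; the chapter's question is when that is guaranteed rather than lucky.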

Chapter 9: Deductive circuit estimation. (One of my favorites.) This chapter represents work I did at the Alignment Research Center around April-May of last year. In Formalizing the presumption of independence, we asked whether the process of deductive reasoning can be formalized. In this chapter, I explore deductive reasoning in the special case of Boolean circuits. Informally, a deductive estimation algorithm takes as input a Boolean circuit and an explanation of the circuit’s behavior (in some formal language) and outputs an estimate of the fraction of inputs on which the circuit outputs 1. I explore how deductive estimation algorithms ought to behave, and give both positive results (“here’s an efficient deductive estimation algorithm that satisfies linearity and respect for proofs”) and negative results (“but no efficient deductive estimation algorithm additionally satisfies 0-1 boundedness”).
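
To give a taste of the problem, here is a sketch of mine in the spirit of the "presumption of independence" (not an algorithm from the chapter): a naive estimator that propagates probabilities through the gates as if every pair of wires were independent. It fails exactly where an explanation of the circuit's behavior is needed to correct it:

```python
# Gates: ('INPUT',), ('NOT', i), ('AND', i, j), ('OR', i, j), indices into earlier gates.
def independence_estimate(circuit):
    """Estimate Pr[output = 1] over uniform inputs, presuming gate inputs independent."""
    p = []
    for gate in circuit:
        kind = gate[0]
        if kind == 'INPUT':
            p.append(0.5)                      # a uniformly random input bit
        elif kind == 'NOT':
            p.append(1 - p[gate[1]])
        elif kind == 'AND':
            p.append(p[gate[1]] * p[gate[2]])  # presumes the two wires are independent
        elif kind == 'OR':
            a, b = p[gate[1]], p[gate[2]]
            p.append(a + b - a * b)
    return p[-1]

# x AND (NOT x): the true fraction of satisfying inputs is 0, but the naive
# independence presumption estimates 0.25, since the wires are maximally correlated.
circuit = [('INPUT',), ('NOT', 0), ('AND', 0, 1)]
print(independence_estimate(circuit))  # 0.25
```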

A note on the title

The thesis is called “Algorithmic Bayesian Epistemology”. I think most people here know roughly what I mean by “Bayesian epistemology”.

The word “algorithmic” is more of a term of art, referring to the “algorithmic lens” of theoretical computer science. Quoting from the introduction:

The relationship between Bayesian epistemology and algorithmic Bayesian epistemology is the same as the relationship between game theory and algorithmic game theory, and as the relationship between mechanism design and algorithmic mechanism design.

Mechanism design—traditionally a sub-discipline of economics—asks the question: how can we design systems containing strategic agents pursuing their own incentives, in a way that produces good outcomes? For example, how can we auction off multiple items to multiple bidders in a way that produces the optimal social welfare for the bidders? The traditional answer from economic theory is the Vickrey-Clarke-Groves (VCG) auction, which elicits bids, computes the optimal allocation of items, and charges each bidder based on their externality on the remaining bidders.
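
Here is a brute-force sketch of VCG with made-up valuations (illustrative only): compute the welfare-maximizing allocation, then charge each bidder the welfare the others lose by that bidder's presence. Note that the allocation step enumerates every assignment of items to bidders, which is exactly what becomes infeasible at scale:

```python
from itertools import product

items = ['A', 'B']
# Each bidder's value for every bundle of items (hypothetical numbers).
valuations = {
    'bidder1': {(): 0, ('A',): 6, ('B',): 0, ('A', 'B'): 6},
    'bidder2': {(): 0, ('A',): 0, ('B',): 5, ('A', 'B'): 5},
    'bidder3': {(): 0, ('A',): 3, ('B',): 3, ('A', 'B'): 10},
}

def bundle(assignment, bidder):
    """The set of items a given assignment hands to this bidder."""
    return tuple(sorted(item for item, b in zip(items, assignment) if b == bidder))

def best_allocation(bidders):
    """Brute force over all item assignments -- exponential in the number of items."""
    best = max(product(bidders, repeat=len(items)),
               key=lambda a: sum(valuations[b][bundle(a, b)] for b in bidders))
    return best, sum(valuations[b][bundle(best, b)] for b in bidders)

bidders = list(valuations)
alloc, _ = best_allocation(bidders)
for i in bidders:
    others = [b for b in bidders if b != i]
    welfare_without_i = best_allocation(others)[1]
    others_welfare_with_i = sum(valuations[b][bundle(alloc, b)] for b in others)
    # VCG payment: the externality bidder i imposes on everyone else.
    print(i, "gets", bundle(alloc, i), "pays", welfare_without_i - others_welfare_with_i)
```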

Computer scientists find this answer dissatisfying, for a simple reason: computing the optimal allocation is not feasible, from the standpoint of both communication and computation. First, the bidders’ preferences may not be compactly representable, in which case it is infeasible to communicate them to the auctioneer. Second, even if the bidders’ preferences are compactly representable, actually computing the optimal allocation may still be intractable. And so algorithmic mechanism design asks the question: how can we design a computationally and communicationally tractable auction mechanism that attains a large fraction of the optimal social welfare?

Algorithmic mechanism design belongs to a longstanding tradition in theoretical computer science: considering problems from other disciplines through an algorithmic lens.[2] That is, instead of asking for the optimal solution to a problem, computer scientists ask: what is the best solution that can actually be implemented, given real-world (or real-world-inspired) constraints?

Sometimes, these constraints are computational: what is the best solution that can be found in polynomial time? Other times, the constraints are communicational: what is the best solution if parties are limited in how much they can communicate? Other kinds of constraints are also common. For example:

  • Constraints on information. For example, the subfield of online algorithms studies sequential decision making under uncertainty (incomplete information). Often, the goal of an online algorithm is to guarantee a result that is almost as good as the best possible result in hindsight, e.g. the prophet inequality from optimal stopping theory, or no-regret algorithms in online learning. (A small simulation of the prophet inequality appears after this list.)

  • Constraints imposed by the strategic behavior of agents in a system. For example, many computer scientists study the price of anarchy: how much the welfare of a system degrades because of self-interested actors, as compared to a welfare-optimizing central planner.
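
As promised above, here is a small simulation of the prophet inequality (with illustrative distributions of my choosing). A classic result says that accepting the first prize exceeding half the expected maximum guarantees, in expectation, at least half of what a prophet who sees the whole sequence would get:

```python
import random

def take_first_above(values, threshold):
    """Stop at the first prize exceeding the threshold (take nothing if none does)."""
    return next((v for v in values if v > threshold), 0.0)

# Illustrative instance: five independent uniform prizes with different ranges.
draw = lambda: [random.uniform(0, hi) for hi in (1, 2, 3, 4, 5)]

samples = [draw() for _ in range(100_000)]
e_max = sum(max(s) for s in samples) / len(samples)  # estimate E[max] by simulation

fresh = [draw() for _ in range(100_000)]
alg = sum(take_first_above(s, e_max / 2) for s in fresh)
best = sum(max(s) for s in fresh)
print(alg / best)  # empirically well above 0.5, consistent with the guarantee
```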

The study of real-world problems through the algorithmic lens has significantly impacted a variety of disciplines, including molecular biology, ecology, neuroscience, quantum physics, and various social sciences.[3]

And so, algorithmic Bayesian epistemology is simply the application of the algorithmic lens to the discipline of Bayesian epistemology. It is perhaps best to define algorithmic Bayesian epistemology by its examples, but to attempt a loose description:

A question belongs to the field of algorithmic Bayesian epistemology if it involves reasoning about uncertainty from a Bayesian perspective, but under constraints that prevent complete assimilation of all existing information.

  1. ^

    I ran into the post because it was linked from this wonderful Slate Star Codex post, which was also one of the first SSC posts I had run into. I had the classic “oh man, I’ve found my people!” experience that day.

  2. ^

    The term “algorithmic lens” was coined at Berkeley by members of the Theory of Computing research group around the year 2000.

  3. ^

    See Chapter 20 here for a detailed discussion.