Bayesian examination

A few months ago, Olivier Bailleux, a professor of computer science and a reader of my book on Bayesianism, sent me an email. He suggested applying some of the ideas of the book to the examination of students. He proposed Bayesian examination.

I believe it to be a brilliant idea, which could have an important impact on the way many people think. At the very least, it is surely worth sharing here.

tl;dr Bayesian examinations seem very important to deploy because they incentivize both probabilistic thinking and intellectual honesty. And, as argued by Julia Galef in this talk, incentives seem critical to changing our thinking habits.

Let’s take an example

Where is the International Olympic Committee headquartered?
1. Geneva
2. Lausanne
3. Zurich
4. Lugano

Quite often, students are asked to select one of the four possible answers. But this is arguably pretty bad, for several reasons:
- It makes it impossible to distinguish a student who has a hunch from a student who really studied and knew the answer.
- It gives students the habit of self-identifying with a single answer.
- It normalizes deterministic question answering.
- It motivates students to defend the answer they gave (which encourages motivated reasoning...).

Instead, Bayesian examination demands that students provide probabilistic answers. In other words, they have to assign a percentage to each possible answer.

In our case, a student, call her Alice, might thus answer
1. 33%
2. 33%
3. 33%
4. 1%
Alice would essentially be formalizing the sentence “I really don’t know but I would be very surprised if Lugano was the right answer”.

Another student, let’s call him Bob, might answer
1. 5%
2. 40%
3. 50%
4. 5%
Bob might have in mind something like “I know that FIFA and the IOC are in Zurich and Lausanne, but I don’t remember which is where; though Zurich is larger, so it would make sense for the IOC to be in Zurich rather than Lausanne”.

Spoiler: the answer turns out to be Lausanne.

Why naive scoring is bad

Now, how would such an exam be scored? One intuitive idea could be that Alice should thus get 0.33 points, while Bob should get 0.4 points. Denoting $q_i$ the probability assigned by a student to answer $i$, and $a$ the right answer, this would correspond to giving the student a score equal to $S(q) = q_a$.

This would not be a great idea though. The reason has to do with incentives. Indeed, if the above figures are the credences $p$ of Alice and Bob, then both would be incentivized to maximize their expected score $E[S(q) \mid p] = \sum_i p_i q_i$. It turns out that this maximization leads to the following answers (a numerical check follows the lists below).

For Alice:
1. Credence in Geneva is 33%, but she answers 100%.
2. Credence in Lausanne is 33%, but she answers 0%.
3. Credence in Zurich is 33%, but she answers 0%.
4. Credence in Lugano is 1%, but she answers 0%.

(Since Alice’s top three credences are tied, any answer that puts all the mass on these three options is equally optimal; the clear-cut incentive is to answer 0% for Lugano.)

For Bob:
1. Credence in Geneva is 5%, but he answers 0%.
2. Credence in Lausanne is 40%, but he answers 0%.
3. Credence in Zurich is 50%, but he answers 100%.
4. Credence in Lugano is 5%, but he answers 0%.
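To illustrate, here is a minimal sketch (in Python, with made-up variable names) comparing the expected naive score of Bob’s honest answer with that of his exaggerated answer, assuming his credences are the ones above.

```python
# Expected score under the naive rule S(q) = q_a, where a is the right answer:
# E[S(q) | p] = sum_i p_i * q_i, with p the student's credences.

def expected_naive_score(credences, answer):
    return sum(p * q for p, q in zip(credences, answer))

bob_credences = [0.05, 0.40, 0.50, 0.05]  # Geneva, Lausanne, Zurich, Lugano

honest = bob_credences              # report credences as they are
exaggerated = [0.0, 0.0, 1.0, 0.0]  # all-in on the most likely option

print(expected_naive_score(bob_credences, honest))       # 0.415
print(expected_naive_score(bob_credences, exaggerated))  # 0.5
```

Exaggeration beats honesty, so a score-maximizing Bob has no reason to report his actual credences.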

In other words, this naive scoring incentivizes the exaggeration of beliefs towards deterministic answers. This is very, very, very, very, very bad (sorry, I’m a bit of a Bayesian extremist!). It favors polarization, rationalization, groupism and so many other root causes of poor debating.

Indeed, while students may not consciously work out that this exaggeration strategy is optimal, we should expect them to eventually try it and notice, perhaps unconsciously, that it pays off. In particular, this prevents them from valuing the extra effort of probabilistic thinking.

Fortunately, there are better scoring rules.

Incentive-compatible scoring rules

An incentive-compatible scoring rule is usually called a proper scoring rule. But I’m not keen on this terminology, as it’s not transparent, so I’ll stick with “incentive-compatible scoring rule”. Such scoring rules are designed so that truth-telling (or rather “credence-telling”) is incentivized.

There are several incentive-compatible scoring rules, like the logarithmic scoring rule ($S(q) = \ln q_a$) or the spherical scoring rule ($S(q) = q_a / \|q\|_2$). But I think that the most appropriate one may be the quadratic scoring rule, because it is the simplest and the easiest for students to verify.
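For concreteness, here is a minimal sketch of these three rules in Python, assuming `q` is the list of reported probabilities and `a` is the index of the right answer.

```python
import math

def log_score(q, a):
    # Logarithmic scoring rule: ln of the probability given to the right answer.
    return math.log(q[a])

def spherical_score(q, a):
    # Spherical scoring rule: probability of the right answer, divided by ||q||_2.
    return q[a] / math.sqrt(sum(x * x for x in q))

def quadratic_score(q, a):
    # Quadratic scoring rule: 1 minus the squared distance to the truth e_a.
    return 1 - sum((x - (1 if i == a else 0)) ** 2 for i, x in enumerate(q))
```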

In our case, given that the right answer was Lausanne, the score of a student who answered $q = (q_1, q_2, q_3, q_4)$ is $S(q) = 1 - q_1^2 - (1 - q_2)^2 - q_3^2 - q_4^2$. In other words, the student starts from 1 point and, for each possibility $i$, loses the square of the distance between their answer $q_i$ and the true answer (0% or 100%).

In our case, Alice would win $1 - 0.33^2 - 0.67^2 - 0.33^2 - 0.01^2 \approx 0.33$ points, while Bob would win $1 - 0.05^2 - 0.60^2 - 0.50^2 - 0.05^2 \approx 0.39$ points. Of course, the right answer (100% on Lausanne) would win 1 point, while any maximally wrong answer, like 100% on Geneva, would lose 1 point.

Perhaps more interestingly, a maximally ignorant student who answers 25% for each possibility would win $1 - 0.75^2 - 3 \times 0.25^2 = 0.25$ points. This is much better than the expected score of a random deterministic guess, which equals $\frac{1}{4}(+1) + \frac{3}{4}(-1) = -0.5$. Exaggerated guesses get greatly penalized. In fact, in expectation, they yield negative points!

Formally, the quadratic scoring rule equals $S(q) = 1 - \|q - e_a\|_2^2$, where $e_a$ is the basis vector whose entries are all zeros except for the $a$-th coordinate, which is 1. If there are $n$ answers, then the maximally ignorant student wins $1/n$ points, while the random deterministic guesser wins an expectation of $2/n - 1$ points.
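Reusing the `quadratic_score` function sketched above, we can check the numbers of this section (with Lausanne at 0-based index 1):

```python
alice = [0.33, 0.33, 0.33, 0.01]
bob = [0.05, 0.40, 0.50, 0.05]
ignorant = [0.25] * 4
lausanne = 1  # 0-based index of the right answer

print(quadratic_score(alice, lausanne))     # 0.3332, i.e. about 0.33
print(quadratic_score(bob, lausanne))       # 0.385, i.e. about 0.39
print(quadratic_score(ignorant, lausanne))  # 0.25, i.e. 1/n for n = 4

# Expected score of a uniformly random deterministic guess: 2/n - 1.
n = 4
print(2 / n - 1)  # -0.5
```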

Note also that $E[S(q) \mid p] = \|p\|_2^2 - \|q - p\|_2^2$, where $p$ is the credence and $q$ is the answer. This is clearly maximal for $q = p$. In fact, interestingly, it is maximal even if we allow $q \in \mathbb{R}^n$ (i.e., even if we don’t tell students that their probabilistic answers need to add up to 1, they will eventually learn that this is the way to go). In particular, the honest answer yields an expected score of $E[S] = \|p\|_2^2$, which indeed reflects how certain the student is.
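For completeness, here is how this identity can be derived from the definition of the quadratic scoring rule above (a standard computation):

```latex
\begin{align*}
E[S(q) \mid p]
  &= \sum_a p_a \left( 1 - \|q - e_a\|_2^2 \right) \\
  &= 1 - \sum_a p_a \left( \|q\|_2^2 - 2 q_a + 1 \right)
     \qquad \text{since } \|q - e_a\|_2^2 = \|q\|_2^2 - 2 q_a + 1 \\
  &= 2 \, p \cdot q - \|q\|_2^2
     \qquad \text{since } \textstyle\sum_a p_a = 1 \\
  &= \|p\|_2^2 - \|q - p\|_2^2 .
\end{align*}
```

Since only the $\|q - p\|_2^2$ term depends on $q$, the expected score is maximized exactly at $q = p$.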

Why this is important

Because wrong answers are penalized much more than acknowledged ignorance, students who aim to maximize their scores will likely learn, consciously or not, that guessing deterministic answers is just wrong. They may even pick up the habit of second-guessing their intuitions and of adding uncertainty to their first guesses. In terms of rationality, this seems like a huge deal!

Perhaps equally importantly, such Bayesian examinations incentivize students to take on probabilistic reasoning. Students may thereby learn to appropriately calibrate their levels of confidence, and to reason with (epistemological) uncertainty. As an aspiring Bayesian, I find this the part I’m most excited about!

Finally, and probably even more importantly, such examinations incentivize intellectual honesty. This is the habit of trying to be honest, not only with others, but also with ourselves. It’s sometimes said that “a bet is a tax on bullshit”, as argued by Alex Tabarrok. Arguably, Bayesian examinations are even better than a bet. Indeed, in (important) exams, we might make an even bigger effort than when we put our money where our mouth is!

In case you’re still not convinced by the importance of intellectual honesty, I highly recommend this talk by Julia Galef or her upcoming book (as well as, say, Tetlock and Gardner’s Superforecasting book).

Where to go from here

I haven’t had the chance to test these ideas though. I wonder how students and teachers will feel about it. I suspect some pushback early on. But I would also bet that students may eventually appreciate it. To find out, I guess this really needs to be tested out there!

One platform where this could be a great first step is MOOCs and other online websites where people enter their answers electronically. If you happen to be working in such areas, or to know people working in these areas, I think it would be great to encourage a trial of Bayesian examinations! And if you do, please send me feedback. And please let me test your exams as well :P

Still another approach would be to develop an app to record Bayesian bets that we make, and to compute our incentive-compatible (quadratic?) scores. Gamifying the app might make it more popular. If anyone is keen on developing such an app, I’d be more than eager to test it, and to train my own Bayesian forecasting abilities!

PS: If you’re French-speaking (or motivated to read subtitles), you can also check out the video I made on the same topic.