Radical Probabilism

abramdemski18 Aug 2020 21:14 UTC

LW: 190 AF: 62

Radical Probabilism Bayes' Theorem Logical Induction Epistemology Probabilistic Reasoning Rationality Problem of Old Evidence Probability & Statistics Conservation of Expected Evidence

This is an expanded version of my talk. I assume a high degree of familiarity with Bayesian probability theory.

Toward a New Technical Explanation of Technical Explanation—an attempt to convey the practical implications of logical induction—was one of my most-appreciated posts, but I don’t really get the feeling that very many people have received the update. Granted, that post was speculative, sketching what a new technical explanation of technical explanation might look like. I think I can do a bit better now.

If the implied project of that post had really been completed, I would expect new practical probabilistic reasoning tools, explicitly violating Bayes’ law. For example, we might expect:

A new version of information theory.
- An update to the “prediction=compression” maxim, either repairing it to incorporate the new cases, or explicitly denying it and providing a good intuitive account of why it was wrong.
- A new account of concepts such as mutual information, allowing for the fact that variables have behavior over thinking time; for example, variables may initially be very correlated, but lose correlation as our picture of each variable becomes more detailed.
New ways of thinking about epistemology.
- One thing that my post did manage to do was to spell out the importance of “making advanced predictions”, a facet of epistemology which Bayesian thinking does not do justice to.
- However, I left aspects of the problem of old evidence open, rather than giving a complete way to think about it.
New probabilistic structures.
- Bayesian Networks are one really nice way to capture the structure of probability distributions, making them much easier to reason about. Is there anything similar for the new, wider space of probabilistic reasoning which has been opened up?

Unfortunately, I still don’t have any of those things to offer. The aim of this post is more humble. I think what I originally wrote was too ambitious for didactic purposes. Where the previous post aimed to communicate the insights of logical induction by sketching broad implications, I here aim to communicate the insights in themselves, focusing on the detailed differences between classical Bayesian reasoning and the new space of ways to reason.

Rather than talking about logical induction directly, I’m mainly going to explain things in terms of a very similar philosophy which Richard Jeffrey invented—apparently starting with his phd dissertation in the 50s, although I’m unable to get my hands on it or other early references to see how fleshed-out the view was at that point. He called this philosophy radical probabilism. Unlike logical induction, radical probabilism appears not to have any roots in worries about logical uncertainty or bounded rationality. Instead it appears to be motivated simply by a desire to generalize, and a refusal to accept unjustified assumptions. Nonetheless, it carries most of the same insights.

Radical Probabilism has not been very concerned with computational issues, and so constructing an actual algorithm (like the logical induction algorithm) has not been a focus. (However, there have been some developments—see historical notes at the end.) This could be seen as a weakness. However, for the purpose of communicating the core insights, I think this is a strength—there are fewer technical details to communicate.

A terminological note: I will use “radical probabilism” to refer to the new theory of rationality (treating logical induction as merely a specific way to flesh out Jeffrey’s theory). I’m more conflicted about how to refer to the older theory. I’m tempted to just use the term “Bayesian”, implying that the new theory is non-Bayesian—this highlights its rejection of Bayesian updates. However, radical probabilism is Bayesian in the most important sense. Bayesianism is not about Bayes’ Law. Bayesianism is, at core, about the subjectivist interpretation of probability. Radical probabilism is, if anything, much more subjectivist.

However, this choice of terminology makes for a confusion which readers (and myself) will have to carefully avoid: confusion between Bayesian probability theory and Bayesian updates. The way I’m using the term, a Bayesian need not endorse Bayesian updates.

In any case, I’ll default to Jeffrey’s term for the opposing viewpoint: dogmatic probabilism. (I will occasionally fall into calling it “classical Bayesianism” or similar.)

What Is Dogmatic Probabilism?

Dogmatic Probabilism is the doctrine that the conditional probability $P (B | A)$ is also how we update probabilistic beliefs: any rational change in beliefs should be explained by a Bayesian update.

We can unpack this a little:

(Dynamic Belief:) A rational agent is understood to have different beliefs over time—call these $P_{1}, P_{2}, P_{3}, . . .$
(Static Rationality:) At any one time, a rational agent’s beliefs $P_{n}$ are probabilistically coherent (obey the Kolmogorov axioms, or a similar axiomatization of probability theory).
(Empiricism:) Reasons for changing beliefs across time are given entirely by observations—that is, propositions which the agent learns.
(Dogmatism of Perception:) Observations are believed with probability one, once learned.
(Rigidity:) Upon observing a proposition $A$ , conditional probabilities $P (B | A)$ are unmodified.

The assumptions minus empiricism imply that an update on observing $A$ is a Bayesian update: if we start with $P_{n}$ and update on $A$ to get $P_{n + 1}$ , then $P_{n + 1} (A)$ must equal 1, and $P_{n + 1} (B | A) = P_{n} (B | A)$ . So we must have $P_{n + 1} (B) = P_{n} (B | A)$ . Then, empiricism says that this is the only kind of update we can possibly have.

What Is Radical Probabilism?

Radical probabilism accepts assumptions #1 and #2, but rejects the rest. (Logical Induction need not follow axiom #2, either, since beliefs at any given time only approximately follow the probability laws—however, it’s not necessary to discuss this complication here. Jeffrey’s philosophy did not attempt to tackle such things.)

Jeffrey seemed uncomfortable with updating to 100% on anything, making dogmatism of perception untenable. A similar view is already popular on LessWrong, but it seems that no one here took the implication and denied Bayesian updates as a result. (Bayesian updates have been questioned for other reasons, of course.) This is a bit of an embarassment. But fans of Bayesian updates reading this are more likely to accept that zero and one are probabilities, rather than give up Bayes.

Fortunately, this isn’t actually the crux. Radical probabilism is a pure generalization of orthodox Bayesianism; you can have zero and one as probabilities, and still be a radical probabilist. The real fun begins not with the rejection of dogmatism of perception, but with the rejection of rigidity and empiricism.

This gives us a view in which a rational update from $P_{n}$ to $P_{n + 1}$ can be almost anything. (You still can’t update from $P_{n} (A) = 0$ to $P_{n + 1} (A) > 0$ .) Simply put, you are allowed to change your mind. This doesn’t make you irrational.

Yet, there are still some rationality constraints. In fact, we can say a lot about how rational agents think in this model. In place of assumptions #3-#5, we assume rational agents cannot be Dutch Booked.

Radical Probabilism and Dutch Books

Rejecting the Dutch Book for Bayesian Updates

At this point, if you’re familiar with the philosophy of probability theory, you might be thinking: wait a minute, isn’t there a Dutch Book argument for Bayesian updates? If radical probabilism accepts the validity of Dutch Book arguments, shouldn’t it thereby be forced into Bayesian updates?

No!

As it turns out, there is a major flaw in the Dutch Book for Bayesian updates. The argument assumes that the bookie knows how the agent will update. (I encourage the interested reader to read the SEP section on diachronic Dutch Book arguments for details.) Normally, a Dutch Book argument requires the bookie to be ignorant. It’s no surprise if a bookie can take our lunch money by getting us to agree to bets when the bookie knows something we don’t know. So what’s actually established by these arguments is: if you know how you’re going to update, then your update had better be Bayesian.

Actually, that’s not quite right: the argument for Bayesian updates also still assumes dogmatism of perception. If we relax that assumption, all we can really argue for is rigidity: if you know how you are going to update, then your update had better be rigid.

This leads to a generalized update rule, called Jeffrey updating (or Jeffrey conditioning).

Generalized Updates

Jeffrey updates keep the rigidity assumption, but reject dogmatism of perception. So, we’re changing the probability of some sentence $A$ to $P (A) = c$ , without changing any $P (B | A)$ . There’s only one way to do this:

P_{n + 1} (B) = c \cdot P_{n} (B | A) + (1 - c) \cdot P_{n} (B | \neg A)

In other words, the Jeffrey update interpolates linearly between the Bayesian update on $A$ and the Bayesian update on $\neg A$ . This generalizes Bayesian updates to allow for uncertain evidence: we’re not sure we just saw someone duck behind the corner, but we’re 40% sure.

If this way of updating seems a bit arbitrary to you, Jeffrey would agree. It offers only a small generalization of Bayes. Jeffrey wants to open up much broader space:

Classifying updates by assumptions made.

As I’ve already said, the rigidity assumption can only be justified if the agent knows how it will update. Philosophers like to say the agent has a plan for updating: “If I saw a UFO land in my yard and little green men come out, I would believe I was hallucinating.” This is something we’ve worked out ahead of time.

A non-rigid update, on the other hand, means you don’t know how you’d react: “If I saw a convincing proof of P=NP, I wouldn’t know what to think. I’d have to consider it carefully.” I’ll call non-rigid updates fluid updates.

For me, fluid updates are primarily about having longer to think, and reaching better conclusions as a result. That’s because my main motivation for accepting a radical-probabilist view is logical uncertainty. Without such a motivation, I can’t really imagine being very interested. I boggle at the fact that Jeffrey arrived at this view without such a motivation.

Dogmatic Probabilist: All I can say is: why??

Richard Jeffrey: I’ve explained to you how the Dutch Book for Bayesian updates fails. What more do you want? My view is simply what you get when you remove the faulty assumptions and keep the rest.

Dogmatic Probabilist (DP): I understand that, but why should anyone be interested in this theory? OK, sure, I CAN make Jeffrey updates without getting Dutch Booked. But why ever would I? If I see a cloth in dim lighting, and update to 80% confident the cloth is red, I update in that way because of the evidence which I’ve seen, which is itself fully confident. How could it be any other way?

Richard Jeffrey (RJ): Tell me one peice of information you’re absolutely certain of in such a situation.

DP: I’m certain I had that experience, of looking at the cloth.

RJ: Surely you aren’t 100% sure you were looking at cloth. It’s merely very probable.

DP: Fine then. The experience of looking at … what I was looking at.

RJ: I’ll grant you that tautologies have probability one.

DP: It’s not a tautology… it’s the fact that I had an experience, rather than none!

RJ: OK, but you are trying to defend the position that there is some observation, which you condition on, which explains your 80% confidence the cloth is red. Conditioning on “I had an experience, rather than none” won’t do that. What proposition are you confident in, which explains your less-confident updates?

DP: The photons hitting my retinas, which I directly experience.

RJ: Surely not. You don’t have any detailed knowledge of that.

DP: OK, fine, the individual rods and cones.

RJ: I doubt that. Within the retina, before any message gets sent to the brain, these get put through an opponent process which sharpens the contrast and colors. You’re not perceiving rods and cones directly, but rather a probabilistic guess at light conditions based on rod and cone activation.

DP: The output of that process, then.

RJ: Again I doubt it. You’re engaging in inner-outer hocus pocus.* There is no clean dividing line before which a signal is external, and after which that signal has been “observed”. The optic nerve is a noisy channel, warping the signal. And the output of the optic nerve itself gets processed at V1, so the rest of your visual processing doesn’t get direct access to it, but rather a processed version of the information. And all this processing is noisy. Nowhere is anything certain. Everything is a guess. If, anywhere in the brain, there were a sharp 100% observation, then the nerves carrying that signal to other parts of the brain would rapidly turn it into a 99% observation, or a 90% observation...

DP: I begin to suspect you are trying to describe human fallibility rather than ideal rationality.

RJ: Not so! I’m describing how to rationally deal with uncertain observations. The source of this uncertainty could be anything. I’m merely giving human examples to establish that the theory has practical interest for humans. The theory itself only throws out unnecessary assumptions from the usual theory of rationality—as we’ve already discussed.

DP: (sigh...) OK. I’m still never going to design an artificial intelligence to have uncertain observations. It just doesn’t seem like something you do on purpose. But let’s grant, provisionally, that rational agents could do so and still be called rational.

RJ: Great.

DP: So what’s this about giving up rigidity??

RJ: It’s the same story: it’s just another assumption we don’t need.

DP: Right, but then how do we update?

RJ: However we want.

DP: Right, but how? I want a constructive story for where my updates come from.

RJ: Well, if you precommit to update in a predictable fashion, you’ll be Dutch-Bookable unless it’s a rigid fashion.

DP: So you admit it! Updates need to be rigid!

RJ: By no means!

DP: But updates need to come from somewhere. Whether you know it or not, there’s some mechanism in your brain which produces the updates.

RJ: Whether you know it or not is a critical factor. Updates you can’t anticipate need not be Bayesian.

DP: Right, but… the point of epistemology is to give guidance about forming rational beliefs. So you should provide some formula for updating. But any formula is predictable. So a formula has to satisfy the rigidity condition. So it’s got to be a Bayesian update, or at least a Jeffrey update. Right?

RJ: I see the confusion. But epistemology does not have to reduce things to a strict formula in order to provide useful advice. Radical probabilism can still say many useful things. Indeed, I think it’s more useful, since it’s closer to real human experience. Humans can’t always account for why they change their minds. They’ve updated, but they can’t give any account of where it came from.

DP: OK… but… I’m sure as hell never designing an artificial intelligence that way.

I hope you see what I mean. It’s all terribly uninteresting to a typical Bayesian, especially with the design of artificial agents in mind. Why consider uncertainty about evidence? Why study updates which don’t obey any concrete update rules? What would it even mean for an artificial intelligence to be designed with such updates?

In the light of logical uncertainty, however, it all becomes well-motivated. Updates are unpredictable not because there’s no rule behind them—nor because we lack knowledge of what exactly that rule is—but because we can’t always anticipate the results of computations before we finish running them. There are updates without corresponding evidence because we can think longer to reach better conclusions, and doing so does not reduce to Bayesian conditioning on the output of some computation. This doesn’t imply uncertain evidence in exactly Jeffrey’s sense, but it does give us cases where we update specific propositions to confidence levels other than 100%, and want to know how to move other beliefs in response. For example, we might apply a heuristic to determine that some number is very very likely to be prime, and update on this information.

Still, I’m very impressed with Jeffrey for reaching so many of the right conclusions without this motivation.

Other Rationality Properties

So far, I’ve emphasized that fluid updates “can be almost anything”. This makes it sound as if there are essentially no rationality constraints at all! However, this is far from true. We can establish some very important properties via Dutch Book.

Convergence

No single update can be condemned as irrational. However, if you keep changing your mind again and again without ever settling down, that is irrational. Rational beliefs are required to eventually move less and less, converging to a single value.

Proof: If there exists a point $p$ which your beliefs forever oscillate around (that is, your belief falls above $p + c$ infinitely often, and falls below $p - c$ infinitely often, for some $c > 0$ ) then a bookie can make money off of you as follows: when your belief is below $p - c$ , the bookie makes a bet in favor of the proposition in question, at $p : (1 - p)$ odds. When your belief is above $p + c$ , the bookie offers to cancel that bet for a small fee. The bookie earns the fee with certainty, since your beliefs are sure to swing down eventually (allowing the bet to be placed) and are sure to swing up some time after that (allowing the fee to be collected). What’s more, the bookie can do this again and again and again, turning you into a money pump.

If there exists no such $p$ , then your beliefs must converge to some value. $□$

Caveat: this is the proof in the context of logical induction. There are other ways to establish convergence in other formalizations of radical probabilism.

In any case, this is really important. This isn’t just a nice rationality property. It’s a nice rationality property which dogmatic probilists don’t have. Lack of a convergence guarantee is one of the main criticisms Frequentists make of Bayesian updates. And it’s a good critique!

Consider a simple coin-tossing scenario, in which we have two hypotheses: $h_{\frac{1}{3}}$ posits that the probability of heads is $\frac{1}{3}$ , and $h_{\frac{2}{3}}$ posits that the probability of heads is $\frac{2}{3}$ . The prior places probability $\frac{1}{2}$ on both of these hypotheses. The only problem is that the true coin probability is $\frac{1}{2}$ . What happens? The probabilities $P (h_{\frac{1}{3}})$ and $P (h_{\frac{2}{3}})$ will oscillate forever without converging.

Proof: The quantity $h e a d s - t a i l s$ will take a random walk as we keep flipping the fair coin. A random walk returns to zero infinitely often (a phenomenon known as gambler’s ruin). At each such point, evidence is evenly balanced between the two hypotheses, so we’ve returned to the prior. Then, the next flip is either heads or tails. This results in a probability of $\frac{1}{3}$ for one of the hypotheses, and $\frac{2}{3}$ for the other. This sequence of events happens infinitely often, so $P (h_{\frac{1}{3}})$ and $P (h_{\frac{2}{3}})$ keep experiencing changes of size at least $\frac{1}{6}$ , never settling down. $□$

Now, the objection to Bayesian updates here isn’t just that oscillating forever looks irrational. Bayesian updates are supposed to help us predict the data well; in particular, you might think they’re supposed to help us minimize log-loss. But here, we would be doing much better if beliefs would converge toward $P (h_{\frac{1}{3}}) = P (h_{\frac{2}{3}}) = \frac{1}{2}$ . The problem is, Bayes takes each new bit of evidence just as seriously as the last. Really, though, a rational agent in this situation should be saying: “Ugh, this again! If I send my probability up, it’ll come crashing right back down some time later. I should skip all the hassle and keep my probability close to where it is.”

In other words, a rational agent should be looking out for Dutch Books against itself, including the non-convergence Dutch Book. Its probabilities should be adjusted to avoid such Dutch Books.

DP: Why should I be bothered by this example? If my prior is as you describe it, I assign literally zero probability to the world you describe—I know the coin isn’t fair. I’m fine with my inference procedure displaying pathological behavior in a universe I’m absolutely confident I’m not in.

RJ: So you’re fine with an inference procedure which performs abysmally in the real world?

DP: What? Of course not.

RJ: But the real world cannot possibly be in your hypothesis space. It’s too big. You can’t explicitly write it down.

DP: Physicists seem to be making good progress.

RJ: Sure, but those aren’t hypotheses which you can directly use to anticipate your experiences. They require too much computation. Anything that can fit in your head, can’t be the real world.

DP: You’re dealing with human frailty again.

RJ: On the contrary. Even idealized agents can’t fit inside a universe they can perfectly predict. To see the contradiction, just let two of them play rock-paper-scissors with each other. Anything that can anticipate what you expect, and then do something else, can’t be in your hypothesis space. But let me try a different angle of attack. Bayesianism is supposed to be the philosophy of subjective probability. Here, you’re arguing as if the prior represented an objective fact about how the universe is. It isn’t, and can’t be.

DP: I’ll deal with both of those points at once. I don’t really need to assume that the actual universe is within my hypothesis space. Constructing a prior over a set of hypotheses guarantees you this: if there is a best element in that class, you will converge to it. In the coin-flip example, I don’t have the objective universe in my set of hypotheses unless I can perfectly predict every coin-flip. But the subjective hypothesis which treats the coin as fair is the best of its kind. In the rock-paper-scissors example, rational players would similarly converge toward treating each other’s moves as random, with $\frac{1}{3}$ probability on each move.

RJ: Good. But you’ve set up the punchline for me: if there is no best element, you lack a convergence guarantee.

DP: But it seems as if good priors usually do have a best element. Using Laplace’s rule of succession, I can predict coins of any bias without divergence.

RJ: What if the coin lands as follows: 5 heads in a row, then 25 tails, then 125 heads, and so on, each run lasting for the next power of five. Then you diverge again.

DP: Ok, sure… but if the coin flips might not be independent, then I should have hypotheses like that in my prior.

RJ: I could keep trying to give examples which break your prior, and you could keep trying to patch it. But we have agreed on the important thing: good priors should have the convergence property. At least you’ve agreed that this is a desirable property not always achieved by Bayes.

DP: Sure.

In the end, I’m not sure who would win the counterexample/patch game: it’s quite possible that there general priors with convergence guarantees. No computable prior has convergence guarantees for “sufficiently rich” observables (ie, observables including logical combinations of observables). However, that’s a theorem with a lot of caveats. In particular, Solomonoff Induction isn’t computable, so might be immune to the critique. And we can certainly get rid of the problem by restricting the observables, EG by conditioning on their sequential order rather than just their truth. Yet, I suspect all such solutions will either be really dumb, or uncomputable.

So there’s work to be done here.

But, in general (ie without any special prior which does guarantee convergence for restricted observation models), a Bayesian relies on a realizability (aka grain-of-truth) assumption for convergence, as it does for some other nice properties. Radical probabilism demands these properties without such an assumption.

So much for technical details. Another point I want to make is that convergence points at a notion of “objectivity” for the radical probabilist. Although the individual updates a radical probabilist makes can go all over the place, the beliefs must eventually settle down to something. The goal of reasoning is to settle down to that answer as quickly as possible. Updates may appear arbitrary from the outside, but internally, they are always moving toward this goal.

This point is further emphasized by the next rationality property: conservation of expected evidence.

Conservation of Expected Evidence

The law of conservation of expected evidence is a dearly beloved Bayesian principle. You’ll be glad to hear that it survives unscathed:

P_{n} (X) = E_{n} P_{m} (X)

In the above, $P_{n} (X)$ is your current belief in some proposition $X$ ; $P_{m} (X)$ is some future belief about $X$ (so I’m assuming $m > n$ ); and $E_{n}$ is the expected value operator according to your current beliefs. So what the equation says is: your current beliefs equal your expected value of your future beliefs. This is just like the usual formulation of no-expected-net-update, except we no longer take the expectation with respect to evidence, since a non-Bayesian update may not be grounded in evidence.

Proof: Suppose $P_{n} (X) \neq E_{n} P_{m} (X)$ . One of the two numbers is higher, and the other lower. Suppose $E_{n} P_{m} (X)$ is the lower number. Then a bookie can buy a certificate paying $ $P_{m} (X)$ on day $m$ ; we will willingly sell the bookie this for $ $E_{n} P_{m} (X)$ . The bookie can also sell us a certificate paying $1 if $X$ , for a price of $ $P_{n} (X)$ . At time $m$ , the bookie gains $ $P_{m} (X)$ due to the first certificate. It can then buy the second certificate back from us for $ $P_{m} (X)$ , using the winnings. Overall, the bookie has now paid $ $E_{n} P_{m} (X)$ to us, but we have paid the bookie $ $P_{n} (X)$ , which we assumed was greater. So the bookie profits the difference.

If $P_{n} (X)$ is the lower number instead, the same strategy works, reversing all buys and sells. $□$

The key idea here is that both a direct bet on $X$ and a bet on $P_{m} (X)$ will be worth $P_{m} (x)$ later, so they’d better have the same price now, too.

I see this property as being even more important for a radical probabilist than it is for a dogmatic probabilist. For a dogmatic probabilist, it’s a consequence of Bayesian conditional probability. For a radical probabilist, it’s a basic condition on rational updates. With updates being so free to go in any direction, it’s an important anchor-point.

Another name for this law is the martingale property. This is a property of many stochastic processes, such as Brownian motion. From wikipedia:

In probability theory, a martingale is a sequence of random variables (i.e., a stochastic process) for which, at a particular time, the conditional expectation of the next value in the sequence, given all prior values, is equal to the present value.

It’s important that a sequence of rational beliefs have this property. Otherwise, future beliefs are different from current beliefs in a predictable way, and we would be better off updating ahead of time.

Actually, that’s not immediately obvious, right? The bookie in the Dutch Book argument doesn’t make money by updating to the future belief faster than the agent, but rather, by playing the agent’s beliefs off of each other.

This leads me to a stronger property, which has the martingale property as an immediate consequence (strong self trust):

P_{n} (X | P_{m} (X) = y) = y

Again I’m assuming $m > n$ . The idea here is supposed to be: if you knew your own future belief, you would believe it already. Furthermore, you believe $X$ and $P_{m} (X)$ are perfectly correlated: the only way you’d have high confidence in $X$ would be if it were very probably true, and the only way you’d have low confidence would be for it to be very probably false.

I won’t try to prove this one. In fact, be wary: this rationality condition is a bit too strong. The condition holds true in the radical-probabilism formalization of Diachronic Coherence and Radical Probabilism by Brian Skyrms, so long as $P_{n} (P_{m} (X) = y) > 0$ (see section 6 for statement and proof). However, Logical Induction argues persuasively that this condition is undesirable in specific cases, and replaces it with a slightly weaker condition (see section 4.12).

Nonetheless, for simplicity, I’ll proceed as if strong self trust were precisely true.

At the end of the previous section, I promised that the current section would further illuminate my remark:

The goal of reasoning is to settle down to that answer as quickly as possible. Updates may appear arbitrary from the outside, but internally, they are always moving toward this goal.

The way radical probabilism allows just about any change when beliefs shift from $P_{n}$ to $P_{n + 1}$ may make its updates seem irrational. How can the update be anything, and still be called rational? Doesn’t that mean a radical probabilist is open to garbage updates?

No. A radical probabilist doesn’t subjectively think all updates are equally rational. A radical probabilist trusts the progression of their own thinking, and also does not yet know the outcome of their own thinking; this is why I asserted earlier that a fluid update can be just about anything (barring the transformation of a zero into a positive probability). However, this does not mean that a radical probabilist would accept a psychedelic pill which arbitrarily modified their beliefs.

Suppose a radical probabilist has a sequence of beliefs $P_{1}, P_{2}, P_{3}, P_{4}, \dots, P_{n}$ . If they thought hard for a while, they could update to $P_{n + 1}$ . On the other hand, if they took the psychedelic pill, their beliefs would be modified to become $Q$ . The sequence would be abruptly disrupted, and go off the rails: $P_{1}, P_{2}, P_{3}, \dots, P_{n}, Q, R, S, \dots$

The radical probabilist does not trust whatever they believe next. Rather, the radical probabilist has a concept of virtuous epistemic process, and is willing to believe the next output of such a process. Disruptions to the epistemic process do not get this sort of trust without reason. (For those familiar with The Abolition of Man, this concept is very reminiscent of his “Tao”.)

On the other hand, a radical probabilist could trust a different process. One person, $P$ , might trust that another person, $Q$ , is better-informed about any subject:

P_{n} (X | Q_{n} (X) = y) = y

This says that $P$ trusts $Q$ on any subject if they’ve had the same amount of time to think. This leaves open the question of what $P$ thinks if $Q$ has had longer to think. In the extreme case, it might be that $P$ thinks $Q$ is better no matter how long $P$ has to think:

\forall_{m, n} P_{m} (X | Q_{n} (X) = y) = y

On the other hand, $P$ and $Q$ can both be perfectly rational by the standards of radical probabilism and not trust each other at all. $P$ might not trust $Q$ ’s opinion no matter how long $Q$ thinks.

(Note, however, that you do get eventual agreement on matters where good feedback is available—much like in dogmatic Bayesianism, it’s difficult for two Bayesians to disagree about empirical predictions for long.)

This means you can’t necessarily replace one “virtuous epistemic process” with another. $P_{1}, P_{2}, P_{3}, \dots$ and $Q_{1}, Q_{2}, Q_{3}, \dots$ might both be perfectly rational by the standards of radical probabilism, and yet the disrupted sequence $P_{1}, P_{2}, P_{3}, Q_{4}, Q_{5}, Q_{6}, \dots$ would not be, because $P_{3}$ does not necessarily trust $Q_{4}$ or subsequent $Q$ s.

Realistically, we can be in this kind of position and not even know what constitutes a virtuous reasoning process by our standards. We generally think that we can “do philosophy” and reach better conclusions. But we don’t have a clean specification of our own thinking process. We don’t know exactly what counts as a virtuous continuation of our thinking vs a disruption.

This has some implications for AI alignment, but I won’t try to spell them out here.

Calibration

One more rationality property before we move on.

One could be forgiven for reading Eliezer’s A Technical Explanation of Technical Explanation and coming to believe that Bayesian reasoners are calibrated. Eliezer goes so far as to suggest that we define probability in terms of calibration, so that what it means to say “90% probability” is that, in cases where you say 90%, the thing happens 9 out of 10 times.

However, the truth is that calibration is a neglected property in Bayesian probability theory. Bayesian updates do not help you learn to be calibrated, any more than they help your beliefs to be convergent.

We can make a sort of Dutch Book argument for calibration: if things happen 9-out-of-ten times when the agent says 80%, then a bookie can place bets with the agent at 85:15 odds and profit in the long run. (Note, however, that this is a bit different from typical Dutch Book arguments: it’s a strategy in which the bookie risks some money, rather than just getting a sure gain. What I can say is that Logical Induction treats this as a valid Dutch Book, and so, we get a calibration property in that formalism. I’m not sure about other formalisations of Radical Probabilism.)

The intuition is similar to convergence: even lacking a hypothesis to explain it, a rational agent should eventually notice “hey, when I say 80%, the thing happens 90% of the time!”. It can then improve its beliefs in future cases by adjusting upwards.

This illustrates “meta-probabilistic beliefs”: a radical probabilist can have informed opinions about the beliefs themselves. By default, a classical Bayesian doesn’t have beliefs-about-beliefs except as a result of learning about the world and reasoning about themselves as a part of the world, which is problematic in the classical Bayesian formalism. It is possible to add second-order probabilities, third-order, etc. But calibration is a case which collapses all those levels, illustrating how the radical probabilist can handle all of this more naturally.

I’m struck by the way calibration is something Bayesians obviously want. The set of people who advocate applying Bayes Law and the set of people who look at calibration charts for their own probabilities has a very significant overlap. Yet, Bayes’ Law does not give you calibration. It makes me feel like more people should have noticed this sooner and made a bigger deal about it.

Bayes From a Distance

Before any more technical details about radical probabilism, I want to take a step back and give one intuition for what’s going on here.

We can see radical probabilism as what a dogmatic Bayesian looks like if you can’t see all the details.

The Rationality of Acquaintances

Imagine you have a roommate who is perfectly rational in the dogmatic sense: this roommate has low-level observations which are 100% confident, and performs a perfect Bayesian update on those observations.

However, observing your roommate, you can’t track all the details of this. You talk to your roommate about some important beliefs, but you can’t track every little Bayesian update—that would mean tracking every sensory stimulus.

From your perspective, your roommate has constantly shifting beliefs, which can’t quite be accounted for. If you are particularly puzzled by a shift in belief, you can discuss reasons. “I updated against getting a cat because I observed a hairball in our neighbor’s apartment.” Yet, none of the evidence discussed is itself 100% confident—it’s at least a little bit removed from low-level sense-data, and at least a little uncertain.

Yet, this is not a big obstacle to viewing your roommate’s beliefs as rational. You can evaluate these beliefs on their own merits.

I’ve heard this model called Bayes-with-a-side-channel. You have an agent updating via Bayes, but part of the evidence is hidden. You can’t give a formula for changes in belief over time, but you can still assert that they’ll follow conservation of expected evidence, and some other rationality conditions.

What Jeffrey proposes is that we allow these dynamics without necessarily positing a side-channel to explain the unpredictable updates. This has an anti-reductionist flavor to it: updates do not have to reduce to observations. But why should we be reductionist in that way? Why would subjective belief updates need to reduce to observations?

(Note that Bayes-with-a-side-channel does not imply conditions such as convergence and calibration; so, Jeffrey’s theory of rationality is more demanding.)

Wetware Bayes

Of course, Jeffrey would say that our relationship with ourselves is much like the roommate in my story. Our beliefs move around, and while we can often give some account of why, we can’t give a full account in terms of things we’ve learned with 100% confidence. And it’s not simply because we’re a Bayesian reasoner who lacks introspective access to the low-level information. The nature of our wetware is such that there isn’t really any place you can point to and say “this is a 100% known observation”. Jeffrey would go on to point out that there’s no clean dividing line between external and internal, so you can’t really draw a boundary between external event and internal observation-of-that-event.

(I would remark that Jeffrey doesn’t exactly give us a way to handle that problem; he just offers an abstraction which doesn’t chafe on that aspect of reality so badly.)

Rather than imagining that there are perfect observations somewhere in the nervous system, we can instead imagine that a sensory stimulus exerts a kind of “evidential pressure” which can be less than 100%. These evidential pressures can also come from within the brain, as is the case with logical updates.

But Where Do Updates Come From?

Dogmatic probabilism raises the all-important question “where do priors come from?”—but once you answer that, everything else is supposed to be settled. There have been many debates about what constitutes a rational prior.

Q. How can I find the priors for a problem?
A. Many commonly used priors are listed in the Handbook of Chemistry and Physics.

Q. Where do priors originally come from?
A. Never ask that question.

Q. Uh huh. Then where do scientists get their priors?
A. Priors for scientific problems are established by annual vote of the AAAS. In recent years the vote has become fractious and controversial, with widespread acrimony, factional polarization, and several outright assassinations. This may be a front for infighting within the Bayes Council, or it may be that the disputants have too much spare time. No one is really sure.

Q. I see. And where does everyone else get their priors?
A. They download their priors from Kazaa.

Q. What if the priors I want aren’t available on Kazaa?
A. There’s a small, cluttered antique shop in a back alley of San Francisco’s Chinatown. Don’t ask about the bronze rat.
-- Eliezer Yudkowsky, An Intuitive Explanation of Bayes’ Theorem

Radical probabilists put less emphasis on the prior, since a radical probabilist can effectively “decide to have a different prior” (updating their beliefs as if they’d swapped out one prior for another). However, they face a similarly large problem of where updates come from.

We are given a picture in which beliefs are like a small particle in a fluid, reacting to all sorts of forces (some strong and some weak). Its location gradually shifts as a result of Brownian motion. Presumably, the interesting work is being done behind the scenes, by whatever is generating these updates. Yet, Jeffrey’s picture seems to mainly be about the dance of the particle, while the fluid around it remains a mystery.

A full answer to that question is beyond the scope of this post. (Logical Induction offers one fully detailed answer to that question.) However, I do want to make a few remarks on this problem.

It might at first seem strange for beliefs to be so radically malleable to external pressures. But, actually, this is already the familiar Bayesian picture: everything happens due to externally-driven updates.
Bayesian updates don’t really answer the question of where updates come from, either. They take it as given that there are some “observations”. Radical probabilism simply allows for a more general sort of feedback for learning.
An orthodox probabilist might answer this challenge by saying something like: when we design an agent, we design sensors for it. These are connected in such a way as to feed in sensory observations. A radical probabilist can similarly say: when we design an agent, we get to decide what sort of feedback it uses to improve its beliefs.

The next section will give some practical, human examples of non-Bayesian updates.

Virtual Evidence

Bayesian updates are path-independent: it does not matter in what order you apply them. If you first learn $A$ and then learn $B$ , your updated probability distribution is $P_{3} (X) = P_{2} (X | B) = P_{1} (X | A & B)$ . If you learn these facts the other way around, it’s still $P_{3} (X) = P_{2} (X | A) = P_{1} (X | A & B)$ .

Jeffrey updates are path-dependent. Suppose my probability distribution is as follows:

	A	¬A
B	30%	20%
¬B	20%	30%

I then apply the Jeffrey update P(B)=60%:

	A	¬A
B	36%	24%
¬B	16%	24%

Now I apply P(A)=60%:

	A	¬A
B	41.54%	20%
¬B	18.46%	20%

Since this is asymmetric, but the initial distribution was symmetric, obviously this would turn out differently if we had applied the Jeffrey updates in a different order.

Jeffrey considered this to be a bug—although he seems fine with path-dependence under some circumstances, he used examples like the above to motivate a different way of handling uncertain evidence, which I’ll call virtual evidence. (Judea Pearl strongly advocated virtual evidence over Jeffrey’s rule near the beginning of Probabilistic Reasoning in Intelligent Systems (Section 2.2.2 and 2.3.3), in what can easily be read as a critique of Jeffrey’s theory—if one does not realize that Jeffrey is largely in agreement with Pearl. I thoroughly recommend Pearl’s discussion of the details.)

Recall the basic anatomy of a Bayesian update:

The idea of virtual evidence is to use evidence ‘e’ which is not an event in our event space. We’re just acting as if there were evidence ‘e’ which justifies our update. Terms such as P(e), P(e&h), P(e|h), P(h|e), and so on are not given the usual probabilistic interpretation; they just stand as a convenient notation for the update. All we need to know is the likelihood function for the update. We then multiply our probabilities by the likelihood function as usual, and normalize. P(e) is easy to find, since it’s just whatever factor makes everything sum to one at the end. This is good, since it isn’t clear what P(e) would mean for a virtual event.

Actually, we can simplify even further. All we really need to know is the likelihood ratio: the ratio between the two numbers in the likelihood function. (I will illustrate this with an example soon). However, it may sometimes be easier to find the whole likelihood function in practice.

Let’s look at the path-dependence example again. As before, we start with:

	A	¬A
B	30%	20%
¬B	20%	30%

I want to apply a Jeffrey update which makes P(B)=60%. However, let’s represent the update via virtual evidence this time. Currently, P(B)=50%. To take it to 60%, we need to see virtual evidence with a 60:40 likelihood ratio, such as P(B|E)=60%, P(¬B|E)=40%. This gives us the same update as before:

	A	¬A
B	36%	24%
¬B	16%	24%

(Note that we would have gotten the same result with a likelihood function of P(B|E)=3%, P(¬B|E)=2%, since 60:40 is the same as 3:2. That’s what I meant when I said that only the ratio matters.)

But now we want to apply the same update to A as we did to B. So now we update on virtual evidence P(A|E)=60%, P(¬A|E)=40%. This gives us the following (approximately):

	A	¬A
B	43%	19%
¬B	19%	19%

As you can see, the result is quite symmetric. In general, virtual evidence updates will be path-independent, because multiplication is commutative (and the normalization step of updating doesn’t mess with this commutativity).

So, virtual evidence is a reformulation of Jeffrey updates with a lot of advantages:

Unlike raw Jeffrey updates, virtual evidence is path-independent.
You don’t have to decide right away what you’re updating to; you just have to decide the strength and direction of the update.
I don’t fully discuss this here, but Pearl argues persuasively that it’s easier to tell when a virtual-evidence update is appropriate than when a Jeffrey update is appropriate.

Because of these features, virtual evidence is much more useful for integrating information from multiple sources.

Integrating Expert Opinions

Suppose you have an ancient artefact. You want to know whether this artefact was made by ancient aliens. You have some friends who are also curious about ancient aliens, so you enlist their help.

You ask one friend who is a metallurgist. After performing experiments (the details of which you don’t understand), the metallurgist isn’t sure, but gives 80% that the tests would turn out that way if it were of terrestrial origin, and 20% for metals of non-terrestrial origin. (Let’s pretend that ancient aliens would 100% use metals of non-Earth origin, and that ancient humans would 100% use Earth metals.)

You then ask a second friend, who is an anthropologist. The anthropologist uses cultural signs, identifying the style of the art and writing. Based on that information, the anthropologist estimates that it’s half as likely to be of terrestrial origin as alien.

How do we integrate this information? According to Jeffrey and Pearl, we can apply the virtual evidence formula if we think the two expert judgements are independent. What ‘independence’ means for virtual evidence is a bit murky, since the evidence is not part of our probability calculus, so we can’t apply the usual probabilistic definition. However, Pearl argues persuasively that this condition is easier to evaluate in practice than the rigidity condition which governs the applicability of Jeffrey updates. (He also gives an example where rigidity is violated, so a naive Jeffrey update gives a nonsensical result but where virtual evidence can still be easily applied to get a correct result.)

The information provided by the anthropologist and the metallurgist seem to be quite independent types of information (at least, if we ignore the fact that both experts are biased by an interest in ancient aliens), so let’s apply the virtual evidence rule. The likelihood ratio from the metallurgist was 80:20, which simplifies to 4:1. The likelihood ratio from the anthropologist was 1:2. That makes the combined likelihood vector 2:1 in favor of terrestrial origin. We would then combine this with our prior; for example, if we had a prior of 3:1 in favor of a terrestrial origin, our posterior would be 6:1 in favor.

(Note that we also have to think that the virtual evidence is independent of our prior information.)

So, virtual evidence offers a practical way to integrate information when we cannot quantify exactly what the evidence was—a condition which is especially likely when consulting experts. This illustrates the utility of the bayes-with-a-side-channel model mentioned earlier; we are able to deal effectively with evidence, even when the exact nature of the evidence is hidden to us.

A few notes on how we gathered expert information in our hypothetical example.

We asked for likelihood ratios, rather than posterior probabilities. This allows us to combine the information as virtual evidence.
In the case of the metallurgist, it makes sense to ask for likelihood ratios, since the metallurgist is unlikely to have good prior information about the artefact. Asking only for likelihoods allows us to factor out any effect from this poor prior (and instead use our own prior, which may still be poor, but has the advantage of being ours).
In the case of the anthropologist, however, it doesn’t make as much sense—if we trust their expertise, we’re likely to think the anthropologist has a good prior about artefacts. It might have made more sense to ask for the anthropologist’s posterior, take it as our own, and then apply a virtual-evidence update to integrate the metallurgist’s report. (However, if we weren’t able to properly communicate our own prior information to the anthropologist, it would be ignored in such an approach.)
In the case of the metallurgist, it felt more natural to give a full likelihood function, rather than a likelihood ratio. It makes sense to know the probability of test result given a particular substance. It would have made even more sense if the likelihood function were a function of each metal the artefact could be made of, rather than just “terrestrial” or “extraterrestrial”—using broad categories allows the metallurgist’s prior about specific substances to creep in, which might be unfortunate.
In the case of the anthropologist, however, it didn’t make sense to give a full likelihood function. “The probability that the artefact would look exactly the way it looks assuming that it’s made by humans” is very very low, and seems quite difficult and unnatural to evaluate. It seems much easier to come up with a likelihood ratio, comparing the probability of terrestrial and extraterrestrial origin.

Why did Pearl devote several sections to virtual evidence, in a book which is otherwise a bible for dogmatic probabilists? I think the main reason is the close analogy to the mathematics of Bayesian networks. The message-passing algorithm which makes Bayesian networks efficient is almost exactly the virtual evidence procedure I’ve described. If we think of each node as an expert trying to integrate information from its neighbors, then the efficiency of Bayes nets comes from the fact that they can use virtual evidence to update on likelihood functions rather than needing to know about the evidence in detail. This may have even been one source of inspiration for Pearl’s belief propagation algorithm?

Can Dogmatic Probabilists Use Virtual Evidence?

OK, so we’ve put Jeffrey’s radical updates into a more palatable form—one which borrows the structure and notation of classical Bayesian updates.

Does this mean orthodox Bayesians can join the party, and use virtual evidence to accomplish everything a radical probabilist can do?

No.

Virtual evidence abandons the ratio formula.

One of the longstanding axioms of classical Bayesian thought is the ratio formula for conditional probability that Bayes himself introduced:

P (A | B) = \frac{P (A & B)}{P (B)}

Virtual evidence, as an updating practice, holds that $P (A | B)$ can be usefully defined in cases where the ratio $P (A & B) / P (B)$ cannot be usefully defined. Indeed, virtual evidence treats Bayes’ Law (which is usually a derived theorem) as more fundamental than the ratio formula (which is usually taken as a definition).

Granted, dogmatic probabilism as I defined it at the beginning of this post does not explicitly assume the ratio formula. But the assumption is so ingrained that I assume most readers took $P (A | B)$ to mean the ratio.

Still, even so, we can consider a version of dogmatic probabilism which rejects the ratio formula. Couldn’t they use virtual evidence?

Virtual evidence requires probability functions to take arguments which aren’t part of the event space.

Even abandoning the ratio formula, still, it’s hard to see how a dogmatic probabilist could use virtual evidence without abandoning the Kolmogorov axioms as the foundation of probability theory. The Kolmogorov axioms make probabilities a function of events; and events are taken from a pre-defined event space. Virtual evidence constructs new events at will, and does not include them in an overarching event space (so that, for example, virtual evidence $V$ can be defined—so that $P (X | V)$ is meaningful for all $X$ from the event space—without events like $X & V$ being meaningful, as would be required for a sigma-algebra).

I left some wiggle room in my definition, saying that a dogmatic probabilist might endorse the Kolmogorov axioms “or a similar axiomatization of probability theory”. But even the Jeffrey-Bolker axioms, which are pretty liberal, don’t allow enough flexibility for this!

Representing Fluid Updates

A final point about virtual evidence and Jeffrey updates.

Near the beginning of this essay, I gave a picture in which Jeffrey updates generalize Bayesian updates, but fluid updates generalize things even further, opening up the space of possibilities when rigidity does not hold.

However, I should point out that any update is a Jeffrey update on a sufficiently fine partition.

So far, for simplicity, I’ve focused on binary partitions: we’re judging between H and ¬H, rather than a larger set such as $H_{1}, H_{2}, H_{3}$ . However, we can generalize everything to arbitrarily sized partitions, and will often want to do so. I noted that a larger set might have been better when asking the metallurgist about the artefact, since it’s easier to judge the probability of test results given specific metals rather than broad categories.

If we make a partition large enough to cover every possible combination of events, then a Jeffrey update is now just a completely arbitrary shift in probability. Or, alternatively, we can represent arbitrary shifts via virtual evidence, by converting to likelihood-ratio format.

So, these updates are completely general after all.

Granted, there might not be any point to seeing things that way.

Non-Sequential Prediction

One advantage of radical probabilism is that it offers a more general framework for statistical learning theory. I already mentioned, briefly, that it allows one to do away with the realizability/grain-of-truth assumption. This is very important, but not what I’m going to dwell on here. Instead I’m going to talk about non-sequential prediction, which is a benefit of logical induction which I think has been under-emphasized so far.

Information theory—in particular, algorithmic information theory—in particular, Solomonoff induction—is restricted to a sequential prediction frame. This means there’s a very rigid observation model: observations are a sequence of tokens and you always observe the nth token after observing tokens one through n-1.

Granted, you can fit lots of things into a sequential prediction model. However, it is a flaw the otherwise close relationship between Bayesian probability and information theory. You’ll run into this if you try to relate information theory and logic. Can you give an information-theoretic intuition for the laws of probability that deal with logical combinations, such as P(A or B) + P(A and B) = P(A) + P(B)?

I’ve complained about this before, offering a theorem which (somewhat) problematizes the situation, and suggesting that people should notice whether or not they’re making sequential-prediction style assumptions. I almost included related assumptions in my definition of dogmatic probabilism at the beginning of this post, but ultimately it makes more sense to contrast radical probabilism to the more general doctrine of Bayesian updates.

Sequential prediction cares only about the accuracy of beliefs at the moment of observation; the accuracy of the full distribution over the future is reduced to the accuracy about each next bit as it is observed.

If information is coming in “in any old way” rather than according to the assumptions of sequential prediction, then we can construct problematic cases for Solomonoff induction. For example, if we condition the nth bit to be 1 (or 0) when a theorem prover proves (or refutes) the nth sentence of Peano arithmetic, then Solomonoff induction will never assign positive probability to hypotheses consistent with Peano arithmetic, and will therefore do poorly on this prediction task. This is despite the fact that there are computable programs which do better at this prediction task; for example, the same theorem prover running just a little bit faster can have highly accurate beliefs at the moment of observation.

In non-sequential prediction, however, we care about accuracy at every moment, rather than just at the moment of observation. Running the same theorem prover, just one step faster, doesn’t do very well on that metric. It allows you to get things right just in time, but you won’t have any clue about what probabilities to assign before that. We don’t just want the right conclusion; we want to get there as fast as possible, and (in a subtle sense) via a rational path

Part of the difficulty of non-sequential prediction is how to score it. Bayes loss applied to your predictions at the moment of observation, in a sequential prediction setting, seems quite useful. Bayes loss applied to all your beliefs, at every moment does not seem very useful.

Radical probabilism gives us a way to evaluate the rationality of non-sequential predictions—namely, how vulnerable the sequence of belief distributions was to losing money via some sequence of bets.

Sadly, I’m not yet aware of any appropriate generalization of information theory—at least not one that’s very interesting. (You can index information by time, to account for the way probabilities stift over time… but that does not come with a nice theory of communication or compression, which are fundamental to classical information theory.) This is why I objected to prediction=compression in the discussion section of Alkjash’s talk.

To summarize, sequential prediction makes three critical assumptions which may not be true in general:

It assumes observations will always inform us about one of a set of observable variables. In general, Bayesian updates can instead inform us about any event, including complex logical combinations (such as “either the first bit is 1, or the second bit is 0”).
It assumes these observations will be made in a specific sequence, whereas in general updates could come in in any order.
It assumes that what we care about is the accuracy of belief at the time of observation; in general, we may care about the accuracy of beliefs at other times.

The only way I currently know how to get theoretical benefits similar to those of Solomonoff induction while avoiding all three of these assumptions is radical probabilism (in particular, as formalized by logical induction).

(The connection between this section and radical probabilism is notably weaker than the other parts of this essay. I think there is a lot of low-hanging fruit here, fleshing out the space of possible properties, the relationship between various problems and various assumptions, trying to generalize information theory, clarifying our concept of observation models, et cetera.)

Making the Meta-Bayesian Update

In Pascal’s Muggle (long version, short version) Eliezer discusses situations in which he would be forced to make a non-Bayesian update:

But if I actually see strong evidence for something I previously thought was super-improbable, I don’t just do a Bayesian update, I should also question whether I was right to assign such a tiny probability in the first place—whether the scenario was really as complex, or unnatural, as I thought. In real life, you are not ever supposed to have a prior improbability of 10^-100 for some fact distinguished enough to be written down, and yet encounter strong evidence, say 10¹⁰ to 1, that the thing has actually happened. If something like that happens, you don’t do a Bayesian update to a posterior of 10^-90. Instead you question both whether the evidence might be weaker than it seems, and whether your estimate of prior improbability might have been poorly calibrated, because rational agents who actually have well-calibrated priors should not encounter situations like that until they are ten billion days old. Now, this may mean that I end up doing some non-Bayesian updates: I say some hypothesis has a prior probability of a quadrillion to one, you show me evidence with a likelihood ratio of a billion to one, and I say ‘Guess I was wrong about that quadrillion to one thing’ rather than being a Muggle about it.

At the risk of being too cutesy, I want to make two related points:

At the object level, radical probabilism offers a framework in which we can make these sorts of non-Bayesian updates. We can encounter something which makes us question our whole way of thinking. It also allows us to significantly revise that way of thinking, without modeling the situation as something extreme like self-modification (or even something very out of the ordinary).
At the meta level, updating to radical probabilism is itself one of these non-Bayesian updates. Of course, if you were really a hard-wired dogmatic probabilist at core, you would be unable to make such an update (except perhaps if we model it as self-modification). But, since you are already using reasoning which is actually closer in spirit to radical probabilism, you can start to model yourself in this way and using radical-probabilist ideas to guide future updates.

So, I wanted to use this penultimate section for some advice about making the leap.

It All Adds Up to Normality

Radical Probabilism is not a license to update however you want, nor even an invitation to massively change the way you update. It is primarily a new way to understand what you are already doing. Yes, it’s possible that viewing things through this lense (rather than the more narrow lense of dogmatic probabilism) will change the way you see things, and as a consequence, change the way you do things. However, you are not (usually) making some sort of mistake by engaging in the sort of Bayesian reasoning you are familiar with—there is no need to abandon large portions of your thinking.

Instead, try to notice ordinary updates you make which are not perfectly understood as Bayesian updates.

Calibration corrections are not well-modeled as Bayesian updates. If you say to yourself “I’ve been overconfident in similar situations”, and lower your probability, your shift is better-understood as a fluid update.
Many instances of “outside view” are not well-modeled in a Bayesian update framework. You’ve probably seen outside view explained as prior probability. However, you often take the outside view on one of your own arguments, e.g. “I’ve often made arguments like this and been wrong”. This kind of reflection doesn’t fit well in the framework of Bayesian updates, but fits fine in a radical-probabilist picture.
It is often warranted to downgrade the probability of a hypothesis without having an alternative in mind to upgrade. You can start to find a hypothesis suspicious without having any better way of predicting observations. For example, a sequence of surprising events might stick out to you as evidence that your hypothesis is wrong, even though your hypothesis is still the best way that you know to try and predict the data. This is hard to formalize as a Bayesian update. Changes in probability between hypotheses always remain balanced. It’s true that you move the probability to a “not the hypotheses I know” category which balances the probability loss, but it’s not true that this category earned the increased probability by predicting the data better. Instead, you used a set of heuristics which have worked well in the past to decide when to move probabilities around.

Don’t Predictably Violate Bayes

Again, this is not a license to violate Bayes’ Rule whenever you feel like it.

A radical probabilist should obey Bayes’ Law in expectation, in the following sense:

If some evidence E or ¬E is bound to be observed by time m>n, then the following should hold:

E_{n} (P_{m} (H) | E) = P_{n} (H | E)

And the same for ¬E. In other words, you should not expect your updated beliefs to differ from your conditional probabilities on average.

(You should suspect from the fact that I’m not proving this one that I’m playing a bit fast and loose—whether this law holds may depend on the formalization of radical probabilism, and it probably needs some extra conditions I haven’t stated, such as P(E)>0.)

And remember, every update is a Bayesian update, with the right virtual evidence.

Exchange Virtual Evidence

Play around with the epistemic practice Jeffrey suggests. I suspect some of you already do something similar, just not necessarily calling it by this name or looking so closely at what you’re doing.

Don’t Be So Realist About Your Own Utility Function

Note that the picture here is quite compatible with what I said in An Orthodox Case Against Utility Functions. Your utility function need not be computable, and there need not be something in your ontology which you can think of your utility as a function of. All you need are utility expectations, and the ability to update those expectations. Radical Probabilism adds a further twist: you don’t need to be able to predict those updates ahead of time; indeed, you probably can’t. Your values aren’t tied to a function, but rather, are tied to your trust in the ongoing process of reasoning which refines and extends those values (very much like the self-trust discussed in the section on conservation of expected evidence).

Not So Radical After All