A quantum equivalent to Bayes’ rule

This post is an attempt to summarise and explain for the LW readership the contents of this paper: “Quantum Bayes’ rule and Petz transpose map from the minimum change principle”. It’s a highly technical paper heavy on quantum mechanics formalism that took me a couple of days to unpack and digest a bit, but I think it may be important going forward. My work on it is far from done, but this is a quick introduction.

Epistemic status: I have a Physics PhD and spent about ten years working with computational quantum mechanics so hopefully I know what I’m talking about, but if anyone can peer review I’ll be glad for the help.

The tagline of Astral Codex Ten reads:

P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.

This sentence could very well exemplify the ethos of the rationalist community as a whole[1], but looking at it from a physics perspective, it misses something. Bayes’ theorem is a statement about information—it tells us how to update previous knowledge (a distribution of probabilities over potential world-states) using newly acquired information to refine it. Yet, the way it defines the knowledge is classical. There are states, and there are finite, real probabilities (that sum to 1) attached to them.

We know the world not to be classical. The world is quantum. Going into the details of what this implies would make this post quite a bit longer, but for the informational angle that we care about here I direct you to Scott Aaronson’s excellent lecture on the topic, and will only include a very brief summary:

  • a quantum description of a system does not assign real probabilities to each state, but complex amplitudes, whose squared magnitudes sum to 1;

  • if you multiply an entire system by a constant phase factor, nothing changes; differences in phase between states however matter a lot;

  • quantum evolution acts on both amplitudes and phases, and this allows for interference phenomena that enable a lot of the weirder quantum stuff, as amplitudes don’t need to add up the way probabilities do and can even cancel each other out (this is why the infamous double-slit experiment produces fringes);

  • the universe as a whole should by all means be in a “pure” state; however, in many situations it’s convenient and possible to describe ensembles of physical subsystems as being in a mixed state, which represents a classical probability distribution over quantum states. This is mathematically represented with an object called a density matrix: its diagonal is a classical probability distribution over the states (and thus its trace is always 1), while its off-diagonal elements contain the phase information (see the small sketch just after this list). For example, if we had a mole of atoms prepared in some quantum state with a certain distribution of uncertainty, a density matrix would be a good way to describe it;

  • a density matrix whose off-diagonal elements are all zero is “decohered”, and can be considered the classical limit of this description. A decohered density matrix behaves exactly like a classical probability distribution, and follows classical Markovian dynamics;

  • from an information theory viewpoint, coherent quantum states have very different properties. This is what quantum computers are all about; if we exploit the mathematical structure of quantum mechanics, the laws about, for example, which problems are solvable in polynomial time change, because the interference phenomena allow for some new tricks that you couldn’t otherwise pull off.
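
To make the density matrix and decoherence points concrete, here is a tiny NumPy sketch (a toy example of mine, not something from the paper):

```python
import numpy as np

# A qubit in the superposition (|0> + |1>)/sqrt(2), written as a density matrix.
psi = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(psi, psi.conj())
print(rho)            # diagonal: the probabilities (0.5, 0.5); off-diagonal: the coherences

# "Decohering" the state means zeroing the off-diagonal elements.
rho_decohered = np.diag(np.diag(rho))
print(rho_decohered)  # this now behaves exactly like a fair classical coin

print(np.trace(rho), np.trace(rho_decohered))  # the trace stays 1 in both cases
```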

This tells us that having a complete quantum description of information is in fact really important, and we may be missing some crucial elements if we don’t have one. Nevertheless, until now, I had not really seen any satisfactory quantum equivalent of Bayes’ theorem. Even the so-called QBism (Quantum Bayesianism) interpretation of quantum mechanics seemed to lack this element, and operated more in a qualitative sense. While not everyone agrees and the philosophical debate still rages, it is at least reasonable (and often done) to consider the quantum state as a description of our knowledge about a system, rather than necessarily a true, physical thing. It is however strange that we don’t really know how to update that knowledge freely the way we do with the classical kind.

This paper seems to remedy that. I’ll go through the following steps to explain its contents:

  1. first, I’ll re-derive the classical Bayes theorem using their formalism, which they build to create an analogy with the quantum case;

  2. then, I’ll explain qualitatively the way this process is translated to quantum formalism (you can read the paper for the hard stuff but honestly I feel like I wouldn’t add much to it)

  3. finally, I’ll link some code I wrote and show a few examples which I hope are correct.

As I said, this is still very much a WIP. Let me know if you spot any mistakes or want to contribute.

Classical derivation

Suppose you have two systems, $X$ and $Y$, each with a set of possible states:

$$x \in \{x_1, \dots, x_n\}, \qquad y \in \{y_1, \dots, y_m\}.$$

We start with a probability distribution on $X$, $p(x)$, and we also know a conditional probability distribution, or likelihood (which is essentially a matrix), $p(y|x)$. Now suppose we observe a certain state $y$: how then should we update our knowledge of $X$? Bayes’ rule says:

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}, \qquad \text{with} \quad p(y) = \sum_x p(y|x)\,p(x).$$

Now, asks the paper, suppose that instead of a definite value we observe over a certain number of trials a distribution of outcomes, $q(y)$. How are we to generalise this rule to update our knowledge about $X$? A natural extension is:

$$q(x) = \sum_y p(x|y)\,q(y),$$

of which the classic formulation is just the limit in which our $q(y)$ is 1 in one state and 0 everywhere else. You might notice that this is a bit like a stochastic or Markov process, in which $q(y)$ is the starting state and $p(x|y)$ the transition matrix. This is the view taken by the paper: we consider the likelihood and the posterior probability to both be akin to processes, which operate on one distribution to produce another[2]. So their approach to recovering Bayes’ theorem is the following:
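
Here is a quick toy example of this generalised update, with made-up numbers for a two-state $X$ and $Y$, just to fix the linear-algebra picture:

```python
import numpy as np

# Toy two-state example (my own numbers): X is the hidden variable, Y the observed one.
p_x = np.array([0.3, 0.7])                  # prior p(x)
p_y_given_x = np.array([[0.9, 0.4],         # likelihood p(y|x): rows index y, columns index x
                        [0.1, 0.6]])

p_y = p_y_given_x @ p_x                     # marginal p(y)
p_x_given_y = (p_y_given_x * p_x).T / p_y   # Bayes' rule: p(x|y), an n x m matrix

q_y = np.array([0.5, 0.5])                  # observed distribution of outcomes on Y
q_x = p_x_given_y @ q_y                     # generalised update: q(x) = sum_y p(x|y) q(y)
print(q_x)                                  # -> roughly [0.28, 0.72]
```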

  1. express the joint prior probability distribution of $X$ and $Y$, called $\gamma(x, y) = p(y|x)\,p(x)$, by applying our original likelihood to our prior. This expresses our initial expectation or knowledge of the combined system;

  2. express the joint probability distribution informed by our new knowledge, $\gamma_R(x, y) = p_R(x|y)\,q(y)$[3], by applying some hitherto unknown posterior distribution $p_R(x|y)$ to our observed distribution on $Y$;

  3. minimise the distance[4] between the two distributions, which basically means “learn as much as you can from this new information and not one bit more”, and under this “minimum change” principle find the correct formulation for the posterior distribution, which, lo and behold, will turn out to be the well-known Bayes’ rule!

The full derivation follows below. Feel free to skip it if the math is too much; as long as you follow the logic above, that should be enough.

Minimum change derivation of Bayes’ rule

We define our joint distributions:

$$\gamma(x, y) = p(y|x)\,p(x), \qquad \gamma_R(x, y) = p_R(x|y)\,q(y).$$

Here remember that since we’re trying to recover Bayes’ rule, $p_R(x|y)$ is our unknown quantity, which we’re trying to retrieve through a variational principle. We try to minimise the Kullback-Leibler divergence:

$$D_{KL}(\gamma_R \,\|\, \gamma) = \sum_{x, y} \gamma_R(x, y) \log \frac{\gamma_R(x, y)}{\gamma(x, y)},$$

subject to a normalisation constraint:

$$\sum_x p_R(x|y) = 1 \quad \text{for every } y.$$

We can unify this problem by defining an objective function that makes use of Lagrange multipliers:

$$\mathcal{L} = \sum_{x, y} p_R(x|y)\,q(y) \log \frac{p_R(x|y)\,q(y)}{p(y|x)\,p(x)} + \sum_y \lambda_y \left( \sum_x p_R(x|y) - 1 \right).$$

To solve, we differentiate and solve for zero gradient:

$$\frac{\partial \mathcal{L}}{\partial p_R(x|y)} = q(y) \left[ \log \frac{p_R(x|y)\,q(y)}{p(y|x)\,p(x)} + 1 \right] + \lambda_y = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda_y} = \sum_x p_R(x|y) - 1 = 0.$$

Isolating $p_R(x|y)$ in the first equation:

$$p_R(x|y) = \frac{p(y|x)\,p(x)}{q(y)} \, e^{-1 - \lambda_y / q(y)}.$$

Substituting in the second:

$$\sum_x \frac{p(y|x)\,p(x)}{q(y)} \, e^{-1 - \lambda_y / q(y)} = 1 \quad \Rightarrow \quad e^{-1 - \lambda_y / q(y)} = \frac{q(y)}{\sum_x p(y|x)\,p(x)} = \frac{q(y)}{p(y)}.$$

Which then gives us back our Bayes’ rule:

$$p_R(x|y) = \frac{p(y|x)\,p(x)}{p(y)}.$$

QED.
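
If you want to convince yourself numerically, here is a small sanity check of mine (not code from the paper): it minimises the KL divergence under the normalisation constraint with scipy and confirms that the minimiser coincides with Bayes’ rule:

```python
import numpy as np
from scipy.optimize import minimize

# Numerical sanity check with toy numbers: minimise the KL divergence between the two
# joint distributions subject to normalisation, then compare with Bayes' rule.
p_x = np.array([0.3, 0.7])                      # prior p(x)
p_y_given_x = np.array([[0.9, 0.4],             # likelihood p(y|x): rows y, columns x
                        [0.1, 0.6]])
q_y = np.array([0.5, 0.5])                      # observed distribution on Y

gamma = p_y_given_x * p_x                       # joint prior gamma(y, x) = p(y|x) p(x)

def kl(params):
    p_r = params.reshape(2, 2)                  # candidate posterior p_R(x|y): rows x, columns y
    gamma_r = (p_r * q_y).T                     # joint gamma_R(y, x) = p_R(x|y) q(y)
    return np.sum(gamma_r * np.log(gamma_r / gamma))

cons = {"type": "eq", "fun": lambda v: v.reshape(2, 2).sum(axis=0) - 1}  # sum_x p_R(x|y) = 1
res = minimize(kl, x0=np.full(4, 0.5), bounds=[(1e-9, 1)] * 4, constraints=cons)

bayes = (p_y_given_x * p_x).T / (p_y_given_x @ p_x)   # p(x|y) straight from Bayes' rule
print(res.x.reshape(2, 2))                            # numerically minimised posterior
print(bayes)                                          # should be (numerically) the same
```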

Quantum derivation

Now comes the spicy part: how do we make this process into a quantum one? The process analogy is crucial, because we know how to apply transformations to quantum systems! The most general such transformation is called a “quantum channel”, and it can express any kind of map from one state to another, including irreversible ones (which could, for example, simulate interaction with an outside environment). The important constraint is that a channel must preserve the trace (technically, it must be completely positive and trace-preserving), so that probabilities always keep summing to 1. A channel usually represents some kind of time evolution, but the formalism doesn’t require that. So we can establish a correspondence:

  • a classical probability distribution becomes a quantum density matrix

  • a classical process (like the likelihood, or the posterior) becomes a quantum channel

As long as we can express the joint probability distribution as a quantum density matrix, we can then apply some measure of distance between states (the one they use is called quantum fidelity, though their convention does not include the squaring that appears on Wikipedia); given a prior density matrix, a quantum channel expressing the likelihood, and an observed end state, we can then maximise this fidelity (namely, minimise the distance) between the two joint probability distributions to find a “reversed” quantum channel that back-propagates the observed distribution into an updated knowledge of our system.
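
As an aside, QuTiP ships a fidelity function which, as far as I can tell, uses the same non-squared convention; a quick toy example:

```python
import numpy as np
from qutip import Qobj, fidelity

# Two single-qubit density matrices that differ only in how coherent they are (toy numbers).
rho = Qobj(np.array([[0.5, -0.5j], [0.5j, 0.5]]))     # fully coherent
sigma = Qobj(np.array([[0.5, -0.4j], [0.4j, 0.5]]))   # partially decohered
print(fidelity(rho, sigma))   # about 0.95; identical states would give exactly 1
```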

What happens in practice is that, given a quantum state on a Hilbert space $\mathcal{H}_A$, they “purify” it: roughly, they duplicate it, so that we get a bigger, pure state living on two copies of the space (these are the $A_1$ and $A_2$ labels in the code outputs below), whose two marginals are the original density matrix and its transpose. Then we apply the quantum channel only to one of the two copies, which means we get a final state that is a joint description of both the starting point (the unaltered copy) and the end point (the copy to which the channel was applied).

We can of course do the same in reverse if we know the final state instead: duplicate it, apply the backwards channel to one of the copies (if we want the two joint states to be comparable, it has to be the opposite copy relative to the forward process), and get another joint quantum state out.
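
Here is a minimal QuTiP sketch of my own of the purification step for a single qubit, using the coherent prior from the second example below; the flip probability and the variable names are my choices, not the paper’s:

```python
import numpy as np
from qutip import Qobj, basis, tensor, qeye, sigmax

# Prior: the coherent state used in the second example below, (|0> + i|1>)/sqrt(2).
gamma = Qobj(np.array([[0.5, -0.5j], [0.5j, 0.5]]))

# Canonical purification: |psi> = (sqrt(gamma) x I) (|00> + |11>).
omega = tensor(basis(2, 0), basis(2, 0)) + tensor(basis(2, 1), basis(2, 1))
psi = tensor(gamma.sqrtm(), qeye(2)) * omega
rho_joint = psi * psi.dag()

print(rho_joint.ptrace(0))   # marginal of the first copy: the prior itself
print(rho_joint.ptrace(1))   # marginal of the second copy: the *transposed* prior

# A bit-flip channel with flip probability p = 0.2 (my choice), acting only on the first copy.
p = 0.2
K0, K1 = np.sqrt(1 - p) * qeye(2), np.sqrt(p) * sigmax()
rho_after = (tensor(K0, qeye(2)) * rho_joint * tensor(K0, qeye(2)).dag()
             + tensor(K1, qeye(2)) * rho_joint * tensor(K1, qeye(2)).dag())

print(rho_after.ptrace(0))   # the prior pushed through the channel
print(rho_after.ptrace(1))   # still the transposed prior: that copy was untouched
```

The same pair of marginals, prior and transposed prior, is what shows up in the “ptrace” lines of the outputs further down.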

The final result

We have these correspondences:

| Element | Classical | Quantum |
| --- | --- | --- |
| Prior | Distribution $p(x)$ | Density matrix $\gamma$ |
| Likelihood | Conditional distribution $p(y \mid x)$ | Quantum channel $\mathcal{E}$ |
| Observed information | Distribution $q(y)$ | Density matrix $\tau$ |
| Posterior | Conditional distribution $p_R(x \mid y)$ | Quantum channel $\mathcal{R}$ |

Then the “posterior channel” $\mathcal{R}$ that maximises the fidelity is derived in the paper in closed form; in general it depends on the observed state $\tau$ as well as on $\gamma$ and $\mathcal{E}$ (I’ll refer you to the paper for the explicit expression). In the case that $\tau$ and $\mathcal{E}(\gamma)$ commute (which will, for example, commonly happen if we take the prior to be uniform), it reduces to

$$\mathcal{R}(\tau) = \gamma^{1/2}\, \mathcal{E}^\dagger\!\left( \mathcal{E}(\gamma)^{-1/2}\, \tau\, \mathcal{E}(\gamma)^{-1/2} \right) \gamma^{1/2},$$

where $\mathcal{E}^\dagger$ is the adjoint of the channel. This formula is also known as the Petz transpose map, and, as a map, it is completely independent of $\tau$.
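
The Petz transpose map is simple enough to implement directly. Here is a minimal NumPy/SciPy sketch of mine (not the paper’s code) for a single-qubit bit-flip channel; with a uniform prior and a decohered observation it reproduces the purely classical update that shows up in the first output below:

```python
import numpy as np
from scipy.linalg import sqrtm

p = 0.2
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
kraus = [np.sqrt(1 - p) * I2, np.sqrt(p) * X]            # Kraus operators of the channel E

def channel(rho, ops):
    return sum(K @ rho @ K.conj().T for K in ops)        # E(rho)

def adjoint_channel(rho, ops):
    return sum(K.conj().T @ rho @ K for K in ops)        # E^dagger(rho)

def petz(tau, gamma, ops):
    """Petz transpose map of E with respect to the prior gamma, applied to tau."""
    e_gamma_inv_sqrt = np.linalg.inv(sqrtm(channel(gamma, ops)))
    return sqrtm(gamma) @ adjoint_channel(e_gamma_inv_sqrt @ tau @ e_gamma_inv_sqrt, ops) @ sqrtm(gamma)

gamma = 0.5 * I2                      # uniform (maximally mixed) prior
tau = np.diag([0.8, 0.2])             # observed, fully decohered state
print(petz(tau, gamma, kraus).real)   # -> diag(0.68, 0.32): the classical Bayesian update
```

Here the adjoint channel is just the channel with each Kraus operator daggered, which is what adjoint_channel implements.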

Crunching some numbers

Here is the code I wrote, using Python and the QuTiP library, for a quick test of this process on the simplest possible system (a single qubit subject to a probabilistic flip).

Here are a few outputs for very simple cases.

Uniform prior, decohered output

This is a purely classical case. We’re starting with a uniform, decohered prior on the qubit, and after applying a spin flip with probability 0.2 we observe a fully classical state (probabilities 0.8 and 0.2 of up and down respectively).

Starting gamma (prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Starting tau (observed distribution on Y):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.68+0.j 0.  +0.j]
 [0.  +0.j 0.32+0.j]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Fidelity: 0.9486833043041707

The result is as expected from the classical Bayes’ theorem: the updated knowledge on $X$ is $(0.68, 0.32)$.
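
For reference, here is a two-line classical cross-check of the same numbers, assuming the flip probability of 0.2 used above:

```python
import numpy as np

lik = np.array([[0.8, 0.2], [0.2, 0.8]])           # p(y|x) for a flip probability of 0.2
posterior = (lik * [0.5, 0.5]).T / (lik @ [0.5, 0.5])
print(posterior @ [0.8, 0.2])                      # -> [0.68 0.32]
```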

Coherent prior, output set to the observed state

This is a case of setting a coherent prior (the qubit is in a perfect up + down superposition) and then setting the observation to exactly match the output, which should retrieve the original state.

Starting gamma (prior):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]

Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j  0. +0.5j]
 [0. -0.5j 0.5+0.j ]]

Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j  0. +0.5j]
 [0. -0.5j 0.5+0.j ]]

Starting tau (observed distribution on Y):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]

Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j  0. +0.4j]
 [0. -0.4j 0.5+0.j ]]

Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j  0. +0.4j]
 [0. -0.4j 0.5+0.j ]]

Fidelity: 0.9000000050662574

We see that this definitely does happen: the guess about $X$ is correct. But the fidelity is not 1. This is not necessarily a contradiction. The matrices printed out here are merely partial traces, which discard the fact that, this being a quantum description, there are correlations between the two subsystems that are expressed only in the full joint density matrix. It’s those correlations, not printed out here but still encoded in the off-diagonal terms of the joint density matrix, that contribute to the imperfect fidelity. I assume this is a bit like the difference between starting with a prior that your coin is fair before observing any throws, versus observing 10 throws that fall perfectly into 5 heads and 5 tails: your estimate of the coin’s bias is not going to change, but your overall belief distribution does. But there might be something more subtle I’ve missed.

Conclusion

I don’t really have a good, impactful result to conclude this on. I wanted to put this post out quickly so that others can also look at this and potentially contribute. My impression, however, is that there are some really interesting things to attempt here. One obvious thing I might try next is quantum measurement: you can express measurement in terms of quantum channels, and “how do I do Bayesian inference back to the original state of a system given the classical outcomes of a quantum measurement” seems like an interesting question to investigate, one that might yield some insights on the way our knowledge interacts with quantum systems.

  1. ^

    Die-hard Bayesianism and an above average appreciation for obscure kabbalistic culture references.

  2. ^

    If you want to think about them in linear algebra terms, since we’re working with finite numbers of states:

    • a probability distribution is going to be a vector;

    • the likelihood and the posterior distributions are $m \times n$ and $n \times m$ matrices respectively (with $n$ the number of states of $X$ and $m$ that of $Y$);

    • the joint probability distributions are also matrices; they come about by multiplying the columns of those process matrices by the elements of the corresponding input probability distribution;

    • an output distribution is produced by using a dot product between a process (matrix) and a probability distribution on a single system (vector), resulting in a probability distribution on the other system (vector).

  3. ^

    I assume the R subscript is supposed to stand for “reversed”.

  4. ^

    They suggest multiple ones work, but I focused on the Kullback-Leibler divergence.