A quantum equivalent to Bayes’ rule

This post is an attempt to summarise and explain for the LW readership the contents of this paper: “Quantum Bayes’ rule and Petz transpose map from the minimum change principle”. It’s a highly technical paper heavy on quantum mechanics formalism that took me a couple of days to unpack and digest a bit, but I think it may be important going forward. My work on it is far from done, but this is a quick introduction.

Epistemic status: I have a Physics PhD and spent about ten years working with computational quantum mechanics so hopefully I know what I’m talking about, but if anyone can peer review I’ll be glad for the help.

The tagline of Astral Codex Ten reads:

P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.

This sentence could very well exemplify the ethos of the rationalist community as a whole[1], but looking at it from a physics perspective, it misses something. Bayes’ theorem is a statement about information—it tells us how to update previous knowledge (a distribution of probabilities over potential world-states) using newly acquired information to refine it. Yet, the way it defines the knowledge is classical. There are states, and there are finite, real probabilities (that sum to 1) attached to them.

We know the world not to be classical. The world is quantum. Going into the details of what this implies would make this post quite a bit longer, but for the informational angle that we care about here I direct you to Scott Aaronson’s excellent lecture on the topic, and will only include a very brief summary:

  • a quantum description of a system does not assign real probabilities to each state, but complex amplitudes, whose squared magnitudes sum to 1;

  • if you multiply an entire system by a constant phase factor, nothing changes; differences in phase between states however matter a lot;

  • quantum evolution acts on both amplitudes and phases, and this allows for interference phenomena that enable a lot of the weirder quantum stuff, as amplitudes don’t need to add up the way probabilities do and can even cancel each other out (this is why the infamous double-slit experiment produces fringes);

  • the universe as a whole should by all means be in a “pure” state; however, in many situations it’s convenient and possible to describe ensembles of physical subsystems as being in a mixed state, which represents a classical probability distribution over quantum states. This is mathematically represented with an object called a density matrix: its diagonal is a classical probability distribution over the states (and thus its trace is always 1), while its off-diagonal elements contain the phase information (see the small sketch just after this list). For example, if we had a mole of atoms prepared in some quantum state with a certain distribution of uncertainty, a density matrix would be a good way to describe it;

  • a density matrix whose off-diagonal elements are all zero is “decohered”, and can be considered the classical limit of this description. A decohered density matrix behaves exactly like a classical probability distribution, and follows classical Markovian dynamics;

  • from an information theory viewpoint, coherent quantum states have very different properties. This is what quantum computers are all about; if we exploit the mathematical structure of quantum mechanics, the laws about, for example, which problems are solvable in polynomial time change, because the interference phenomena allow for some new tricks that you couldn’t otherwise pull off.
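
To make the density matrix and decoherence points concrete, here is a tiny NumPy sketch (a toy example of mine, not something from the paper):

```python
import numpy as np

# A qubit in the superposition (|0> + |1>)/sqrt(2), written as a density matrix.
psi = np.array([1.0, 1.0]) / np.sqrt(2)
rho = np.outer(psi, psi.conj())
print(rho)            # diagonal: the probabilities (0.5, 0.5); off-diagonal: the coherences

# "Decohering" the state means zeroing the off-diagonal elements.
rho_decohered = np.diag(np.diag(rho))
print(rho_decohered)  # this now behaves exactly like a fair classical coin

print(np.trace(rho), np.trace(rho_decohered))  # the trace stays 1 in both cases
```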

This tells us that having a complete quantum description of information is in fact really important, and we may be missing some crucial elements if we don’t have one. Nevertheless, until now, I had not really seen any satisfactory quantum equivalent of Bayes’ theorem. Even the so-called QBism (Quantum Bayesianism) interpretation of quantum mechanics seemed to lack this element, and operated more in a qualitative sense. While not everyone agrees and the philosophical debate still rages, it is at least reasonable (and often done) to consider the quantum state as a description of our knowledge about a system, rather than necessarily a true, physical thing. It is however strange that we don’t really know how to update that knowledge freely the way we do with the classical kind.

This paper seems to remedy that. I’ll go through the following steps to explain its contents:

  1. first, I’ll re-derive the classical Bayes theorem using their formalism, which they build to create an analogy with the quantum case;

  2. then, I’ll explain qualitatively the way this process is translated to quantum formalism (you can read the paper for the hard stuff but honestly I feel like I wouldn’t add much to it)

  3. finally, I’ll link some code I wrote and show a few examples which I hope are correct.

As I said, this is still very much a WIP. Let me know if you spot any mistakes or want to contribute.

Classical derivation

Suppose you have two systems, $X$ and $Y$, each with a set of possible states:

$$x \in \{x_1, \dots, x_n\}, \qquad y \in \{y_1, \dots, y_m\}.$$

We start with a probability distribution on $X$, $p(x)$, and we also know a conditional probability distribution, or likelihood (which is essentially a matrix), $p(y|x)$. Now suppose we observe a certain state $y$: how then should we update our knowledge of $X$? Bayes’ rule says:

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}, \qquad \text{with} \quad p(y) = \sum_x p(y|x)\,p(x).$$

Now, asks the paper, suppose that instead of a definite value we observe over a certain number of trials a distribution of outcomes, $q(y)$. How are we to generalise this rule to update our knowledge about $X$? A natural extension is:

$$q(x) = \sum_y p(x|y)\,q(y),$$

of which the classic formulation is just the limit in which our $q(y)$ is 1 in one state and 0 everywhere else. You might notice that this is a bit like a stochastic or Markov process, in which $q(y)$ is the starting state and $p(x|y)$ the transition matrix. This is the view taken by the paper: we consider the likelihood and the posterior probability to both be akin to processes, which operate on one distribution to produce another[2]. So their approach to recovering Bayes’ theorem is the following:
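
Here is a quick toy example of this generalised update, with made-up numbers for a two-state $X$ and $Y$, just to fix the linear-algebra picture:

```python
import numpy as np

# Toy two-state example (my own numbers): X is the hidden variable, Y the observed one.
p_x = np.array([0.3, 0.7])                  # prior p(x)
p_y_given_x = np.array([[0.9, 0.4],         # likelihood p(y|x): rows index y, columns index x
                        [0.1, 0.6]])

p_y = p_y_given_x @ p_x                     # marginal p(y)
p_x_given_y = (p_y_given_x * p_x).T / p_y   # Bayes' rule: p(x|y), an n x m matrix

q_y = np.array([0.5, 0.5])                  # observed distribution of outcomes on Y
q_x = p_x_given_y @ q_y                     # generalised update: q(x) = sum_y p(x|y) q(y)
print(q_x)                                  # -> roughly [0.28, 0.72]
```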

  1. express the joint prior probability distribution of $X$ and $Y$, called $\gamma(x, y) = p(y|x)\,p(x)$, by applying our original likelihood to our prior. This expresses our initial expectation or knowledge of the combined system;

  2. express the joint probability distribution informed by our new knowledge, $\gamma_R(x, y) = p_R(x|y)\,q(y)$[3], by applying some hitherto unknown posterior distribution $p_R(x|y)$ to our observed distribution on $Y$;

  3. minimise the distance[4] between the two distributions, which basically means “learn as much as you can from this new information and not one bit more”, and under this “minimum change” principle find the correct formulation for the posterior distribution, which, lo and behold, will turn out to be the well-known Bayes’ rule!

The full derivation follows below. Feel free to skip it if the math is too much; as long as you follow the logic above, that should be enough.

Minimum change derivation of Bayes’ rule

We define our joint distributions:

$$\gamma(x, y) = p(y|x)\,p(x), \qquad \gamma_R(x, y) = p_R(x|y)\,q(y).$$

Here remember that since we’re trying to recover Bayes’ rule, $p_R(x|y)$ is our unknown quantity, which we’re trying to retrieve through a variational principle. We try to minimise the Kullback-Leibler divergence:

$$D_{KL}(\gamma_R \,\|\, \gamma) = \sum_{x, y} \gamma_R(x, y) \log \frac{\gamma_R(x, y)}{\gamma(x, y)},$$

subject to a normalisation constraint:

$$\sum_x p_R(x|y) = 1 \quad \text{for every } y.$$

We can unify this problem by defining an objective function that makes use of Lagrange multipliers:

$$\mathcal{L} = \sum_{x, y} p_R(x|y)\,q(y) \log \frac{p_R(x|y)\,q(y)}{p(y|x)\,p(x)} + \sum_y \lambda_y \left( \sum_x p_R(x|y) - 1 \right).$$

To solve, we differentiate and solve for zero gradient:

$$\frac{\partial \mathcal{L}}{\partial p_R(x|y)} = q(y) \left[ \log \frac{p_R(x|y)\,q(y)}{p(y|x)\,p(x)} + 1 \right] + \lambda_y = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda_y} = \sum_x p_R(x|y) - 1 = 0.$$

Isolating $p_R(x|y)$ in the first equation:

$$p_R(x|y) = \frac{p(y|x)\,p(x)}{q(y)} \, e^{-1 - \lambda_y / q(y)}.$$

Substituting in the second:

$$\sum_x \frac{p(y|x)\,p(x)}{q(y)} \, e^{-1 - \lambda_y / q(y)} = 1 \quad \Rightarrow \quad e^{-1 - \lambda_y / q(y)} = \frac{q(y)}{\sum_x p(y|x)\,p(x)} = \frac{q(y)}{p(y)}.$$

Which then gives us back our Bayes’ rule:

$$p_R(x|y) = \frac{p(y|x)\,p(x)}{p(y)}.$$

QED.
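
If you want to convince yourself numerically, here is a small sanity check of mine (not code from the paper): it minimises the KL divergence under the normalisation constraint with scipy and confirms that the minimiser coincides with Bayes’ rule:

```python
import numpy as np
from scipy.optimize import minimize

# Numerical sanity check with toy numbers: minimise the KL divergence between the two
# joint distributions subject to normalisation, then compare with Bayes' rule.
p_x = np.array([0.3, 0.7])                      # prior p(x)
p_y_given_x = np.array([[0.9, 0.4],             # likelihood p(y|x): rows y, columns x
                        [0.1, 0.6]])
q_y = np.array([0.5, 0.5])                      # observed distribution on Y

gamma = p_y_given_x * p_x                       # joint prior gamma(y, x) = p(y|x) p(x)

def kl(params):
    p_r = params.reshape(2, 2)                  # candidate posterior p_R(x|y): rows x, columns y
    gamma_r = (p_r * q_y).T                     # joint gamma_R(y, x) = p_R(x|y) q(y)
    return np.sum(gamma_r * np.log(gamma_r / gamma))

cons = {"type": "eq", "fun": lambda v: v.reshape(2, 2).sum(axis=0) - 1}  # sum_x p_R(x|y) = 1
res = minimize(kl, x0=np.full(4, 0.5), bounds=[(1e-9, 1)] * 4, constraints=cons)

bayes = (p_y_given_x * p_x).T / (p_y_given_x @ p_x)   # p(x|y) straight from Bayes' rule
print(res.x.reshape(2, 2))                            # numerically minimised posterior
print(bayes)                                          # should be (numerically) the same
```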

Quantum derivation

Now comes the spicy part: how do we make this process into a quantum one? The process analogy is crucial, because we know how to apply transformations to quantum systems! The most general such transformation is called a “quantum channel”, and it can express any kind of map from one state to another, including irreversible ones (which could, for example, simulate interaction with an outside environment). The important constraint is that a channel must preserve the trace (technically, it must be completely positive and trace-preserving), so that probabilities always keep summing to 1. A channel usually represents some kind of time evolution, but the formalism doesn’t require that. So we can establish a correspondence:

  • a classical probability distribution becomes a quantum density matrix

  • a classical process (like the likelihood, or the posterior) becomes a quantum channel

As long as we can express the joint probability distribution as a quantum density matrix, we can then apply some measure of distance between states (the one they use is called quantum fidelity, though their convention does not include the squaring that appears on Wikipedia); given a prior density matrix, a quantum channel expressing the likelihood, and an observed end state, we can then maximise this fidelity (namely, minimise the distance) between the two joint probability distributions to find a “reversed” quantum channel that back-propagates the observed distribution into an updated knowledge of our system.
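
As an aside, QuTiP ships a fidelity function which, as far as I can tell, uses the same non-squared convention; a quick toy example:

```python
import numpy as np
from qutip import Qobj, fidelity

# Two single-qubit density matrices that differ only in how coherent they are (toy numbers).
rho = Qobj(np.array([[0.5, -0.5j], [0.5j, 0.5]]))     # fully coherent
sigma = Qobj(np.array([[0.5, -0.4j], [0.4j, 0.5]]))   # partially decohered
print(fidelity(rho, sigma))   # about 0.95; identical states would give exactly 1
```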

What happens in practice is that, given a quantum state on a Hilbert space $\mathcal{H}_A$, they “purify” it: roughly, they duplicate it, so that we get a bigger, pure state living on two copies of the space (these are the $A_1$ and $A_2$ labels in the code outputs below), whose two marginals are the original density matrix and its transpose. Then we apply the quantum channel only to one of the two copies, which means we get a final state that is a joint description of both the starting point (the unaltered copy) and the end point (the copy to which the channel was applied).

We can of course do the same in reverse if we know the final state instead: duplicate it, apply the backwards channel to one of the copies (if we want the two joint states to be comparable, it has to be the opposite copy relative to the forward process), and get another joint quantum state out.
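
Here is a minimal QuTiP sketch of my own of the purification step for a single qubit, using the coherent prior from the second example below; the flip probability and the variable names are my choices, not the paper’s:

```python
import numpy as np
from qutip import Qobj, basis, tensor, qeye, sigmax

# Prior: the coherent state used in the second example below, (|0> + i|1>)/sqrt(2).
gamma = Qobj(np.array([[0.5, -0.5j], [0.5j, 0.5]]))

# Canonical purification: |psi> = (sqrt(gamma) x I) (|00> + |11>).
omega = tensor(basis(2, 0), basis(2, 0)) + tensor(basis(2, 1), basis(2, 1))
psi = tensor(gamma.sqrtm(), qeye(2)) * omega
rho_joint = psi * psi.dag()

print(rho_joint.ptrace(0))   # marginal of the first copy: the prior itself
print(rho_joint.ptrace(1))   # marginal of the second copy: the *transposed* prior

# A bit-flip channel with flip probability p = 0.2 (my choice), acting only on the first copy.
p = 0.2
K0, K1 = np.sqrt(1 - p) * qeye(2), np.sqrt(p) * sigmax()
rho_after = (tensor(K0, qeye(2)) * rho_joint * tensor(K0, qeye(2)).dag()
             + tensor(K1, qeye(2)) * rho_joint * tensor(K1, qeye(2)).dag())

print(rho_after.ptrace(0))   # the prior pushed through the channel
print(rho_after.ptrace(1))   # still the transposed prior: that copy was untouched
```

The same pair of marginals, prior and transposed prior, is what shows up in the “ptrace” lines of the outputs further down.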

The final result

We have these correspondences:

| Element | Classical | Quantum |
| --- | --- | --- |
| Prior | Distribution $p(x)$ | Density matrix $\gamma$ |
| Likelihood | Conditional distribution $p(y \mid x)$ | Quantum channel $\mathcal{E}$ |
| Observed information | Distribution $q(y)$ | Density matrix $\tau$ |
| Posterior | Conditional distribution $p_R(x \mid y)$ | Quantum channel $\mathcal{R}$ |

Then the “posterior channel” $\mathcal{R}$ that maximises the fidelity is derived in the paper in closed form; in general it depends on the observed state $\tau$ as well as on $\gamma$ and $\mathcal{E}$ (I’ll refer you to the paper for the explicit expression). In the case that $\tau$ and $\mathcal{E}(\gamma)$ commute (which will, for example, commonly happen if we take the prior to be uniform), it reduces to

$$\mathcal{R}(\tau) = \gamma^{1/2}\, \mathcal{E}^\dagger\!\left( \mathcal{E}(\gamma)^{-1/2}\, \tau\, \mathcal{E}(\gamma)^{-1/2} \right) \gamma^{1/2},$$

where $\mathcal{E}^\dagger$ is the adjoint of the channel. This formula is also known as the Petz transpose map, and, as a map, it is completely independent of $\tau$.
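
The Petz transpose map is simple enough to implement directly. Here is a minimal NumPy/SciPy sketch of mine (not the paper’s code) for a single-qubit bit-flip channel; with a uniform prior and a decohered observation it reproduces the purely classical update that shows up in the first output below:

```python
import numpy as np
from scipy.linalg import sqrtm

p = 0.2
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
kraus = [np.sqrt(1 - p) * I2, np.sqrt(p) * X]            # Kraus operators of the channel E

def channel(rho, ops):
    return sum(K @ rho @ K.conj().T for K in ops)        # E(rho)

def adjoint_channel(rho, ops):
    return sum(K.conj().T @ rho @ K for K in ops)        # E^dagger(rho)

def petz(tau, gamma, ops):
    """Petz transpose map of E with respect to the prior gamma, applied to tau."""
    e_gamma_inv_sqrt = np.linalg.inv(sqrtm(channel(gamma, ops)))
    return sqrtm(gamma) @ adjoint_channel(e_gamma_inv_sqrt @ tau @ e_gamma_inv_sqrt, ops) @ sqrtm(gamma)

gamma = 0.5 * I2                      # uniform (maximally mixed) prior
tau = np.diag([0.8, 0.2])             # observed, fully decohered state
print(petz(tau, gamma, kraus).real)   # -> diag(0.68, 0.32): the classical Bayesian update
```

Here the adjoint channel is just the channel with each Kraus operator daggered, which is what adjoint_channel implements.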

Crunching some numbers

Here is the code I wrote, using Python and the QuTiP library, for a quick test of this process on the simplest possible system (a single qubit subject to a probabilistic flip).

Here are a few outputs for very simple cases.

Uniform prior, decohered output

This is a purely classical case. We’re starting with a uniform, decohered prior on the qubit, and after applying a spin flip with probability 0.2 we observe a fully classical state (probabilities 0.8 and 0.2 of up and down respectively).

Starting gamma (prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j 0. +0.j]
 [0. +0.j 0.5+0.j]]

Starting tau (observed distribution on Y):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.68+0.j 0.  +0.j]
 [0.  +0.j 0.32+0.j]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.8+0.j 0. +0.j]
 [0. +0.j 0.2+0.j]]

Fidelity: 0.9486833043041707

The result is as expected from the classical Bayes’ theorem: the updated knowledge on $X$ is $(0.68, 0.32)$.
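
For reference, here is a two-line classical cross-check of the same numbers, assuming the flip probability of 0.2 used above:

```python
import numpy as np

lik = np.array([[0.8, 0.2], [0.2, 0.8]])           # p(y|x) for a flip probability of 0.2
posterior = (lik * [0.5, 0.5]).T / (lik @ [0.5, 0.5])
print(posterior @ [0.8, 0.2])                      # -> [0.68 0.32]
```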

Coherent prior, output set to the observed state

This is a case of setting a coherent prior (the qubit is in a perfect up + down superposition) and then setting the observation to exactly match the output, which should retrieve the original state.

Starting gamma (prior):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]

Purified gamma, ptrace on A_2 (prior, not operated on):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]
Purified gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j  0. +0.5j]
 [0. -0.5j 0.5+0.j ]]

Processed gamma, ptrace on A_2 (posterior):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]
Processed gamma, ptrace on A_1 (transposed prior):
[[0.5+0.j  0. +0.5j]
 [0. -0.5j 0.5+0.j ]]

Starting tau (observed distribution on Y):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]

Purified tau, ptrace on A_2 (observed distribution on Y):
[[0.5+0.j  0. -0.4j]
 [0. +0.4j 0.5+0.j ]]
Purified tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j  0. +0.4j]
 [0. -0.4j 0.5+0.j ]]

Commutator [tau, E(gamma)] = 0.0
Processed tau, ptrace on A_2 (updated knowledge on X):
[[0.5+0.j  0. -0.5j]
 [0. +0.5j 0.5+0.j ]]
Processed tau, ptrace on A_1 (observed distribution on Y, transposed):
[[0.5+0.j  0. +0.4j]
 [0. -0.4j 0.5+0.j ]]

Fidelity: 0.9000000050662574

We see that this definitely does happen: the guess about $X$ is correct. But the fidelity is not 1. This is not necessarily a contradiction. The matrices printed out here are merely partial traces, which discard the fact that, this being a quantum description, there are correlations between the two subsystems that are expressed only in the full joint density matrix. It’s those correlations, not printed out here but still encoded in the off-diagonal terms of the joint density matrix, that contribute to the imperfect fidelity. I assume this is a bit like the difference between starting with a prior that your coin is fair before observing any throws, versus observing 10 throws that fall perfectly into 5 heads and 5 tails: your estimate of the coin’s bias is not going to change, but your overall belief distribution does. But there might be something more subtle I’ve missed.

Conclusion

I don’t really have a good, impactful result to conclude this on. I wanted to put this post out quickly so that others can also look at this and potentially contribute. My impression, however, is that there are some really interesting things to attempt here. One obvious thing I might try next is quantum measurement: you can express measurement in terms of quantum channels, and “how do I do Bayesian inference back to the original state of a system given the classical outcomes of a quantum measurement” seems like an interesting question to investigate, one that might yield some insights on the way our knowledge interacts with quantum systems.

  1. ^

    Die-hard Bayesianism and an above average appreciation for obscure kabbalistic culture references.

  2. ^

    If you want to think about them in linear algebra terms, since we’re working with finite numbers of states:

    • a probability distribution is going to be a vector;

    • the likelihood and the posterior distributions are $m \times n$ and $n \times m$ matrices respectively (with $n$ the number of states of $X$ and $m$ that of $Y$);

    • the joint probability distributions are also matrices; they come about by multiplying the columns of those process matrices by the elements of the corresponding input probability distribution;

    • an output distribution is produced by using a dot product between a process (matrix) and a probability distribution on a single system (vector), resulting in a probability distribution on the other system (vector).

  3. ^

    I assume the R subscript is supposed to stand for “reversed”.

  4. ^

    They suggest multiple ones work, but I focused on the Kullback-Leibler divergence.