The “Classical derivation” made more sense to me after translating it to standard probability notation, so I’m commenting to share the “dictionary” I made for it, as well as the unexpected extra assumption I had to make.
The obvious:
$\gamma(x) = P[X = x]$
$\phi(y \mid x) = P[Y = y \mid X = x]$
$\hat{\phi}(x \mid y) = P[X = x \mid Y = y]$
It got tricky with τ. Instead of observing Y=y, we observe something else that gives us a probability distribution over Y. I considered this “something else” to be the value of some other unknown: Z=z. The probability distribution over y is a conditional distribution:
$\tau(y) = P[Y = y \mid Z = z]$
Hate to have z on only one side like that… maybe I should have called it $\tau_z$… but I’ll leave it as is.
Then,
$\gamma'(x) = \sum_j P[X = x \mid Y = y_j] \, P[Y = y_j \mid Z = z]$
Not quite the right formula for a simple interpretation of γ′… if only
$P[X = x \mid Y = y_j] = P[X = x \mid Y = y_j, Z = z]$
This is conditional independence, which could be represented with this Bayes net:
$Z \to Y \to X$
Then, we have
$\gamma'(x) = P[X = x \mid Z = z]$
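To spell out that step: expand $P[X = x \mid Z = z]$ over $Y$ with the law of total probability, then drop $Z$ from the first factor using the conditional independence above:

$P[X = x \mid Z = z] = \sum_j P[X = x \mid Y = y_j, Z = z] \, P[Y = y_j \mid Z = z] = \sum_j \hat{\phi}(x \mid y_j) \, \tau(y_j) = \gamma'(x)$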
That completes the dictionary.
So to do what feels like ordinary probability theory, I had to introduce this extra unknown Z so that we have something to observe, and then to assume that Z only provides information about Y (and indirectly about X, through Y).
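For concreteness, here’s a toy numeric version of the update in Python; every number (the prior, the likelihoods, and the soft evidence τ) is made up purely for illustration:

```python
import numpy as np

# Toy Jeffrey update: gamma'(x) = sum_j phi_hat(x | y_j) * tau(y_j).
gamma = np.array([0.5, 0.5])         # gamma(x) = P[X = x], invented prior
phi = np.array([[0.9, 0.1],          # phi[x, y] = P[Y = y | X = x],
                [0.3, 0.7]])         # invented likelihoods

# Bayes' rule gives the "inverse" distribution phi_hat(x | y).
joint = gamma[:, None] * phi         # P[X = x, Y = y]
phi_hat = joint / joint.sum(axis=0)  # P[X = x | Y = y], columns sum to 1

# Soft evidence: a distribution over Y instead of an observed value.
tau = np.array([0.8, 0.2])           # tau(y) = P[Y = y | Z = z]

gamma_prime = phi_hat @ tau          # mix the posteriors by tau
print(gamma_prime)                   # [0.625 0.375], still sums to 1
```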
The way you described τ as some probability distribution resulting from an observation, but not a conditional distribution, is in philosophy called Jeffrey conditionalization. The Stanford Encyclopedia of Philosophy gives this example:
A gambler is very confident that a certain racehorse, called Mudrunner, performs exceptionally well on muddy courses. A look at the extremely cloudy sky has an immediate effect on this gambler’s opinion: an increase in her credence in the proposition (muddy) that the course will be muddy—an increase without reaching certainty. Then this gambler raises her credence in the hypothesis (win) that Mudrunner will win the race, but nothing becomes fully certain. (Jeffrey 1965 [1983: sec. 11.3])
The idea is, we go from one probability distribution over {muddy, ¬muddy} to another, without becoming certain of anything. My introduction of Z corresponds to introducing an unknown representing the status of the sky. I would say we are conditioning on Z = cloudy.
I recalled vaguely that Jaynes discussed Jeffrey conditionalization in Probability Theory, and criticized it for holding only in a special case. I took a look, and sure enough, it’s in section 5.6, and he’s pointing out exactly what I did, right down to the arrows, though he calls it a “logic flow diagram” rather than identifying it as a Pearl-style Bayes net.
I don’t think you necessarily need a Z though. My interpretation of that step was “suppose we know X is a constant but hidden reality, and Y is observable. Then we perform N experiments and measure the resulting Y, and thus characterise a probability distribution of it. How does that inform our guess on X?” And yeah, that could be mediated by a third variable, but it doesn’t need to be. If X is “the coin is fair, or the coin is loaded to land on heads 75% of the time” and Y is “the result of a coin toss”, you get a better (lower entropy) belief distribution on X after several tosses.
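A quick sketch of that in Python, with an invented toss sequence (the 0.75 loading is the only number carried over from the example):

```python
import numpy as np

# X: the coin is fair (P(heads) = 0.5) or loaded (P(heads) = 0.75).
# Y_i: individual tosses. Update the belief over X toss by toss and
# track its entropy as the evidence accumulates.
p_heads = np.array([0.5, 0.75])   # [fair, loaded]
belief = np.array([0.5, 0.5])     # prior over X

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

tosses = [1, 1, 0, 1, 1, 1, 0, 1]  # 1 = heads, 0 = tails; invented data
for y in tosses:
    likelihood = p_heads if y == 1 else 1 - p_heads
    belief = belief * likelihood / np.sum(belief * likelihood)
    print(belief, entropy(belief))
```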
Thanks for this by the way, I used the paper’s notation but agree it was a bit confusing so this probably helps people!
Yeah… well, I thought of the Z because it sounds like we’re getting the probabilities of Y from some experiment. So Z = z is the result of the experiment, which in this case is a vector of frequencies. When I put it like that, it sounds like it’s just a rhetorical device for saying that we have given probabilities of Y.
But I still seem to need Z for my dictionary. I have $\gamma(x) = P[X = x]$. What is $\gamma'(x)$? It is some kind of updated probability of X = x, right? Like we went from one probability to the other by doing an experiment. If I didn’t write $\gamma'(x) = P[X = x \mid Z = z]$, I’d need something like $\gamma(x) = P_1[X = x]$ and $\gamma'(x) = P_2[X = x]$.
Reading again, it seems like this is exactly Jeffrey conditionalization. So whether you include some extra variable just depends on what you think of Jeffrey conditionalization.
I feel like I’m missing something, though, about what this experiment is and means. For example, I’m not totally clear on whether we have one state X and a collection of replicates of state Y, or a collection of replicates of (X, Y) pairs.
Looking at the paper, I see the connection to Jeffrey conditionalization is made explicitly. And it mentions Pearl’s “virtual evidence method”; is this what he calls introducing this Z? But no clarity on exactly what this experiment is. It just says:
But how should the above be generalized to the situation where the new information does not come in the form of a definite value $y_0$ for Y, but as “soft evidence,” i.e., a probability distribution $\tau(y)$?
By the way, regarding your coin toss example, I can at least say how this is handled in Bayesian statistics. There are separate random variables for each coin toss: $Y_1$ is the first, $Y_2$ is the second, etc. If you have n coin tosses, then your sample is a vector $\vec{Y}$ containing $Y_1$ to $Y_n$. Then the posterior probability is $P[\text{loaded} \mid \vec{Y} = \vec{y}]$. This will be covered in any Bayesian statistics textbook as “the Bernoulli model”. My class used Hoff’s book, which provides a quick start.
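Concretely, with the 75%-heads loading from your example: if $\vec{y}$ contains $k$ heads out of $n$ tosses, the posterior works out to

$P[\text{loaded} \mid \vec{Y} = \vec{y}] = \dfrac{0.75^k \, 0.25^{n-k} \, P[\text{loaded}]}{0.75^k \, 0.25^{n-k} \, P[\text{loaded}] + 0.5^n \, P[\text{fair}]}$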
I guess this example suggests a single unknown X (whether the coin is loaded or not) and replicates of Y.
Yes, I’m aware of the Bernoulli model; my point is that the vector $\vec{Y}$ is itself the outcome of that experiment. I suppose you can call it Z, though that makes the notation a bit confusing. The general point is that, yes, you update your belief about X based on a series of outcomes of Y. In fact I think in general $\gamma'(x) = P[X = x \mid \vec{Y}]$ works just fine.