Karma: 4
• No worries :) Thanks a lot for your help! Much appreciated.

It’s amazing how complex a simple coin flipping problem can get when we approach it from our paradigm of objective Bayesianism. Professor Jaynes remarks on this after deriving the principle of indifference: “At this point, depending on your personality and background in this subject, you will be either greatly impressed or greatly disappointed by the result (2.91).”—page 40

A frequentist would have “solved” this problem rather easily. Personally, I would trade simplicity for coherence any day of the week...

• Think I have finally got it. I would like to thank you once again for all your help; I really appreciate it.

This is what I think “estimating the probability” means:

We define theta to be a real-world, objective, physical parameter/​quantity s.t. P(H|theta=alpha) = alpha & P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not care what it is. I don’t think it is appropriate to say that theta is “frequency” for this reason:

1. “frequency” is not a well-defined physical quantity. You can’t measure “frequency” like you measure temperature.

Using the above definitions, we can compute the likelihood, then the posterior, and then the posterior predictive, which represents the probability of heads given data from previous flips.
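A minimal sketch of that chain (likelihood, then posterior, then posterior predictive), assuming a flat prior over theta and a discretised grid of theta values. The function names and the grid size are illustrative choices, not something taken from Professor Jaynes' book.

```python
def posterior_on_grid(heads, tails, grid_size=1001):
    """Return (grid, posterior weights) for theta, assuming a flat prior."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Likelihood of the observed flips given theta = alpha:
    # P(D | theta=alpha) = alpha^heads * (1 - alpha)^tails
    likelihood = [t**heads * (1 - t)**tails for t in grid]
    total = sum(likelihood)
    # With a flat prior the posterior is just the normalised likelihood.
    posterior = [w / total for w in likelihood]
    return grid, posterior


def posterior_predictive_heads(heads, tails, grid_size=1001):
    """P(H | D) = sum over theta of P(H | theta) * P(theta | D)."""
    grid, posterior = posterior_on_grid(heads, tails, grid_size)
    return sum(t * w for t, w in zip(grid, posterior))


def rule_of_succession(heads, tails):
    """Closed form of the same quantity for a flat prior on [0, 1]."""
    return (heads + 1) / (heads + tails + 2)
```

For example, `posterior_predictive_heads(7, 3)` should land very close to `rule_of_succession(7, 3)`, with the small gap coming only from the grid approximation.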

Is the above accurate?

So Bayesians who say that theta is the probability of heads and compute a point estimate of the parameter theta and say that they have “estimated the probability” are just frequentists in disguise?

• I’m afraid I have to disagree. I do sometimes regret not focusing more on applied Bayesian inference. (In fact, I have no idea what WAIC or HMC is.) But in my defence, I am an amateur philosopher & logician and I couldn’t help finding more non-sequiturs in statistics textbooks than plot-holes in Tolkien novels. Perhaps if I had been more naive and less critical (no offence to anyone) when I read those books, I would have “progressed” faster. I had lost hope in understanding statistics before I read Professor Jaynes’ book; that’s why I respect the man so much. Now I have the intuition but I am still trying to reconcile it with what I read in the applied literature. I also sometimes find it frustrating that I am worrying about philosophical nuances and intricacies while others are applying their (perhaps less coherent) understanding of statistics to solve problems but I guess it is worth it :)

• I believe it is the same thing. A uniform prior means your prior is a constant function, i.e. P(A_p|I) = x where x is a real number with the usual caveats. So if you have a uniform prior, you can drop it (from a safe height of course). But perhaps the more seasoned Bayesians disagree? (where are they when you need them)
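A small numerical check of that claim, as a sketch: on a coarse grid of p values, a constant prior cancels on normalisation, while a non-constant one does not. The data (two heads and one tail) and both priors are hypothetical, chosen only for illustration.

```python
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def posterior(prior, likelihood):
    # Bayes' theorem on a grid: posterior is proportional to prior * likelihood.
    return normalize([pr * lk for pr, lk in zip(prior, likelihood)])


grid = [i / 10 for i in range(11)]
# Hypothetical data: H, H, T, so P(D | A_p, I) = p^2 * (1 - p).
likelihood = [p**2 * (1 - p) for p in grid]

flat_prior = [0.3] * len(grid)   # any constant works; it cancels on normalisation
tilted_prior = list(grid)        # a non-constant prior, for contrast

with_flat = posterior(flat_prior, likelihood)
likelihood_only = normalize(likelihood)
with_tilted = posterior(tilted_prior, likelihood)
```

Here `with_flat` matches `likelihood_only` up to floating-point error, whereas `with_tilted` does not, so dropping the prior is only safe when it really is constant.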

• You are right; dropping priors in the A_p distribution is probably not a general rule. Perhaps the propositions don’t always need to be interpretable for us to be able to impose priors? For example, people impose priors over the parameter space of a neural network, which is certainly not interpretable. But the topic of Bayesian neural networks is beyond me.

• To calculate the posterior predictive you need to calculate the posterior, and to calculate the posterior you need to calculate the likelihood (in most problems). For the coin flipping example, what is the probability of heads and what is the probability of tails given that the frequency is equal to some value theta? You might accuse me of being completely devoid of intuition for asking this question but please bear with me...

Sounds good. I thought nobody was interested in reading Professor Jaynes’ book anymore. It’s a shame more people don’t know about him

• “[…] A_p the distribution over how often the coin will come up heads […]” - I understood A_p to be a sort of distribution over models; we do not know/talk about the model itself but we know that if a model A_p is true, then the probability of heads is equal to p by definition of A_p. Perhaps the model A_p is the proposition “the centre of mass of the coin is at p” or “the bias-weighting of the coin is p” but we do not care as long as the resulting probability of heads is p. So how can the prior not be indifferent when we do not know the nature of each proposition A_p in a set of mutually exclusive and exhaustive propositions?

• I dropped the prior for two reasons:

1. I assumed the background information to be indifferent to the A_p’s

2. We do not explicitly talk about the nature of the A_p’s. Prof. Jaynes defines it as a proposition such that P(A|A_p, E) = p. In my example A_p is defined as a proposition such that P(H|A_p, I) = p. No matter what prior information we have, it is going to be indifferent to the A_p’s by virtue of the fact that we don’t know what A_p represents

Is this justification valid?

• Response to point one: I do find that to be satisfactory from a philosophical perspective but only because theta refers to a real-world property called frequency and not the probability of heads. My question to you is this: if you have a point estimate of theta or if you find the exact real-world value of theta (perhaps by measuring it with an ACME frequency-o-meter), what does it tell you about the probability of heads?

Response to point two: The honour is mine :) If you ever create a study group or discord server for the book, then please count me in

• Thank you so much for telling me about A_p distribution! This is exactly what I have been looking for.

“Pending a better understanding of what that means, let us adopt a cautious notation that will avoid giving possibly wrong impressions. We are not claiming that P(Ap|E) is a ‘real probability’ in the sense that we have been using that term; it is only a number which is to obey the mathematical rules of probability theory. Perhaps its proper conceptual meaning will be clearer after getting a little experience using it. So let us refrain from using the prefix symbol p; to emphasize its more abstract nature, let us use the bare bracket symbol notation (Ap|E) to denote such quantities, and call it simply ‘the density for Ap, given E’.”—Page 554 of Professor Jaynes’ book

The idea of the A_p distribution not being a real probability distribution but obeying the mathematical rules of probability theory is far too nuanced and intricate for me to be able to understand.

I was reading an article on this site about the A_p distribution, Probability, knowledge, and meta-probability, and a commenter wrote:

“I think a much better approach is to assign models to the problem (e.g. “it’s a box that has 100 holes, 45 open and 65 plugged, the machine picks one hole, you get 2 coins if the hole is open and nothing if it’s plugged.”), and then have a probability distribution over models. This is better because keeps probabilities assigned to facts about the world.

It’s true that probabilities-of-probabilities are just an abstraction of this (when used correctly), but I’ve found that people get confused really fast if you ask them to think in terms of probabilities-of-probabilities. (See every confused discussion of “what’s the standard deviation of the standard deviation?”)”

I would appreciate your thoughts on this. My current understanding of A_p distributions in light of this comment and in the context of coin flipping is this:

$A_p$ is defined to be a proposition such that $P(H|A_p, I) = p$ and $P(T|A_p, I) = 1 - p$, where $H$ & $T$ represent heads & tails and $I$ represents the background information. This is similar to the definition Professor Jaynes gives on page 554 of his book.

Let $D$, the data, be […].

Using this definition, the posterior is $P(A_p|D, I) = \frac{P(D|A_p, I)\,P(A_p|I)}{P(D|I)}$.

Assuming the background information is indifferent to the $A_p$’s:

$P(A_p|D, I) \propto P(D|A_p, I)$

Therefore in the set of propositions $\{A_p\}$, the most plausible proposition given our data is $A_1$. Each member of this set of propositions is called a model. The probability of heads given the most plausible model is 1.0
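To make the last step concrete, here is a rough sketch on a grid of p values. It assumes the data is three heads in a row (a hypothetical choice for illustration) and a flat prior over the A_p. It reports both the most plausible p and the posterior predictive probability of heads, which are not the same number.

```python
def analyse_all_heads(n_heads, grid_size=101):
    """Return (most plausible p, P(H | D, I)) for n_heads heads and no tails."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Flat prior over the A_p, so the posterior is proportional to
    # the likelihood P(D | A_p, I) = p^n_heads.
    weights = [p**n_heads for p in grid]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Most plausible proposition A_p: the grid point with the largest posterior.
    best_index = max(range(grid_size), key=lambda i: posterior[i])
    # Posterior predictive: average p over the whole posterior, not just its peak.
    predictive = sum(p * w for p, w in zip(grid, posterior))
    return grid[best_index], predictive


p_star, p_heads_next = analyse_all_heads(3)
```

In this sketch `p_star` comes out as 1.0, matching the argument above, while `p_heads_next` stays below 1. In the continuous case with a flat prior it would be (n + 1) / (n + 2), i.e. 0.8 for three heads, and the coarse grid only approximates that.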

Is this a correct understanding?

• I am very grateful for your answer but I have a few contentions from my paradigm of objective Bayesianism

1. You have replaced probability with a physical property: “frequency”. I have also seen other people use terms like bias-weighting, fairness, center of mass, etc. which are all properties of the coin, to sidestep this question. I have nothing against theta being a physical property such that P(heads|theta=alpha) = alpha. In fact, it would make a ton of sense to me if this actually were the case. But the issue is when people say that theta is a probability and treat it as if it was a physical property. I presume you don’t view probabilities to be physical properties. Even subjective Bayesians are not that evil...

2. “if Janes does not have access to the data that formed his prior or cannot explain it well, then what he believes about the coin and what the alien believes about the coin are both ‘rational’, as it is the posterior from their personal priors and the shared data.” If Professor Jaynes did not have access to the data that formed his prior, his prior would have been the same as the alien’s and they would have ended up with the same posterior. There is no such thing as a “personal prior”. I invite you to the light side: read Professor Jaynes’ book; it is absolutely brilliant

• So you are saying that “we” are uncertain about the degree of belief/plausibility that our brain is going to assign? Then who are “we” exactly? Apologies for being glib but I really don’t understand

Also, it is a crime to have different priors given the same information according to us objective Bayesians so that can’t be the issue

• These subjective Bayesians… :) I feel the same way about that statement. Could you please elaborate?

• What is the theoretical justification behind taking the mean? Argmax feels more intuitive for me because it is literally “the most plausible value of theta”. In either case, whether we use argmax or mean, can we prove that it is equal to P(H|D)?
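For reference, one standard way to connect a summary of the posterior to P(H|D), sketched under the assumption that P(H|theta, D) = theta (the definition of theta used elsewhere in this thread) and that the posterior over theta is a proper density on [0, 1]:

```latex
\begin{aligned}
P(H \mid D) &= \int_0^1 P(H \mid \theta, D)\, p(\theta \mid D)\, d\theta \\
            &= \int_0^1 \theta \, p(\theta \mid D)\, d\theta \\
            &= \mathbb{E}[\theta \mid D]
\end{aligned}
```

Under these assumptions it is the posterior mean, not the argmax, that equals P(H|D). With a flat prior and h heads in n flips, the mean is (h + 1) / (n + 2) while the argmax is h / n, so the two only agree in special cases.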

• I believe, mathematically, your claim can be expressed as:

$P(H|D) = \arg\max_\theta P(\theta|D)$

where $\theta$ is the “probability” parameter of the Bernoulli distribution, $H$ represents the proposition that heads occurs, and $D$ represents our data. The left side of this equation is the plausibility based on knowledge and the right side is Professor Jaynes’ ‘estimate of the probability’. How can we prove this statement?

Edit:

Latex is being a nuisance as usual :) The right side of the equation is the argmax with respect to theta of P(theta | data)

• Appreciate your reply. I think the source of my confusion is there being uncertainty in the degree of plausibility that we assign given our knowledge or there being uncertainty in our degree of belief given our knowledge. This feels a bit unnatural to me because this quantity is not an external/physical and unknown quantity but one that we assign given our knowledge. If we were to think of probabilities as physical properties that are unknown, then it makes sense to me that there can be uncertainty in their value. How would you reconcile this?

# [Question] Jaynesian interpretation - How does “estimating probabilities” make sense?

21 Jul 2021 21:36 UTC
2 points