This is a very thought-provoking post. As far as I understand, it is an attempt at finding a unified theory of entropy.
I admit I am still somewhat confused about this topic. This is partly because of my insufficient grasp of the material in this post, but, I think, also partly because you haven’t yet gone quite far enough with your unification.
One point is the thinking in terms of “states”. A macrostate is said to be a set of microstates. As far as I understand, the important thing here is that all microstates are presumed to be mutually exclusive, such that for any two microstates x and y the probability of their conjunction is zero, i.e. p(x∧y) = 0. But that does not seem to be a necessary restriction in general in order to speak about the entropy of something. So it seems better to take as the bearers of probability a Boolean algebra of propositions (or “events”, as they are called in Boolean algebras of sets), just as in probability theory. Propositions are logical combinations of other propositions, e.g. disjunctions, conjunctions, and negations. A macrostate is then just a disjunction of microstates, where the microstates are mutually exclusive. But the distinction between micro- and macrostates is no longer necessary, since with a Boolean algebra we are not confined to mutually exclusive disjunctions.
Moreover, a microstate may be considered a macrostate itself, if it can itself be represented as a disjunction of other microstates. For example, the macrostate “The die comes up even” consists of a disjunction of the microstates “The die comes up 2”, “The die comes up 4”, “The die comes up 6”. But “The die comes up 2” consists of a disjunction of many different orientations in which the die may land on the table. So the distinction between macrostates and microstates apparently cannot be upheld in an absolute way anyway, only in a relative way, which again means we have to treat the entropy of all propositions in a unified manner.
[We may perhaps think of fundamental “microstates” as (descriptions of) “possible worlds”, or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible world is the actual world), every proposition can be seen as a disjunction of such possible worlds: the worlds in which the proposition is true. But arguably this is not necessarily what we have in mind with “microstates”.]
So what should a theory of entropy do with a Boolean algebra of propositions? Plausibly it should define which mathematical properties the entropy function S over a Boolean algebra has to satisfy—just like the three axioms of probability theory define what mathematical properties a probability function p (a “probability distribution”) has to satisfy. More precisely, entropy itself makes use of a probability function, so entropy theory has to be an extension of probability theory. We need to add axioms—or definitions, if we treat entropy as reducible to probability.
One way to begin is to add an axiom for mutually exclusive disjunction, i.e. for S(A∨B) where p(A∧B) = 0, for two propositions A and B. You already suggested something related to this: a formula for the expected entropy of such a disjunction. The expected entropy of a proposition is simply its entropy multiplied by its probability, and the expected entropy of a mutually exclusive disjunction is the sum of the expected entropies of the disjuncts. So:
E[S(A∨B)] = E[S(A)] + E[S(B)], if p(A∧B) = 0.
This is equivalent to
p(A∨B) S(A∨B) = p(A) S(A) + p(B) S(B), if p(A∧B) = 0.
In order to get the entropy (instead of the expected entropy) we just have to divide by p(A∨B):
S(A∨B) = (p(A) S(A) + p(B) S(B)) / p(A∨B), if p(A∧B) = 0.
Which, because of mutual exclusivity, is equivalent to
S(A∨B) = (p(A) S(A) + p(B) S(B)) / (p(A) + p(B)), if p(A∧B) = 0.
A quantity of this form is called a weighted average. So expected value and weighted average are not generally the same. They are only the same when p(A∨B)=1, which need not be the case. For example, not all “macrostates” have probability one. All we can in general say about “macrostates” and “microstates” is that the entropy of a macrostate is the weighted average of the entropies of its microstates, while the expected entropy of a macrostate is the sum of the expected entropies of its microstates.
Another way to write the weighted average is in terms of conditional probability instead of as a fraction:
S(A∨B) = p(A|A∨B) S(A) + p(B|A∨B) S(B), if p(A∧B) = 0.
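To see the weighted-average property concretely, here is a minimal Python sketch (the loaded-die distribution and the representation of propositions as sets of microstates are my own illustrative choices, not anything from the post). It defines S(A) as the conditional expectation of −log p, anticipating a definition discussed further down the thread, and checks the disjunction formula for two mutually exclusive propositions:

```python
import math

# Hypothetical loaded die: microstates 1..6 with non-uniform probabilities.
p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.2, 5: 0.2, 6: 0.3}

def prob(A):
    """Probability of proposition A, represented as a set of microstates."""
    return sum(p[x] for x in A)

def S(A):
    """Entropy of proposition A: expectation of -log2 p(x) conditional on A."""
    return sum(p[x] * -math.log2(p[x]) for x in A) / prob(A)

A = {2, 4, 6}  # "the die comes up even"
B = {1}        # "the die comes up 1", mutually exclusive with A

lhs = S(A | B)
rhs = (prob(A) * S(A) + prob(B) * S(B)) / (prob(A) + prob(B))
print(lhs, rhs)  # both values agree (≈ 2.36 bits): the weighted-average property holds
```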
There is an interesting correspondence to utility theory here: The entropy of a proposition being the weighted average entropy of its mutually exclusive disjuncts is analogous to the utility of a proposition being the weighted average utility of its mutually exclusive disjuncts. In Richard Jeffrey’s utility theory the weighted average formula is called “desirability axiom”. Another analogy is apparently the role of tautologies: As in utility theory, it makes sense to add an axiom stating that the entropy of a tautology is zero, i.e.
S(⊤) = 0.
In fact, those two axioms (together with the axioms of probability theory) are sufficient for utility theory. But they allow for negative values. Negative utilities make sense in utility theory, but entropy is strictly non-negative. We could add an axiom for that, corresponding to the non-negativity axiom in probability theory:
∀A: S(A) ≥ 0.
But I’m still not sure this would fully characterize entropy. After all, you also propose that the entropy of a microstate is equivalent to its information content / self-information / surprisal, which is defined purely in terms of probability. But since it seems we have to give up the microstate/macrostate distinction in favor of a Boolean algebra of propositions, this would have to apply to all propositions. I.e.
S(A) = log(1/p(A)).
This would require that
S(A∨B) = (p(A) S(A) + p(B) S(B)) / (p(A) + p(B)) = log(1/(p(A) + p(B))), if p(A∧B) = 0.
I’m not sure under which conditions this would be true, but it seems to be a pretty restrictive assumption. You said that information content is minimal (expected?) entropy. It is not clear why we should assume that entropy is always minimal.
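For what it's worth, a quick numeric probe of that condition (Python; the sampled probabilities are arbitrary): whenever both disjuncts have nonzero probability, log(1/p(A)) and log(1/p(B)) each exceed log(1/(p(A)+p(B))), so their weighted average exceeds it too. If that's right, the equation can only hold in degenerate cases where one disjunct has probability zero.

```python
import math
import random

def weighted_average_surprisal(pa, pb):
    """Left-hand side: weighted average of log(1/p(A)) and log(1/p(B))."""
    return (pa * math.log2(1 / pa) + pb * math.log2(1 / pb)) / (pa + pb)

def surprisal_of_disjunction(pa, pb):
    """Right-hand side: log(1/(p(A)+p(B)))."""
    return math.log2(1 / (pa + pb))

random.seed(0)
for _ in range(5):
    pa, pb = random.uniform(0.01, 0.5), random.uniform(0.01, 0.5)
    print(round(weighted_average_surprisal(pa, pb), 3),
          ">", round(surprisal_of_disjunction(pa, pb), 3))
# In every sampled case the weighted average is strictly larger.
```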
It seems to me we are not yet quite there with our unified entropy theory...
I think macrostates are really a restricted kind of probability distribution, not a kind of proposition. But they’re the kind of distribution p_A that’s uniform over a particular disjunction A of microstates (and zero elsewhere), and I think people often equivocate between the disjunction A and the distribution p_A.
[EDIT: “macrostate” is a confusing term, my goal here is really to distinguish between A and p_A, whatever you want to call them]
In general, though, I think macrostates aren’t fundamental and you should just think about distributions if you’re comfortable with them.
I think microstates should indeed be considered completely-specified possible worlds, from this perspective.
“Average entropy” / “entropy of a macrostate” in OP (“entropy” in standard usage) is a function from probability distributions to reals. Shannon came up with an elegant set of axioms for this function, which I don’t remember offhand, but which uniquely pins down the expectation of -log(p(microstate)) as the entropy function (up to a constant factor).
“Entropy of a microstate” in OP (“surprisal” in information theory, no standard name otherwise AFAIK) is a function from probability distributions to random variables, which is equal to -log(p(microstate)).
So I guess I’m not sure propositions play that much of a role in the usual definition of entropy.
On the other hand, if we do extend entropy to arbitrary propositions A, it probably does make sense to define it as the conditional expectation S(A) = E[-log p | A], as you did.
Then “average entropy”/”entropy” of a macrostate p_A is S(True) under the distribution p_A, and “entropy”/”surprisal” of a microstate B (in the macrostate p_A) is S(B) under the distribution p_A.
By a slight coincidence, S(True) = S(A) under p_A, but S(True) is the thing that generalizes to give entropy of an arbitrary distribution.
I’ve never seen an exploration of what happens if you apply this S() to anything other than individual microstates or True, so I’m pretty curious to see what you come up with if you keep thinking about this!
This is more complicated if there’s a continuous space of “microstates”, of course. In that case the sigma algebra isn’t built up from disjunctions of individual events/microstates, and you need to define entropy relative to some density function as the expectation of -log(density). It’s possible there’s a nice way to think about that in terms of propositions.
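A rough numerical sketch of that continuous case (Python; the Gaussian is just an illustrative density of my choosing): differential entropy as the expectation of −log(density), approximated by a Riemann sum and compared to the known closed-form value for a normal distribution.

```python
import numpy as np

# Differential entropy of a Gaussian: E[-log f(X)], with f the density.
sigma = 2.0
xs = np.linspace(-20, 20, 200_001)
dx = xs[1] - xs[0]
f = np.exp(-xs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

numeric = np.sum(-f * np.log(f)) * dx             # expectation of -log(density)
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(numeric, closed_form)                       # both ≈ 2.112 nats
```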
On the other hand, if we do extend entropy to arbitrary propositions A, it probably does make sense to define it as the conditional expectation S(A) = E[-log p | A], as you did.
Then “average entropy”/“entropy” of a macrostate p_A is S(True) under the distribution p_A, and “entropy”/“surprisal” of a microstate B (in the macrostate p_A) is S(B) under the distribution p_A.
By a slight coincidence, S(True) = S(A) under p_A, but S(True) is the thing that generalizes to give entropy of an arbitrary distribution.
Could you clarify this part? I think I don’t understand your notation here.
I think I was a little confused about your comment and leapt to one possible definition of S() which doesn’t satisfy all the desiderata you had. Also, I don’t like my definition anymore, anyway.
Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.
First things first:
We may perhaps think of fundamental “microstates” as (descriptions of) “possible worlds”, or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible world is the actual world), every proposition can be seen as a disjunction of such possible worlds: the worlds in which the proposition is true.
I think this is indeed how we should think of “microstates”. (I don’t want to use the word “macrostate” at all, at this point.)
I was thinking of something like: given a probability distribution p and a proposition A, define
“S(A) under p” = [ ∑_{x∈A} p(x)(−log p(x)) ] / [ ∑_{x∈A} p(x) ]
where the sums are over all microstates x in A. Note that the denominator is equal to p(A).
I also wrote this as S(A) = expectation of (−log p(x)) conditional on A, or S(A) = E[(−log p)|A], but I don’t think it was clear in my previous comment that “log p” meant “log p(x) for a microstate x”.
I also defined a notation p_A to represent the probability distribution that assigns probability 1/|A| to each x in A and 0 to each x not in A.
I used T to mean a tautology (in this context: the full set of microstates).
Then I pointed out a couple consequences:
Typically, when people talk about the “entropy of a macrostate A”, they mean something equal to log|A|. Conceptually, this is based on the calculation ∑_{x∈A} (1/|A|)(−log(1/|A|)), which is the same as either “S(A) under p_A” (in my goofy notation) or “S(T) under p_A”, but I was claiming that you should think of it as the latter.
The (Shannon/Gibbs) entropy of p, for a distribution p, is equal to “S(T) under p” in this notation.
Finally, for a microstate x in any distribution p, we get that “S({x}) under p” is equal to -log p(x).
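A small Python sketch of these three consequences (the four-microstate space and the particular distributions are arbitrary examples of mine):

```python
import math

def S(A, p):
    """'S(A) under p': expectation of -log2 p(x) conditional on A.
    Microstates with p(x) = 0 are skipped (the usual 0*log(0) = 0 convention)."""
    pA = sum(p[x] for x in A)
    return sum(p[x] * -math.log2(p[x]) for x in A if p[x] > 0) / pA

microstates = ["a", "b", "c", "d"]
T = set(microstates)                                   # tautology: all microstates
A = {"a", "b"}                                         # some proposition
p_A = {x: (1 / len(A) if x in A else 0) for x in microstates}   # uniform on A
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # some other distribution

print(S(T, p_A), math.log2(len(A)))    # "entropy of macrostate A" = log|A| = 1 bit
print(S(T, p))                         # Shannon/Gibbs entropy of p = 1.75 bits
print(S({"a"}, p), -math.log2(p["a"])) # surprisal of microstate a under p = 1 bit
```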
All of this satisfied my goals of including the most prominent concepts in Alex’s post:
log |A| for a macrostate A
Shannon/Gibbs entropy of a distribution p
-log p(x) for a microstate x
And a couple other goals:
Generalizing the Shannon/Gibbs entropy, which is S(p) = E_{x∼p}[−log p(x)], in a natural way to incorporate a proposition A (by making the expectation into a conditional expectation)
Not doing too much violence to the usual meaning of “entropy of macrostate A” or “the entropy of p” in the process
But it did so at the cost of:
making “the entropy of macrostate A” and “S(A) under p” two different things
contradicting standard terminology and notation anyway
reinforcing the dependence on microstates and the probabilities of microstates, contrary to what you wanted to do
So I would probably just ignore it and do your own thing.
Okay, I understand. The problem with fundamental microstates is that they only really make sense if they are possible worlds, and possible worlds bring their own problems.
One is: we can gesture at them, but we can’t grasp them. They are too big; they each describe a whole world. We can grasp the proposition that snow is white, but not the equivalent disjunction of all the possible worlds where snow is white. So we can’t use them for anything psychological like subjective Bayesianism. But maybe that’s not your goal anyway.
A more general problem is that there are infinitely many possible worlds. There are even infinitely many where snow is white. This means it is unclear how we should define a uniform probability distribution over them. Naively, if 1/∞ is 0, their probabilities do not sum to 1, and if it is larger than 0, they sum to infinity. Either option would violate the probability axioms.
Warning: long and possibly unhelpful tangent ahead
Wittgenstein’s solution for this and other problems (in the Tractatus) was to ignore possible worlds and instead regard “atomic propositions” as basic. Each proposition is assumed to be equivalent to a finite logical combination of such atomic propositions, where logical combination means propositional logic (i.e. with connectives like not, and, or, but without quantifiers). Then the a priori probability of a proposition is defined as the number of rows in its truth table where the proposition is true divided by the total number of rows. For example, for a and b atomic, the proposition a∨b has probability 3/4, while a∧b has probability 1/4: the disjunction has three out of four possible truth-makers - (true, true), (true, false), (false, true) - while the conjunction has only one - (true, true).
This definition in terms of the ratio of true rows in the “atomicized” truth-table is equivalent to the assumption that all atomic propositions have probability 1/2 and that they are all probabilistically independent.
Wittgenstein did not do it, but we can then also define a measure of information content (or surprisal, or entropy, or whatever we want to call it) of propositions, in the following way:
Each atomic proposition has information content 1.
The information content of a conjunction of distinct atomic propositions is the sum of their information contents (i.e. information content is additive over such conjunctions).
The information content of a tautology is 0.
So for a conjunction of n distinct atomic propositions, the information content of that conjunction is n (1 + 1 + 1 + ... = n), while its probability is 2^−n (1/2 × 1/2 × 1/2 × ... = 2^−n). Generalizing this to arbitrary (i.e. possibly non-atomic) propositions A, the relation between probability p and information content i is
2^−i(A) = p(A)
or, equivalently,
i(A) = −log₂ p(A).
Now that formula sure looks familiar!
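To make the truth-table construction concrete, here is a small Python sketch (the helper names and the two-atom examples are my own): it computes the a priori probability of a propositional formula as the fraction of rows of the atomic truth table on which it is true, and then reads off the information content as −log₂ of that probability.

```python
from itertools import product
import math

def a_priori_probability(formula, atoms):
    """Fraction of truth-table rows (over the given atoms) on which the formula is true."""
    rows = list(product([True, False], repeat=len(atoms)))
    true_rows = sum(1 for row in rows if formula(dict(zip(atoms, row))))
    return true_rows / len(rows)

atoms = ["a", "b"]
disjunction = lambda v: v["a"] or v["b"]
conjunction = lambda v: v["a"] and v["b"]

p_or = a_priori_probability(disjunction, atoms)   # 3/4
p_and = a_priori_probability(conjunction, atoms)  # 1/4
print(p_or, p_and)
print(-math.log2(p_and))  # i(a∧b) = 2 bits, one per atomic conjunct
print(-math.log2(p_or))   # i(a∨b) = log2(4/3) ≈ 0.415 bits
```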
The advantage of Wittgenstein’s approach is that we can assign an a priori probability distribution to propositions without having to assume a uniform probability distribution over possible worlds. It is assumed that each proposition is only a finite logical combination of atomic propositions, which would avoid problems with infinity. The same thing holds for information content (or “entropy” if you will).
Problem is … it is unclear what atomic propositions are. Wittgenstein did believe in them, and so did Bertrand Russell, but Wittgenstein eventually gave up the idea. To be clear, propositions expressed by sentences like “Snow is white” are not atomic in Wittgenstein’s sense. “Snow is white” is not probabilistically independent of “Snow is green”, and it doesn’t necessarily seem to have a priori probability 1/2. Moreover, the restriction to propositional logic is problematic. If we assume quantifiers, Wittgenstein suggested that we interpret the universal quantifier “all” as a possibly infinite conjunction of atomic propositions, and the existential quantifier “some” as a possibly infinite disjunction of atomic propositions. But that leads again to problems with infinity. It would always give the former probability 0 and the latter probability 1.
So logical atomism may be just as dead an end as possible worlds, perhaps worse. But it is somewhat interesting to note that approaches like algorithmic complexity have similar issues. We may want to assign a string of bits a probability or a complexity (an entropy? an information content?), but we may also want to say that some such string corresponds to a proposition, e.g. a hypothesis we are interested in. There is some superficial way of associating a binary string with propositional formulas, by interpreting e.g. 1001 as a conjunction a∧¬b∧¬c∧d. But there likewise seems to be no room for quantifiers in this interpretation.
I guess a question is what you want to do with your entropy theory. Personally I would like to find some formalization of Ockham’s razor which is applicable to Bayesianism. Here the problems mentioned above appear fatal. Maybe for your purposes the issues aren’t as bad though?