I’ll answer the second question, and hopefully the first will be answered in the process.
First, note that $P[X|M_2] \propto e^{\alpha u(X)}$, so arbitrarily large negative utilities aren’t a problem: they get exponentiated, and yield probabilities arbitrarily close to 0. The problem is arbitrarily large positive utilities. In fact, they don’t even need to be arbitrarily large, they just need to have an infinite exponential sum; e.g. if $u(X)$ is 1 for any whole number of paperclips $X$, then to normalize the probability distribution we need to divide by $\sum_{X=0}^{\infty} e^{\alpha \cdot 1} = \infty$. The solution to this is to just leave the distribution unnormalized. That’s what “improper distribution” means: it’s a distribution which can’t be normalized, because it sums to $\infty$.
The main question here seems to be “ok, but what does an improper distribution mean in terms of bits needed to encode X?”. Basically, we need infinitely many bits in order to encode X, using this distribution. But it’s “not the same infinity” for each X-value—not in the sense of “set of reals is bigger than the set of integers”, but in the sense of “we constructed these infinities from a limit so one can be subtracted from the other”. Every X value requires infinitely many bits, but one X-value may require 2 bits more than another, or 3 bits less than another, in such a way that all these comparisons are consistent. By leaving the distribution unnormalized, we’re effectively picking a “reference point” for our infinity, and then keeping track of how many more or fewer bits each X-value needs, compared to the reference point.
In the case of the paperclip example, we could have a sequence of utilities $u_n(X)$, each of which assigns utility $X$ to any number of paperclips $X < n$ (i.e. 1 util per clip, up to $n$ clips), and then we take the limit $n \to \infty$. Then our $n^{\text{th}}$ unnormalized distribution is $P_{\text{unnorm}}[X|M_n] = e^{\alpha X} I[X<n]$, and the normalizing constant is $Z_n = \frac{1-e^{\alpha n}}{1-e^{\alpha}}$, which grows like $O(e^{\alpha n})$ as $n \to \infty$. The number of bits required to encode a particular value $X<n$ is

$$-\log \frac{e^{\alpha X}}{Z_n} = \log \frac{1-e^{\alpha n}}{1-e^{\alpha}} - \alpha X$$
Key thing to notice: the first term, $\log \frac{1-e^{\alpha n}}{1-e^{\alpha}}$, is the part which goes to $\infty$ with $n$, and it does not depend on $X$. So, we can take that term to be our “reference point”, and measure the number of bits required for any particular $X$ relative to that reference point. That’s exactly what we’re implicitly doing if we don’t normalize the distribution: ignoring normalization, we compute the number of bits required to encode $X$ as

$$-\log e^{\alpha X} = -\alpha X$$
… which is exactly the “adjustment” from our reference point.
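If it helps to see numbers, here is a minimal Python sketch of the paperclip example. The function names and the choice α = 0.5 are mine, purely for illustration, and I convert from natural-log units to bits by dividing by ln 2. It computes the full code length under the truncated, normalized distribution for several values of n, and compares differences between X-values against the unnormalized version.

```python
import math

def code_length_bits(x, n, alpha):
    """Bits to encode x under the normalized distribution
    P[X|M_n] proportional to exp(alpha * x) on 0 <= x < n."""
    if not 0 <= x < n:
        raise ValueError("x must satisfy 0 <= x < n")
    # log Z_n via log-sum-exp, so large n does not overflow (assumes alpha > 0).
    m = alpha * (n - 1)
    log_z = m + math.log(sum(math.exp(alpha * k - m) for k in range(n)))
    return (log_z - alpha * x) / math.log(2)

def relative_bits(x, alpha):
    """Code length from the unnormalized distribution: -log2 exp(alpha * x)."""
    return -alpha * x / math.log(2)

alpha = 0.5  # illustrative value, not from the original discussion
for n in (10, 100, 1000):
    total = code_length_bits(3, n, alpha)
    diff = code_length_bits(3, n, alpha) - code_length_bits(5, n, alpha)
    print(f"n={n}: bits for X=3 is {total:.3f}, (X=3) minus (X=5) is {diff:.6f}")

print("unnormalized difference:", relative_bits(3, alpha) - relative_bits(5, alpha))
```

The total for X=3 keeps growing with n, but the gap between X=3 and X=5 stays put and matches the unnormalized difference. That gap is the only part which survives the limit.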
(Side note: this is exactly how information theory handles continuous distributions. An infinite number of bits is required to encode a real number, so we pull out a term $\log dx$ which diverges in the limit $dx \to 0$, and we measure everything relative to that. Equivalently, we measure the number of bits required to encode up to precision $dx$, and as long as the distribution is smooth and $dx$ is small, the number of bits required to encode the rest of $x$ using the distribution won’t depend on the value of $x$.)
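The continuous case can be sketched the same way. The example below is my own illustration, assuming a standard normal distribution and a simple grid of width-dx bins (neither comes from the discussion above). It computes the bits needed to encode the bin containing x, then adds back $\log_2 dx$ to strip off the diverging reference term.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    """Density of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bin_bits(x, dx):
    """Bits to encode the width-dx grid bin containing x."""
    lo = math.floor(x / dx) * dx
    mass = normal_cdf(lo + dx) - normal_cdf(lo)
    return -math.log2(mass)

def relative_bits(x, dx):
    """Bits relative to the diverging reference point -log2(dx)."""
    return bin_bits(x, dx) + math.log2(dx)

x = 0.7  # arbitrary illustrative point
for dx in (1e-1, 1e-2, 1e-3):
    print(f"dx={dx}: total bits {bin_bits(x, dx):.3f}, relative bits {relative_bits(x, dx):.4f}")

print("-log2 p(x):", -math.log2(normal_pdf(x)))
```

The total blows up as dx shrinks, while the relative figure settles toward $-\log_2 p(x)$, which is the quantity the density-based formulas actually track.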
Does this make sense? Should I give a different example/use more English?