I am sure you are already aware of this, but the conserved quantities we see come from symmetries in the function space (Noether’s theorem). The question I think you should be asking is, how do we extend this to random variables in information theory?
I am not sure of The Answer™, but I have an answer and I believe it is The Answer™: with the information bottleneck. Suppose we have a map f:X→Y from some bits in the world to some properties we care about. In my head, I’m using the example of X=MNIST digits and Y=their classifications. If there is a symmetry in X that f(X) is invariant to, this means we should lose no predictive information by transforming a sample x according to that symmetry
$$I(Y;X) = I(Y;gX)\quad \forall g \in G_{\text{symmetry}} \implies I(Y;X) = I(Y;X/G)$$
If we already know the symmetry we can force our predictor to be invariant to it. This is most commonly seen in chemistry models where they choose graph neural networks that are invariant to vertex and edge relabeling. However, usually it is hard to specify the exact symmetries (how do you encode that a “seven” is symmetric under stretching out its leg a little?), and the symmetries may also not be exact (a “seven” can bleed into a “one” by shortening its arm). The solution is to first run X through an autoencoder model that automatically finds these exact symmetries, and the inexact ones up to whatever bit precision you care about.
If we replace the exact X/G by the autoencoder g(X), then
$$I(Y;X) = I(Y;X/G) \approx I(Y;g(X)) \le I(Y;X)$$
means we want to maximize the mutual information between Y (the property we care about, like “sevenness”) and g(X). Also,
$$H(g(X)) \approx H(X/G) = H(X) - H(G),$$
meaning the more symmetries g(X) captures, the smaller its entropy should be. Since X probably has some fixed entropy, this is equivalent to minimizing the mutual information between X and g(X). Together, we have a tradeoff between maximizing I(Y;g(X)) and minimizing I(g(X);X), which is just the information bottleneck:
$$\max\; I(Y;g(X)) - \beta\, I(g(X);X).$$
The larger the β the more “stochastic symmetries” are eliminated, which means it gets closer to the essence of “sevenness” or whatever properties are in Y, but further from saying anything else about X. The fun thing you can do is make X and Y the same entity, and now you are getting the essence of X unsupervised (e.g., with MNIST, though it does have reconstruction loss too).
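To make the tradeoff concrete, here is a minimal sketch on a made-up toy world, not MNIST. The assumptions are mine: X is uniform on {0,1,2,3}, the symmetry group flips the low bit, and Y is the high bit. The names `ib_objective`, `orbit_map`, and so on are hypothetical. It compares three deterministic encoders under the objective above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(A;B) in bits from a joint probability table joint[a, b]."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

def joint_table(a_vals, b_vals, probs):
    """Build joint[a, b] from paired outcomes and their probabilities."""
    table = np.zeros((max(a_vals) + 1, max(b_vals) + 1))
    for a, b, p in zip(a_vals, b_vals, probs):
        table[a, b] += p
    return table

# Toy world (assumed for illustration): X uniform on {0,1,2,3},
# G = {identity, flip the low bit}, Y = high bit of X (invariant under G).
xs = [0, 1, 2, 3]
px = [0.25] * 4
ys = [x >> 1 for x in xs]

def ib_objective(encoder, beta):
    """Return (I(Y;T) - beta * I(T;X), I(Y;T), I(T;X)) for T = encoder(X)."""
    ts = [encoder(x) for x in xs]
    i_yt = mutual_information(joint_table(ys, ts, px))
    i_tx = mutual_information(joint_table(ts, xs, px))
    return i_yt - beta * i_tx, i_yt, i_tx

def identity_map(x):   # keeps everything, including the symmetry factor
    return x

def orbit_map(x):      # quotients out the low-bit flip, i.e. X/G
    return x >> 1

def constant_map(x):   # throws everything away
    return 0

for name, enc in [("identity", identity_map), ("orbit", orbit_map), ("constant", constant_map)]:
    print(name, ib_objective(enc, beta=0.5))
```

In this toy case the orbit map keeps all of I(Y;X) while halving I(g(X);X), so it should score best for intermediate β. It only works because Y is exactly invariant under the group; nothing here says how to find the orbit map when the symmetry is unknown or inexact.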
Finally, a little evidence that seems to align with autoencoders being the solution comes from adversarial robustness. For a decade or so, it was believed that generalization and adversarial robustness are counter to one another. This seems a little ridiculous to me now, but then again I was not old enough to be aware of the problem before the myth was dispelled (this myth has been dispelled, right? People today know that generalization and robustness are essentially the same problem, right?). Anyway, the way everyone was training for adversarial robustness is they took the training images, perturbed them as little as possible to get the model to mispredict (adversarial backprop), and then trained on these new images. This ended up making the generalization error worse (“Robustness May Be at Odds with Accuracy”). It turns out if you just first autoencode the images or use a GAN to keep the perturbations on-manifold, then it generalizes better (“Disentangling Adversarial Robustness and Generalization”). Almost like it captured the ontology better.
Perhaps relevant: An Informational Parsimony Perspective on Probabilistic Symmetries (Charvin et al 2024), on applying information bottleneck approaches to group symmetries:
… the projection on orbits of a symmetry group’s action can be seen as an information-preserving compression, as it preserves the information about anything invariant under the group action. This suggests that projections on orbits might be solutions to well-chosen rate-distortion problems, hence opening the way to the integration of group symmetries into an information-theoretic framework. If successful, such an integration could formalise the link between symmetry and information parsimony, but also (i) yield natural ways to “soften” group symmetries into flexible concepts more relevant to real-world data — which often lacks exact symmetries despite exhibiting a strong “structure” — and (ii) enable symmetry discovery through the optimisation of information-theoretic quantities.
First: Yes, this post seems to essentially be about thermodynamics and either way it is salient to immediately bring up symmetry. So agree on that point.
Symmetry, thermodynamics, information theory and ontology happen to be topics I take interest in (as stated in my LW bio).
Now, James, for your approach, I would like to understand better what you are saying here, and what you are actually claiming. Could you dumb this down or make it clearer? What scope/context do you intend for this approach? How far do you take it? And how much have you thought about it?
The tricky part when parsing John’s post is understanding what he means by “insensitive functions.” He doesn’t define it anywhere, and I think it’s because he was pointing at an idea but didn’t yet have a good definition for it. However, the example he gives (conservation of energy) occurs because the laws of physics are insensitive to some kind of symmetry, in this particular case time-translation. I’ve been thinking a lot about the relationship between symmetries, physics, and information theory this past year or two, and you can see some of my progress here and here. To me, it felt kind of natural to jump to “insensitive functions” being a sort of stochastic symmetry in the data.
I haven’t fleshed out exactly what that means. For exact symmetries, we can break up the data into a symmetry-invariant piece and the symmetry factor
$$x = (\hat{x}, g).$$
However, it feels like in real data, there is not such a clean separation. It’s closer to something like this: we could write x in “big-endian” form, so that we get finer and finer details about x as we read off more bits. My guess is there is an “elbow” in the importance of the bits, similar to how Chebyshev series have this elbow in coefficient magnitude that chebfun identifies and chops off for quicker calculations:
(Figure source: Chopping a Chebyshev Series)
In fact, as I’m writing this, I just realized that your autoencoder model could just be a discrete cosine (Chebyshev series) transform. It won’t be the best autoencoder to exist, but it is what JPEG uses. Anyway, I think the “arm”—or the bits to the left of the elbow—seems to form a natural ontology. The bits to the left of it seem to be doing something to help describe x, but not the bits to the right.
How does this relate to symmetries? Well, an exact symmetry is cleanly separable, which means the bits could be added after every other bit; it’s far to the right of the elbow. Chopping off the elbow does satisfy our idea of “ontology” in the exact symmetry case. Then all we need to do is create a model that chops off those uninteresting bits. The parameter β in the information bottleneck pretty much specifies a chopping point. The first term, I(Y;g(X)), says to keep important bits, while the second term, I(g(X);X), says to cut out unimportant bits, and β specifies at what point bits become too unimportant to leave in. You can slowly increase β until things start catastrophically failing (e.g. validation loss shoots up), at which point you’ve probably identified the elbow.
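For a feel of what “chopping at the elbow” looks like in the Chebyshev setting, here is a rough sketch. It is not chebfun’s actual chopping algorithm. It uses a plain relative-magnitude threshold, and the test function, the tolerance, and the helper name `chop_index` are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def chop_index(coeffs, tol=1e-13):
    """Index just past the last coefficient whose relative magnitude exceeds tol.

    A crude stand-in for an elbow detector: everything after this index
    is treated as sitting on the rounding-noise plateau.
    """
    mags = np.abs(coeffs) / np.max(np.abs(coeffs))
    significant = np.nonzero(mags > tol)[0]
    return int(significant[-1]) + 1

def f(x):
    # Smooth test function (assumed); its coefficients decay fast, then plateau.
    return np.exp(x) * np.sin(5 * x)

# Deliberately over-resolve with a degree-64 interpolant on [-1, 1].
coeffs = C.chebinterpolate(f, 64)

k = chop_index(coeffs)
chopped = coeffs[:k]

# Compare the chopped series against the function on a fine grid.
grid = np.linspace(-1, 1, 1001)
err = np.max(np.abs(C.chebval(grid, chopped) - f(grid)))
print("kept", k, "of", len(coeffs), "coefficients; max error", err)
```

The coefficients to the left of `k` are the “arm” that actually describes the function. The ones to the right are noise that can be dropped with no visible loss. The analogy to β is loose: here the threshold is set by hand rather than learned.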
No, the reason why we should have insensitivity is not quite that. As an example, suppose I have a long random bitstring of fair coinflips X_1, X_2, etc. Now say that the “laws of physics” randomly turn over some number of coins, but for some reason can only turn over an even number. This gives us variables Y_1, Y_2, … at the second timestep. That is, P(Xs, Ys) ∝ 2^{-sum(xor(Xs,Ys))} if xorsum(xor(Xs,Ys)) == 0 else 0
where xor is bitwise xor, sum is the sum of the string’s bits (as an integer, e.g. sum(110)=2), xorsum is the sum of the string’s bits mod 2, and it’s only proportional because we threw away the ones that changed an odd number of bits.
There’s a symmetry for the distribution of Xs: we can NOT any bit without affecting the distribution. Therefore it must be the uniform distribution—which we already knew, of course.
Likewise, the marginal distribution P(Ys) is symmetric under flipping any bit, since P(Xs, singleflip(Ys)) = P(singleflip(Xs), Ys) and so after summing over X we’ll just get a reindexing of our sum. So the Ys are also given by a uniform distribution. We could repeat our ‘laws of physics’, getting a sequence of bitstrings. The marginal distribution over the bitstrings at any time is still uniform.
The xorsum of the bitstring at any time is conserved over time. Yet it’s certainly not ‘insensitive’! If you look at some time, you cannot just use it as a latent for the distribution at that time, because you fail both mediation and redundancy.
I still think it’s likely that you can connect symmetries to redundantly encoded information, but it can’t be done this way.
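The even-flip dynamics described above can be checked by brute force for a short bitstring. This is a sketch under my own assumptions: n = 4 bits, states encoded as integers, and the joint built directly from the stated formula with a uniform initial distribution.

```python
import numpy as np

n = 4                                   # bitstring length (assumed small for enumeration)
states = np.arange(2 ** n)              # each integer encodes one bitstring
popcount = np.array([bin(s).count("1") for s in states])

# flips[x, y] = xor(Xs, Ys): which bits changed between the two timesteps.
flips = states[:, None] ^ states[None, :]
num_flipped = popcount[flips]

# P(Xs, Ys) ∝ 2^{-sum(xor)} if an even number of bits flipped, else 0.
weights = np.where(num_flipped % 2 == 0, 2.0 ** -num_flipped, 0.0)
joint = weights / weights.sum()

marginal_x = joint.sum(axis=1)
marginal_y = joint.sum(axis=0)

# Probability mass on transitions that change the xorsum (parity).
parity = popcount % 2
parity_changed = parity[:, None] != parity[None, :]
mass_on_parity_change = joint[parity_changed].sum()

print("X uniform:", np.allclose(marginal_x, 1 / 2 ** n))
print("Y uniform:", np.allclose(marginal_y, 1 / 2 ** n))
print("mass on parity-changing transitions:", mass_on_parity_change)
```

Both marginals come out uniform and no mass lands on parity-changing transitions. This matches the claim that the xorsum is conserved while the per-timestep distribution carries no trace of it.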
I don’t understand your objection. What do you mean here?
“If you look at some time, you cannot just use [the xorsum] as a latent for the distribution at that time, because you fail both mediation and redundancy.”
Which distribution? The distribution of is uniform. You do not need any information to perfectly recover this distribution at a particular timestep. Do you mean the joint distribution ? If is very large, then
where the distribution is uniform for the sets of bitstrings that share parity (xorsum) with , and zero for the other half. So parity here seems to be doing exactly the same thing that made energy a natural latent.
Parity is not a latent, because the information is not redundantly expressed/not insensitive.
Furthermore, after conditioning on the parity there will now be some mutual information between one of the bits and all the others, in that it’s the max value of 1 bit (since knowing all the others and the parity allows you to figure out the last). Thus we have the worst KL-error for mediation.
In both of these, I’m talking about the distribution of the bitstring at a certain time.
I’m not sure what your limit is supposed to mean—Y here is not always taking the bits at the same timestep, right? But then, why divide by n?
Why divide by n? This is the relative entropy rate. And actually, I was wrong, it does not go to zero. I’m still not sure I understand your objection though. It feels like it applies equally well to why energy should not be a natural latent?
Consider a set of gas particles trapped in a box with given positions and velocities. There is a little bit of randomness due to the box vibrating a bit due to its temperature. You can mostly predict where the gas particles will be from one time step to another. The energy, though, does not help with this prediction. It completely fails at mediating between time steps or reducing the complexity of the particle positions/velocities. Even if you know a couple of the particle positions/velocities, the energy isn’t going to help you find the rest.
The mediation condition is that when you condition on the latent, the mutual information between any one variable and the joint distribution of all other variables is low. In the case of the energy and temperature, once you know the energy and temperature, all the variables are now independent, and so you get no mutual information. However, with the parity, the rest of the variables let you figure out the last, so we fail mediation.
For redundancy, the energy and temperature are for the most part determined by any (n-1) variable subset, because of averaging. This isn’t true of the parity: the last bit being 50/50 means you still have total uncertainty over the parity.
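Both claims about the parity can be computed exactly for a uniform bitstring. A minimal sketch, assuming n = 5 and treating the parity as the candidate latent; the variable names `mediation_gap` and `redundancy_info` are mine, not standard terminology.

```python
import itertools
import math
from collections import Counter

def entropy(samples):
    """Entropy in bits of the empirical distribution over equally weighted samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

n = 5  # assumed small so every bitstring can be enumerated
strings = list(itertools.product([0, 1], repeat=n))   # uniform over all n-bit strings

last = [s[-1] for s in strings]          # one variable
rest = [s[:-1] for s in strings]         # all the other variables
parity = [sum(s) % 2 for s in strings]   # candidate latent (xorsum)

# Mediation check: I(last; rest | parity)
#   = H(last, parity) + H(rest, parity) - H(last, rest, parity) - H(parity)
mediation_gap = (
    entropy(list(zip(last, parity)))
    + entropy(list(zip(rest, parity)))
    - entropy(list(zip(last, rest, parity)))
    - entropy(parity)
)

# Redundancy check: I(parity; rest) = H(parity) + H(rest) - H(parity, rest)
redundancy_info = entropy(parity) + entropy(rest) - entropy(list(zip(parity, rest)))

print("I(last; rest | parity) =", mediation_gap, "bits")
print("I(parity; rest)        =", redundancy_info, "bits")
```

Conditioning on the parity leaves a full bit of dependence between the last bit and the rest, and the other n-1 bits carry zero bits about the parity. That is the numerical version of “fails both mediation and redundancy”. It does not settle whether those are the right conditions for “insensitivity”.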
You clearly have some idea of what “mediation” and “redundancy” mean for these particular scenarios and why they matter. I still have no clue what you mean by those words, why I should care about these properties, or how they relate to the notion of insensitivity.
Ah, I was talking about the conditions for natural latents, the main research program of the post author. See this post for a good math intro containing those definitions.
I now have the definitions, but I still don’t see the relation to insensitivity. Yes, natural latents are natural ontologies, but natural ontologies are not necessarily natural latents.
At the very least, the stochastic redund condition feels like a pretty minimal version of what ‘insensitivity’ could mean. The parity is still pretty maximally sensitive: if you’re trying to reduce your uncertainty about what the parity is, learning about (n-1) bits doesn’t even help you until you learn the last one! I doubt a good definition of “insensitivity” would call the parity insensitive.
What do you mean by “the stochastic redund condition”? Here’s what I feel like you’re doing: you have some unformalized intuitions. It seems to be the case that ‘insensitive’ stuff matches your intuition about redundancy for uncontrived examples. You then went and contrived an example where it didn’t match your intuition.
If I were in your situation, I would conclude, “my intuition is missing something, let me try to formalize this and see where I went wrong.”
I’m still really confused by your opening salvo:
“No, the reason why we should have insensitivity is not quite that.”
What do you mean??? What is “that”, what is “the reason why we should have insensitivity”? I think the reason we should have insensitivity is so the oracle can make predictions.
Also, I’m not going to continue responding. I do not think you have anything here. I think you are just confused, and you have not done the work to figure out what you yourself mean.
So, I feel like you just got deeper into the weeds here, thinking aloud. This seems interesting. I am trying to parse, but there is not enough formal context to make it make sense to me.
My main question was anyway, what w/could you use it for? What is the scope/context?
(Making some light banter) Maybe you are American, so I need to “debate” you to make it more obvious. “James, this is all a nice theoretical concept, but it seems useless practically. In its current form, I don’t see how it could be used for anything important”.
Haha, I did initially start with trying to be more explanatory but that ended after a few sentences. Where I think this could immediately improve a lot of models is by replacing the VAEs everyone is using in diffusion models with information bottleneck autoencoders. In short: VAEs are viruses. In long: VAEs got popular because they work decently well, but they are not theoretically correct. Their paper gestures at a theoretical justification, but it settles for less than is optimal. They do work better than vanilla autoencoders, because they “splat out” encodings which lets you interpolate between datapoints smoothly, and this is why everyone uses them today. If you ask most people using them, they will tell you it’s “industry standard” and “the right way to do things, because it is industry standard.” An information bottleneck autoencoder also ends up “splatting out” encodings, but has the correct theoretical backing. My expectations are you will automatically get things like finer details and better instruction following (“the table is on the apple”), because bottleneck encoders have more pressure to conserve encoding bits for such details.
There are probably a few other places this would be useful—for example, in LLM autoregression, you should try to minimize the mutual information between the embeddings and the previous tokens—but I have yet to do any experiments in other places. This is because estimating the mutual information is hard and makes training more fragile.
In terms of just philosophy, well I don’t particularly care for just the subject of philosophy. Philosophers too often assign muddy meanings to words and wonder why they’re confused ten propositions in. My goal when interacting with such sophistry is usually to define the words and figure out what that entails. I think philosophers just do not have the mathematical training to put into words what they mean, and even with that training it’s hard to do and will often be wrong. For example, I do not think the information bottleneck is a proper definition of “ontology” but is closer to “describing an ontology”. It does not say why something is the way it is, but it helps you figure out what it is. It’s a way to find natural ontologies, but does not say anything about how they came to be.
Thank you, just knowing you are strictly coming from a ML perspective already helps a lot. This was not obvious to me, who have approached these topics more from a physics lens.
// So, addressing your implementation ideas, this approach is practically speaking pretty neat! I lack formal ML background to properly evaluate it, but it seems neat.
Now, I will try to succinctly decipher the theory behind your core idea, and you let me know how I do.
You propose compressing data into a form that preserves the core identity. It gives us something practical we can work with.
The elbow has variables that break symmetry to the left and variables that hold symmetry to the right. This is an important distinction from noise and signal that I think many miss.
This is all context dependent? Context defines the curve, the Beta parameter.
// How did I do?
Note: I should say at this point, understanding fundamental reality is my lifelong quest (constantly ignored in order to live out my little side quests) and I care about this topic. This quest is what ontology means in the classical and philosophical sense. When I speak about ontology in AI-context, I usually mean formal representations of reality, not induced ones. You seem to use AI context but mean induced ontologies.
The ‘ontology as insensitivity’ concept described by johnswentworth is interesting, and basically follows from statistical mechanics. But it is perhaps missing the inherent symmetry aspect, or something replacing it, as a fundamental factor. You can’t remove all symmetry. Everything with identity exists within a symmetry. This is non-obvious and partly my own assertion, but looking at modern group theory, this is indeed how mathematics defines objects and so I am supported within this framework.
If we take wentworth’s idea and your elbow analogy, and try to define an object within a formal ontology, within my framework that all objects exist within symmetries, then we get:
$$\text{Concept} = \frac{\text{Total Reality}}{\text{Symmetries (The Tail)}}$$
The “Elbow” doesn’t mark where reality ends and noise begins. It marks the resolution limit of your current context.
To the left of the elbow: Information that matters (Differences).
To the right of the elbow: Information that doesn’t matter (Equivalences/Symmetries).
Your example was a hand-written digit “7”. The Tail is the symmetries. You can slant the digit, thicken the line, or shift it left. These are the symmetries. As long as the variation stays in the “tail” of the curve, the identity “7” is preserved. (Note that the identity is relative and context dependent).
The Elbow: This is the breaking point. If you bend the top horizontal line too much, it becomes a “1”. You have left the chosen symmetry group of “7” and entered the chosen symmetry group of “1”.
If so, I would be genuinely curious to hear your ideas here. This might be an actually powerful concept if it holds up and you can formalize it properly. I assume you are an engineer, not a scientist? I think this idea deserves some deep thinking.
Something which I think is highly relevant, and which might inform your GAN discussion, is the difference in performance of W-GAN: basically, if you train the GAN using an optimal transport metric instead of an information-theoretic one, it seems to have much better robustness properties, and this is probably because Shannon entropy doesn’t respect continuity of your underlying metric space (e.g. the KL divergence between Delta(x0) and Delta(x0 + epsilon) is infinity for any nonzero epsilon, so it doesn’t capture ‘closeness’). I don’t yet know how I think this should tie into the high-probability latent manifold story you tell, but it seems like part of it.
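The point about point masses is easy to see numerically. This is a sketch on a discretized 1-D grid (the grid size and spacing are arbitrary choices of mine), using the standard fact that the 1-D Wasserstein-1 distance is the integrated absolute difference of the CDFs.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in bits; infinite when p puts mass where q has none."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return float("inf")
    return float(np.sum(p[support] * np.log2(p[support] / q[support])))

def wasserstein_1d(p, q, dx):
    """W1 distance between two distributions on a uniform 1-D grid with spacing dx."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

GRID_SIZE = 100   # assumed discretization
DX = 0.01         # assumed grid spacing

def delta(index):
    """Point mass at a single grid cell (a discretized Delta(x0))."""
    p = np.zeros(GRID_SIZE)
    p[index] = 1.0
    return p

for shift in (1, 10, 50):
    p, q = delta(20), delta(20 + shift)
    print(f"epsilon={shift * DX:.2f}  KL={kl_divergence(p, q)}  W1={wasserstein_1d(p, q, DX):.2f}")
```

KL jumps straight to infinity for any nonzero shift, while W1 grows linearly with epsilon, which is the sense in which the transport metric “sees” closeness. This illustrates the metric property only; it says nothing about why W-GAN training behaves the way it does in practice.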
Perhaps you mean something else, but I am pretty sure your first statement
for all g in G: I(Y;X) = I(Y;gX)
⇒ I(Y;X) = I(Y;X/G)
is obviously false if P(Y,X/G) = \sum_g P(Y, gX) (that is, the probability of being in the orbit).
Here’s a counterexample: Say that X is a bit taken from a fair coinflip and Y is NOT(X). I(Y;X) = 1 and if we take the group action that just applies NOT again we see that I(Y;gX) = 1 as well. But X/G is just a single element and so I(Y;X/G) has to be 0 (you don’t learn anything when something with probability one happens!)
That is, just because X tells you just as much about Y as NOT(X) doesn’t mean that you could’ve learned just as much from knowing that you got either X or NOT(X), because duh.
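The counterexample is small enough to compute directly. A minimal sketch, taking P(Y, X/G) as the sum over the orbit as stated above; the table layout joint[x, y] is my own convention.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """Mutual information in bits between the row and column variables of a joint table."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())

# joint_xy[x, y]: X is a fair coin and Y = NOT(X).
joint_xy = np.array([[0.0, 0.5],
                     [0.5, 0.0]])

# Acting with g = NOT on X just relabels the rows.
joint_gxy = joint_xy[::-1, :]

# X/G has a single orbit {0, 1}, so P(Y, X/G) sums the rows together.
joint_orbit_y = joint_xy.sum(axis=0, keepdims=True)

print("I(Y;X)   =", mutual_information(joint_xy))
print("I(Y;gX)  =", mutual_information(joint_gxy))
print("I(Y;X/G) =", mutual_information(joint_orbit_y))
```

The first two come out to one bit and the last to zero, so the implication fails unless Y is itself invariant under the group action.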
And what is H(G) supposed to be? We aren’t randomly drawing from G.
Yes, I do mean something else. Not sure how to write it cleanly. If you look at the dual definition of the mutual information
you have some program trying to discriminate if comes from the joint or factored distributions. The idea is if we made the task “harder”, force it to try to discriminate while also arbitrarily multiplying by elements from , it can do just as well. Perhaps the right way to write this down is
where we sample
for the joint and
for the factored distribution.
EDIT: Here’s one way of formalizing this. Define a distribution as an element of with augmentation one. The notation is . If acts on , we can smear the distribution with a distribution :
First: Yes, this post seems to essentially be about thermodynamics and either way it is salient to immediately bring up symmetry. So agree on that point.
Symmetry, thermodynamics, information theory and ontology happen to be topics I take interest in (as stated in my LW bio).
Now, James, for your approach, I would like to understand better what you are saying here, and what you are actually claiming. Could you dumb this down or make it clearer? What scope/context do you intend for this approach? How far do you take it? And how much have you thought about it?
The tricky part when parsing John’s post is understanding what he means by “insensitive functions.” He doesn’t define it anywhere, and I think it’s because he was pointing at an idea but didn’t yet have a good definition for it. However, the example he gives—conservation of energy—occurs because the laws of physics are insensitive to some kind of symmetry, in this particular case time-translation. I’ve been thinking a lot about the relationship symmetries + physics + information theory this past year or two, and you can see some of my progress here and here. To me, it felt kind of natural to jump to “insensitive functions” being a sort of stochastic symmetry in the data.
I haven’t fleshed out exactly what that means. For exact symmetries, we can break up the data into a symmetry-invariant piece and the symmetry factor
x=(^x,g).However, it feels like in real data, there is not such a clean separation. It’s closer to something like this: we could write x “big-edian” form, so that we get finer and finer details about x as we read off more bits. My guess is there is an “elbow” in the importance of the bits, similar to how Chebyshev series have this elbow in coefficient magnitude that
chebfunidentifies and chops off for quicker calculations:(Source: Chopping a Chebyshev Series)
In fact, as I’m writing this, I just realized that your autoencoder model could just be a discrete cosine (Chebyshev series) transform. It won’t be the best autoencoder to exist, but it is what JPEG uses. Anyway, I think the “arm”—or the bits to the left of the elbow—seems to form a natural ontology. The bits to the left of it seem to be doing something to help describe x, but not the bits to the right.
How does this relate to symmetries? Well, an exact symmetry is cleanly separable, which means the bits could be added after every other bit—it’s far to the right of the elbow. Chopping off the elbow does satisfy our idea of “ontology” in the exact symmetry case. Then all we need to do is create a model that chops off those uninteresting bits. The parameter β in the information bottleneck pretty much specifies a chopping point. The first term, I(Y;g(X)), says to keep important bits, while the second term, I(g(x);X) says to cut out unimportant bits, and β specifies at what point bits become too unimportant to leave in. You can slowly increase β until things start catastrophically failing (e.g. validation loss goes down), at which point you’ve probably identified the elbow.
No, the reason why we should have insensitivity is not quite that. As an example suppose I have a long random bitstring of fair coinflips X_1,X_2, etc. Now say that the “laws of physics” randomly turns over some number of coins—but for some reason can only turn over an even number. This gives us variables Y_1, Y_2, … at the second timestep. That is, P(Xs, Ys) ∝ 2^{-sum(xor(Xs,Ys))} if xorsum(xor(Xs,Ys)) == 0 else 0
where xor is bitwise xor, sum is sum of strings bits (as an integer, e.g. sum(110)=2), xorsum is sum of string’s bits mod 2, and it’s only proportional because we threw away the ones that changed an odd number of bits.
There’s a symmetry for the distribution of Xs: we can NOT any bit without affecting the distribution. Therefore it must be the uniform distribution—which we already knew, of course.
Likewise, the marginal distribution P(Ys) is symmetric under flipping any bit, since P(Xs, singleflip(Ys)) = P(singleflip(Xs), Ys) and so after summing over X we’ll just get a reindexing of our sum. So the Ys are also given by a uniform distribution. We could repeat our ‘laws of physics’, getting a sequence of bitstrings. The marginal distribution over the bitstrings at any time is still uniform.
The xorsum of the bitstring at any time is conserved over time. Yet it’s certainly not ‘insensitive’! If you look at some time, you cannot just use it as a latent for the distribution at that time, because you fail both mediation and redundancy.
I still think it’s likely that you can connect symmetries to redundantly encoded information, but it can’t be done this way.
I don’t understand your objection. What do you mean here?
“If you look at some time, you cannot just use [the xorsum] as a latent for the distribution at that time, because you fail both mediation and redundancy.”
Which distribution? The distribution of is uniform. You do not need any information to perfectly recover this distribution at a particular timestep. Do you mean the joint distribution ? If is very large, then
where the distribution is uniform for the sets of bitstrings that share parity (xorsum) with , and zero for the other half. So parity here seems to be doing exactly the same thing that made energy a natural latent.
Parity is not a latent, because the information is not redundantly expressed/not insensitive.
Furthermore, after conditioning on the parity there will now be some mutual information between one of the bits and all the others, in that it’s the max value of 1 bit (since knowing all the others and the parity allows you to figure out the last). Thus we have the worst KL-error for mediation.
In both of these, I’m talking about the distribution of the bitstring at a certain time.
I’m not sure what your limit is supposed to mean—Y here is not always taking the bits at the same timestep, right? But then, why divide by n?
Why divide by ? This is the relative entropy rate. And actually, I was wrong, it does not go to zero. I’m still not sure I understand your objection though. It feels like it applies equally well to why energy should not be a natural latent?
The mediation condition is that when you condition on the latent, the mutual information between any one variable and the joint distribution of all other variables is low. In the case of the energy and temperature, once you know the energy and temperature, all the variables are now independent, and so you get no mutual information. However, with the parity, the rest of the variables let you figure out the last, so we fail mediation.
For redundancy, the energy and temperature is for the most part determined by any (n-1) variable subset, becaues averages. This isn’t true of the parity—the last bit being 50⁄50 means you still have total uncertainty over the parity.
You clearly have some idea of what “mediation” and “redundancy” means for these particular scenarios and why they matter. I still have no clue what you mean by those words, why I should care about these properties, or how they related to the notion of insensitivity.
Ah, I was talking about the conditions for natural latents, the main research program of the post author. See this post for a good math intro containing those definitions.
I now have the definitions, but I still don’t see the relation to insensitivity. Yes, natural latents are natural ontologies, but natural ontologies are not necessarily natural latents.
At the very least, the stochastic redund condition feels like a pretty minimal version of what ‘insensitivity’ could mean. The parity is still pretty maximally sensitive: if you’re trying to reduce your uncertainty about what the parity is, learning about (n-1) bits doesn’t even help you until you learn the last one! I doubt a good definition of “insensitivity” would call the parity insensitive.
What do you mean by “the stochastic redund condition”? Here’s what I feel like you’re doing: you have some unformalized intuitions. It seems to be the case that ‘insensitive’ stuff matches your intuition about redundancy for uncontrived examples. You then went and contrived an example where it didn’t match your intuition.
If I were in your situation, I would conclude, “my intuition is missing something, let me try to formalize this and see where I went wrong.”
I’m still really confused by your opening salvo:
“No, the reason why we should have insensitivity is not quite that.”
What do you mean??? What is “that”, what is “the reason why we should have insensitivity”? I think the reason we should have insensitivity is so the oracle can make predictions.
Also, I’m not going to continue responding. I do not think you have anything here. I think you are just confused, and you have not done the work to figure out what you yourself mean.
So, I feel like you just got deeper into the weeds here, thinking aloud. This seems interesting. I am trying to parse, but there is not enough formal context to make it make sense to me.
My main question was anyway, what w/could you use it for? What is the scope/context?
(Making some light banter) Maybe you are American, so I need to “debate” you to make it more obvious. “James, this is all a nice theoretical concept, but it seems useless practically. In its current form, I don’t see how it could be used for anything important.”
Haha, I did initially start with trying to be more explanatory but that ended after a few sentences. Where I think this could immediately improve a lot of models is by replacing the VAEs everyone is using in diffusion models with information bottleneck autoencoders. In short: VAEs are viruses. In long: VAEs got popular because they work decently well, but they are not theoretically correct. Their paper gestures at a theoretical justification, but it settles for less than is optimal. They do work better than vanilla autoencoders, because they “splat out” encodings which lets you interpolate between datapoints smoothly, and this is why everyone uses them today. If you ask most people using them, they will tell you it’s “industry standard” and “the right way to do things, because it is industry standard.” An information bottleneck autoencoder also ends up “splatting out” encodings, but has the correct theoretical backing. My expectations are you will automatically get things like finer details and better instruction following (“the table is on the apple”), because bottleneck encoders have more pressure to conserve encoding bits for such details.
There are probably a few other places this would be useful—for example, in LLM autoregression, you should try to minimize the mutual information between the embeddings and the previous tokens—but I have yet to do any experiments in other places. This is because estimating the mutual information is hard and makes training more fragile.
In terms of just philosophy, well I don’t particularly care for just the subject of philosophy. Philosophers too often assign muddy meanings to words and wonder why they’re confused ten propositions in. My goal when interacting with such sophistry is usually to define the words and figure out what that entails. I think philosophers just do not have the mathematical training to put into words what they mean, and even with that training it’s hard to do and will often be wrong. For example, I do not think the information bottleneck is a proper definition of “ontology” but is closer to “describing an ontology”. It does not say why something is the way it is, but it helps you figure out what it is. It’s a way to find natural ontologies, but does not say anything about how they came to be.
Thank you, just knowing you are strictly coming from an ML perspective already helps a lot. This was not obvious to me, as I have approached these topics more through a physics lens.
//
So, addressing your implementation ideas, this approach is practically speaking pretty neat! I lack formal ML background to properly evaluate it, but it seems neat.
Now, I will try to succinctly decipher the theory behind your core idea, and you let me know how I do.
You propose compressing data into a form that preserves the core identity. It gives us something practical we can work with.
The elbow has variables that break symmetry to the left and variables that hold symmetry to the right. This is an important distinction from noise and signal that I think many miss. (Edit: amended “between” to “from”.)
This is all context dependent? Context defines the curve, the Beta parameter.
// How did I do?
Note: I should say at this point that understanding fundamental reality is my lifelong quest (constantly ignored in order to live out my little side quests), and I care about this topic. This quest is what ontology means in the classical, philosophical sense. When I speak about ontology in an AI context, I usually mean formal representations of reality, not induced ones. You seem to use the AI context but mean induced ontologies.
The ‘ontology as insensitivity’ concept described by johnswentworth is interesting, and basically follows from statistical mechanics. But it is perhaps missing the inherent symmetry aspect, or something replacing it, as a fundamental factor. You can’t remove all symmetry. Everything with identity exists within a symmetry. This is non-obvious and partly my own assertion, but looking at modern group theory, this is indeed how mathematics defines objects, and so I am supported within this framework.
If we take wentworth’s idea and your elbow analogy, and try to define an object within a formal ontology, within my framework that all objects exist within symmetries, then we get:
Concept = Total Reality / Symmetries (The Tail)
The “Elbow” doesn’t mark where reality ends and noise begins. It marks the resolution limit of your current context.
To the left of the elbow: Information that matters (Differences).
To the right of the elbow: Information that doesn’t matter (Equivalences/Symmetries).
Your example was a hand-written digit “7”. The Tail is the symmetries. You can slant the digit, thicken the line, or shift it left. These are the symmetries. As long as the variation stays in the “tail” of the curve, the identity “7” is preserved. (Note that the identity is relative and context dependent).
The Elbow: This is the breaking point. If you bend the top horizontal line too much, it becomes a “1”. You have left the chosen symmetry group of “7” and entered the chosen symmetry group of “1”.
This is mostly correct, though I think there are phase changes making some β more natural than others.
If so, I would be genuinely curious to hear your ideas here. This might be an actually powerful concept if it holds up and you can formalize it properly. I assume you are an engineer, not a scientist? I think this idea deserves some deep thinking.
I don’t have any more thoughts on this at present, and I probably won’t think too much on it in the future, as it isn’t super interesting to me.
Something which I think is highly relevant, and which might inform your GAN discussion, is the difference in performance of W-GAN. Basically, if you train the GAN using an optimal transport metric instead of an information-theoretic one, it seems to have much better robustness properties, and this is probably because Shannon entropy doesn’t respect the continuity of your underlying metric space (e.g., the KL divergence between Delta(x0) and Delta(x0 + epsilon) is infinite for any nonzero epsilon, so it doesn’t capture ‘closeness’). I don’t yet know how I think this should tie into the high-probability latent manifold story you tell, but it seems like part of it.
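The delta-function point can be made concrete with two one-hot histograms. This is a minimal sketch, assuming a shared evenly spaced 1-D grid; the functions `kl_divergence` and `wasserstein_1d` and the grid spacing are illustrative choices, not anything from a W-GAN implementation.

```python
from math import inf, log


def kl_divergence(p, q):
    """KL(p || q) in nats for two histograms over the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return inf  # p puts mass where q has none
            total += pi * log(pi / qi)
    return total


def wasserstein_1d(p, q, spacing):
    """W1 between two histograms on the same evenly spaced 1-D grid.

    In one dimension W1 is the integral of |CDF_p - CDF_q|.
    """
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total


spacing = 0.01                    # assumed epsilon: one grid step
delta_at_x0 = [0, 0, 1, 0, 0]     # all mass at x0
delta_shifted = [0, 0, 0, 1, 0]   # all mass at x0 + epsilon

kl = kl_divergence(delta_at_x0, delta_shifted)
w1 = wasserstein_1d(delta_at_x0, delta_shifted, spacing)
print("KL:", kl)
print("W1:", w1)
```

The KL divergence is infinite no matter how small the shift is, while the transport distance is just the shift itself, so only the latter shrinks smoothly as the two points approach each other.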
Perhaps you mean something else, but I am pretty sure your first statement
for all g in G: I(Y;X) = I(Y;gX)
⇒ I(Y;X) = I(Y;X/G)
is obviously false if P(Y,X/G) = \sum_g P(Y, gX) (that is, the probability of being in the orbit).
Here’s a counterexample: Say that X is a bit taken from a fair coinflip and Y is NOT(X). I(Y;X) = 1, and if we take the group action that just applies NOT again, we see that I(Y;gX) = 1 as well. But X/G is just a single element, and so I(Y;X/G) has to be 0 (you don’t learn anything when something with probability one happens!)
That is, just because X tells you just as much about Y as NOT(X) doesn’t mean that you could’ve learned just as much from knowing that you got either X or NOT(X), because duh.
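The counterexample is small enough to compute directly. This is a minimal sketch; the helper `mutual_information` and the label `"orbit"` for the single element of X/G are illustrative names.

```python
from collections import defaultdict
from math import log2


def mutual_information(joint):
    """I(A;B) in bits, given a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


# X is a fair coin and Y = NOT(X); keys are (x, y) pairs.
joint_x_y = {(0, 1): 0.5, (1, 0): 0.5}

# Apply the group element g = NOT to X.
joint_gx_y = {(1 - x, y): p for (x, y), p in joint_x_y.items()}

# Quotient by G = {identity, NOT}: both values of X land in one orbit,
# with P(Y, X/G) obtained by summing over the orbit.
joint_orbit_y = defaultdict(float)
for (x, y), p in joint_x_y.items():
    joint_orbit_y[("orbit", y)] += p

i_x = mutual_information(joint_x_y)
i_gx = mutual_information(joint_gx_y)
i_orbit = mutual_information(joint_orbit_y)
print(i_x, i_gx, i_orbit)
```

Here I(Y;X) and I(Y;gX) are both 1 bit, while I(Y;X/G) is 0 bits, so the implication fails under this reading of X/G.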
And what is H(G) supposed to be? We aren’t randomly drawing from G.
Yes, I do mean something else. Not sure how to write it cleanly. If you look at the dual definition of the mutual information
you have some program trying to discriminate if (x, y) comes from the joint or factored distributions. The idea is that if we made the task “harder”, forcing it to discriminate while x is also arbitrarily multiplied by elements from G, it can do just as well. Perhaps the right way to write this down is
where we sample
for the joint and
for the factored distribution.
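For reference, one standard dual form of the mutual information is the Donsker-Varadhan representation. It is written out here on the assumption that this is the dual definition meant above; it is not a reconstruction of the formulas originally shown in this comment. The critic T (a symbol introduced here) plays the role of the “program” that scores whether a pair looks like it came from the joint or from the factored distribution.

```latex
I(X;Y) \;=\; D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big)
\;=\; \sup_{T}\;
\mathbb{E}_{(x,y)\sim p(x,y)}\big[T(x,y)\big]
\;-\; \log \mathbb{E}_{x\sim p(x),\,y\sim p(y)}\big[e^{T(x,y)}\big]
```

The supremum is over all critics T for which both expectations are finite. Restricting T to critics that are unchanged when x is replaced by gx can only lower the supremum, so a symmetry claim in this form says the restricted and unrestricted values coincide.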
EDIT: Here’s one way of formalizing this. Define a distribution as an element of with augmentation one. The notation is . If acts on , we can smear the distribution with a distribution :
The symmetry condition is