Haha, I did initially start with trying to be more explanatory but that ended after a few sentences. Where I think this could immediately improve a lot of models is by replacing the VAEs everyone is using in diffusion models with information bottleneck autoencoders. In short: VAEs are viruses. In long: VAEs got popular because they work decently well, but they are not theoretically correct. Their paper gestures at a theoretical justification, but it settles for less than is optimal. They do work better than vanilla autoencoders, because they “splat out” encodings, which lets you interpolate between datapoints smoothly, and this is why everyone uses them today. If you ask most people using them, they will tell you it’s “industry standard” and “the right way to do things, because it is industry standard.” An information bottleneck autoencoder also ends up “splatting out” encodings, but has the correct theoretical backing. My expectation is that you will automatically get things like finer details and better instruction following (“the table is on the apple”), because bottleneck encoders have more pressure to conserve encoding bits for such details.
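To make that concrete, here is a minimal sketch of one way to train such a bottleneck autoencoder: the encoder outputs a distribution q(z|x), and the rate term E_x[KL(q(z|x) || r(z))] upper-bounds the mutual information I(X; Z) for any fixed prior r(z). The architecture, the sizes, and the choice of a fixed standard-normal prior are illustrative assumptions on my part (with that simplest prior the objective ends up looking β-VAE-like; a more faithful IB treatment would estimate or learn the marginal over codes more carefully).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAutoencoder(nn.Module):
    """Autoencoder with a stochastic code and an explicit rate penalty (illustrative sketch)."""

    def __init__(self, dim_x: int = 784, dim_z: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_x, 256), nn.ReLU(), nn.Linear(256, 2 * dim_z))
        self.dec = nn.Sequential(nn.Linear(dim_z, 256), nn.ReLU(), nn.Linear(256, dim_x))

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterised sample from q(z|x)
        return self.dec(z), mu, log_var

def ib_loss(model: BottleneckAutoencoder, x: torch.Tensor, beta: float) -> torch.Tensor:
    """Distortion + beta * rate, where the rate E_x[KL(q(z|x) || N(0, I))] upper-bounds I(X; Z)."""
    x_hat, mu, log_var = model(x)
    distortion = F.mse_loss(x_hat, x, reduction="none").sum(-1).mean()       # reconstruction error
    rate = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()  # KL to the fixed prior
    return distortion + beta * rate
```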
There are probably a few other places this would be useful (for example, in LLM autoregression, you should try to minimize the mutual information between the embeddings and the previous tokens), but I have yet to run experiments anywhere else, because estimating the mutual information is hard and makes training more fragile.
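Purely as a hypothetical sketch of what that could look like (again, I have not run this), the same variational bound applies to an autoregressive hidden state: make the per-position representation stochastic and penalise its KL to a fixed prior, which upper-bounds the mutual information between that representation and the preceding tokens.

```python
import torch
import torch.nn as nn

class NoisyBottleneck(nn.Module):
    """Stochastic bottleneck on a hidden state computed from previous tokens (hypothetical sketch).

    Averaged over data, KL(q(z_t | x_{<t}) || N(0, I)) upper-bounds I(Z_t ; X_{<t}),
    so adding beta * kl to the usual next-token cross-entropy pressures z_t to keep
    only the bits of the history it actually needs.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_stats = nn.Linear(dim, 2 * dim)  # mean and log-variance of q(z_t | x_{<t})

    def forward(self, h: torch.Tensor):
        mu, log_var = self.to_stats(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()
        return z, kl
```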
As for philosophy itself, I don’t particularly care for the subject. Philosophers too often assign muddy meanings to words and wonder why they’re confused ten propositions in. My goal when interacting with such sophistry is usually to define the words and figure out what that entails. I think philosophers just do not have the mathematical training to put into words what they mean, and even with that training it’s hard to do and will often be wrong. For example, I do not think the information bottleneck is a proper definition of “ontology”; it is closer to “describing an ontology”. It does not say why something is the way it is, but it helps you figure out what it is. It’s a way to find natural ontologies, but it does not say anything about how they came to be.
Thank you, just knowing you are coming strictly from an ML perspective already helps a lot. This was not obvious to me, as I have approached these topics more through a physics lens.
// So, addressing your implementation ideas: practically speaking, this approach is pretty neat! I lack the formal ML background to evaluate it properly, but it seems promising.
Now, I will try to succinctly decipher the theory behind your core idea, and you let me know how I do.
You propose compressing data into a form that preserves the core identity. It gives us something practical we can work with.
The elbow has variables that break symmetry to the left and variables that hold symmetry to the right. This is an important distinction between noise and signal that I think many miss.
This is all context dependent? Context defines the curve, i.e., the β parameter.
// How did I do?
Note: I should say at this point that understanding fundamental reality is my lifelong quest (constantly ignored in order to live out my little side quests), and I care about this topic. That quest is what ontology means in the classical, philosophical sense. When I speak about ontology in an AI context, I usually mean formal representations of reality, not induced ones. You seem to use the AI context but mean induced ontologies.
The ‘ontology as insensitivity’ concept described by johnswentworth is interesting, and basically follows from statistical mechanics. But it is perhaps missing the inherent symmetry aspect, or something replacing it, as a fundamental factor. You can’t remove all symmetry. Everything with identity exists within a symmetry. This is non-obvious and partly my own assertion, but looking at modern group theory, this is indeed how mathematics defines objects, and so I am supported within this framework.
If we take Wentworth’s idea and your elbow analogy, and try to define an object within a formal ontology, under my framework in which all objects exist within symmetries, then we get:
Concept = Total Reality / Symmetries (The Tail)
The “Elbow” doesn’t mark where reality ends and noise begins. It marks the resolution limit of your current context.
To the left of the elbow: Information that matters (Differences).
To the right of the elbow: Information that doesn’t matter (Equivalences/Symmetries).
Your example was a hand-written digit “7”. The Tail is the symmetries. You can slant the digit, thicken the line, or shift it left. These are the symmetries. As long as the variation stays in the “tail” of the curve, the identity “7” is preserved. (Note that the identity is relative and context dependent).
The Elbow: This is the breaking point. If you bend the top horizontal line too much, it becomes a “1”. You have left the chosen symmetry group of “7” and entered the chosen symmetry group of “1”.
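One minimal way to make that quotient precise, assuming the identity-preserving variations available in a given context form a group $G$ acting on the data space $X$: two datapoints express the same concept exactly when a symmetry carries one to the other, i.e. when they lie in the same orbit,

$$x \sim x' \iff \exists\, g \in G:\ x' = g \cdot x, \qquad \text{Concepts} \cong X / G,$$

so moving within an orbit (slanting, thickening, shifting the “7”) changes the datapoint but not the concept, while crossing between orbits (bending the “7” until it reads as a “1”) changes the concept. The elbow then says which transformations the current context is willing to admit into $G$.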
This is mostly correct, though I think there are phase changes making some β more natural than others.
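A minimal sketch of how one could look for those phase changes, assuming a bottleneck autoencoder like the one sketched above (the function names and the 0.01-nat threshold for calling a latent dimension “active” are my own illustrative choices): sweep β, train a fresh model each time, and record rate, distortion, and how many latent dimensions carry non-negligible rate. Plotting distortion against rate traces the curve whose elbow you describe; discrete jumps in the active-dimension count as β varies are one concrete signature of a phase change.

```python
import torch

def sweep_beta(make_model, loss_fn, x, betas, steps=2000, lr=1e-3, active_threshold=0.01):
    """Train one fresh model per beta and record rate, distortion, and "active" latent count.

    `make_model()` should return a fresh autoencoder whose forward pass yields
    (x_hat, mu, log_var), and `loss_fn(model, x, beta)` its training objective,
    e.g. the BottleneckAutoencoder / ib_loss sketch earlier in the thread.
    """
    results = []
    for beta in betas:
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model, x, beta).backward()
            opt.step()
        with torch.no_grad():
            x_hat, mu, log_var = model(x)
            # Per-dimension KL(q(z|x) || N(0, I)), averaged over the batch, in nats.
            per_dim = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).mean(0)
            results.append({
                "beta": beta,
                "rate": per_dim.sum().item(),
                "distortion": (x_hat - x).pow(2).sum(-1).mean().item(),
                "active_dims": int((per_dim > active_threshold).sum().item()),
            })
    return results
```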
If so, I would be genuinely curious to hear your ideas here. This might be an actually powerful concept if it holds up and you can formalize it properly. I assume you are an engineer, not a scientist? I think this idea deserves some deep thinking.
I don’t have any more thoughts on this at present, and I probably won’t think too much on it in the future, as it isn’t super interesting to me.