Scale invariance is itself an emergent phenomenon.
Imagine scaling something up, say a physical law. If it changes, it is obviously not scale invariant, and it will generally keep changing with each further scale-up. If it does not change, it has reached a fixed point and will not change at the next scale-up either!
Scale invariances are just fixed points of coarse-graining.
Therefore, we should expect anything we think of as scale invariant to break down at small scales. For instance, electric charge is not scale invariant at small scales: the effective charge runs with the energy scale!
In the opposite direction: We should expect our physical laws to continue holding for the macro scale, if they are fixed points of scaling. This also explains the ubiquity of power laws in the natural sciences; power laws are the only relations that are scale invariant and thus preserved!
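A quick way to see the power-law claim, as a sketch (the standard one-line argument, assuming the relation is differentiable):

```latex
% Scale invariance means f(\lambda x) = C(\lambda)\, f(x) for every rescaling \lambda > 0.
% Differentiating in \lambda and setting \lambda = 1:
\[
  x f'(x) = C'(1)\, f(x)
  \quad\Longrightarrow\quad
  f(x) = f(1)\, x^{k}, \qquad k = C'(1),
\]
% so the only (differentiable) scale-invariant relations are power laws.
```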
All of this may seem tautological but is actually truly strange. To me this indicates that we should expect to be very, very far from the actual substrate of the universe.
Now go forth and study renormalisation group flow! ;)
Epistemic status: Just riffing!
This sounds like a fascinating insight, but I think I may be missing some physics context to fully understand.
Why is it that the derived laws approximating a true underlying physical law are expected to stay scale invariant over increasing scale after being scale invariant for two steps? Is there a reason that there can’t be a scale invariant region that goes back to being scale variant at large enough scales just like it does at small enough scales?
The act of coarse-graining/scaling up (an RG transformation) changes the theory that describes the system, specifically the theory's parameters. If you consider the space of all theories and iterate the coarse-graining, this induces a flow in which each theory is mapped to its coarse-grained version. This flow may possess attractors, that is, stable fixed points x*, meaning that when you apply the coarse-graining you get the same theory back.
And if f(x*)=x* then obviously f(f(x*))=x*, i.e. any repeated application will still yield the fixed point.
So you can scale up as much as you want; entering a fixed point really is a one-way street: you can check out any time you like, but you can never leave!
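As a toy illustration of "flow to a fixed point" (my own sketch in Python, not part of the argument above): the exact decimation RG for the 1D Ising chain maps the coupling K to K' with tanh K' = tanh²K. Iterating it drives any finite coupling to the trivial fixed point K = 0, which then maps to itself forever.

```python
import math

def decimate(K: float) -> float:
    """One RG step for the 1D Ising chain: integrate out every other spin.
    The renormalised coupling K' satisfies tanh(K') = tanh(K)**2."""
    return math.atanh(math.tanh(K) ** 2)

# Iterate the coarse-graining map from a few starting couplings.
for K0 in (0.1, 1.0, 3.0):
    K = K0
    trajectory = [K]
    for _ in range(8):
        K = decimate(K)
        trajectory.append(K)
    print(f"K0 = {K0}: " + " -> ".join(f"{k:.4f}" for k in trajectory))

# Every finite K flows towards the trivial fixed point K* = 0 (no phase
# transition in 1D), and K* = 0 maps exactly to itself: once you hit a
# fixed point of the coarse-graining, you stay there.
```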
The main source of scale invariance itself probably has to do with symmetry, meaning that an object has a particular property that is preserved across scales.
Space symmetry is an example: the basic physical laws are preserved across all scales of spacetime. In particular, scaling a system down doesn't mean different laws of physics apply at different scales; there is only one physical law, which produces varied consequences at all scales.
You’re making an interesting connection to symmetry! But scale invariance as discussed here is actually emergent—it arises when theories reach fixed points under coarse-graining, rather than being a fundamental symmetry of space. This is why quantities like electric charge can change with scale, despite spacetime symmetries remaining intact.
And while spacetime symmetries still seem scale invariant, considering the above argument they might also break down at small scales. It seems exceedingly unlikely that they would not! The initial parameters of the theory would have to be chosen just so as to be a fixed point. It seems much more likely that these symmetries emerged through RG flow rather than being fundamental.
While this is an interesting idea, I do still think space symmetries are likely to remain fundamental features of physics, rather than being emergent out of some other process.
I’ll bet you! ;)
Sadly my claim is somewhat unfalsifiable, because the emergence might always be hiding at some smaller scale, but I would be surprised if we find the theory that the Standard Model emerges from and it contains classical spacetime.
I did a little search, and if it’s worth anything Witten and Wheeler agree: https://www.quantamagazine.org/edward-witten-ponders-the-nature-of-reality-20171128/ (just search for ‘emergent’ in the article)
Can you have emergent spacetime while space symmetry remains a bedrock fundamental principle, and not emergent of something else?
I don’t know if that is a meaningful question.
Consider this: a cube is something that is symmetric under the octahedral group—that’s what *makes* it a cube. If it wasn’t symmetric under these transformations, it wouldn’t be a cube. So also with spacetime—it’s something that transforms according to the Poincaré group (plus some other mathematical properties, metric etc.). That’s what makes it spacetime.
So space symmetry is always assumed when we talk about spacetime, and if space symmetry didn’t hold, spacetime as we know it would not work/exist?
As a corollary: maybe power laws for AI should not surprise us; they are simply the default outcome of scaling.
Simplified: the Solomonoff prior is the distribution you get when you take a uniform distribution over all strings and feed them to a universal Turing machine.
Since the outputs are also strings: what happens if we iterate this? What is the stationary distribution? Is there even one? The fixed points will be quines, programs that copy their source code to the output. But how are they weighted? By their length? Presumably you can also have quine-cycles of programs that generate each other in turn, in a manner reminiscent of metagenesis. Do these quine-cycles capture all probability mass or does some diverge?
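To make the question concrete, here is a toy finite analogue (my own sketch, not a faithful model of Solomonoff induction): treat a handful of strings as "programs", let a lookup table play the role of the machine, start from the uniform distribution, and iterate the pushforward. Quines are fixed points, quine-cycles are periodic orbits, and mass on outputs outside the program set stands in for diverging mass.

```python
from collections import Counter

# A toy "machine": each program (a string) deterministically outputs a string.
# 'a' is a quine, 'b' and 'c' form a quine-cycle, 'd' and 'e' feed into them,
# and 'f' outputs something outside the program set (mass "escapes").
EVAL = {
    "a": "a",      # quine
    "b": "c",      # 2-cycle
    "c": "b",
    "d": "a",      # falls into the quine
    "e": "b",      # falls into the cycle
    "f": "zzzz",   # leaves the program set
}

def pushforward(dist: Counter) -> Counter:
    """One iteration: feed the current distribution through the machine."""
    out = Counter()
    for prog, p in dist.items():
        out[EVAL.get(prog, prog)] += p  # non-programs just sit where they are
    return out

dist = Counter({p: 1 / len(EVAL) for p in EVAL})  # "uniform prior"
for step in range(5):
    print(step, dict(dist))
    dist = pushforward(dist)
```

In this toy version the quine and the 2-cycle soak up most of the mass, and the cycle already shows how pointwise convergence can fail; whether the analogous statements hold for the real iterated Solomonoff prior is exactly what I am asking.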
Very grateful for answers and literature suggestions.
A few quick observations (each with like 90% confidence; I won’t provide detailed arguments atm, but feel free to LW-msg me for more details):
Any finite number of iterates just gives you the Solomonoff distribution up to at most a constant multiplicative difference (with the constant depending on how many iterates you do). My other points will be about the limit as we iterate many times.
The quines will have mass at least their prior, upweighted by some const because of programs which do not produce an infinite output string. They will generally have more mass than that, and some will gain mass by a larger multiplicative factor than others, but idk how to say something nice about this further.
Yes, you can have quine-cycles. Relevant tho not exactly this: https://github.com/mame/quine-relay
As you do more and more iterates, there's no convergence to a stationary distribution, at least in total variation distance. One reason is that you can write a quine which adds a string to itself (and then adds the same string again next time, and so on)[1], creating "a way for a finite chunk of probability to escape to infinity" (a minimal sketch of such a program follows below the footnote). So yes, some mass diverges.
Quine-cycles imply (or at least very strongly suggest) probabilities also do not converge pointwise.
What about pointwise convergence when we also average over the number of iterates? It seems plausible you get convergence then, but not sure (and not sure if this would be an interesting claim). It would be true if we could somehow think of the problem as living on a directed graph with countably many vertices, but idk how to do that atm.
There are many different stationary distributions — e.g. you could choose any distribution on the quines.
a construction from o3-mini-high: https://colab.research.google.com/drive/1kIGCiDzWT3guCskgmjX5oNoYxsImQre-?usp=sharing
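For concreteness, a minimal sketch (mine, separate from the linked construction) of the kind of self-extending quine meant in [1], assuming Python:

```python
# A "growing quine": its output is a three-line program that reproduces
# itself with one more "x" appended to `data` each time it is run, so the
# orbit program -> output -> output-of-output ... marches off to ever
# longer strings and never settles on a fixed point.
data = ""
src = 'data = %r\nsrc = %r\nprint(src %% (data + "x", src))'
print(src % (data + "x", src))
```

Each output in this chain is a perfectly good program with comparable prior mass, but the orbit never repeats, which is the "probability escaping to infinity" picture.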
Very relevant: https://web.archive.org/web/20090608111223/http://www.paul-almond.com/WhatIsALowLevelLanguage.htm
Thank you! I’ll have a look!
Is there an anthropic reason or computational (Solomonoff-pilled) argument for why we would expect the computational/causal graph of the universe to be this local (sparse)? Or at least appear local to a first approximation (Bell's inequality).
This seems like a quite special property. I suspect that either
it is not as rare in e.g. the Solomonoff prior as we might first intuit (see the toy sketch below), or
we should expect this for anthropic reasons, e.g. it is really hard to develop intelligence/do predictions in nonlocal universes.
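On the first branch, one weak hint that locality is cheap under a simplicity prior is that a fully local universe can be specified by a very short program. A minimal sketch, assuming Python, with elementary cellular automaton rule 110 as the stand-in "local universe":

```python
# A local "universe": each cell's next state depends only on its immediate
# neighbourhood. The entire dynamics fits in a few lines, i.e. it has very
# low description length, so local laws are at least not obviously penalised
# by a simplicity prior.
RULE = 110  # any elementary CA rule number would do

def step(cells: list[int]) -> list[int]:
    n = len(cells)
    out = []
    for i in range(n):
        # neighbourhood of radius 1, periodic boundary
        pattern = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        out.append((RULE >> pattern) & 1)
    return out

state = [0] * 31
state[15] = 1  # a single "particle" in the middle
for _ in range(16):
    print("".join(".#"[c] for c in state))
    state = step(state)
```

A nonlocal update rule, by contrast, generically has to specify who interacts with whom, which tends to cost description length; whether that intuition survives a proper counting argument is exactly the open question.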
In physics, it is sometimes asked why there should be just three (large) space dimensions. No one really knows, but there are various mathematical properties unique to three or four dimensions, to which appeal is sometimes made.
I would also consider the recent (last few decades) interest in the emergence of spatial dimensions from entanglement. It may be that your question can be answered by considering these two things together.
Simplicity Priors are Tautological
Any non-uniform prior inherently encodes a bias toward simplicity. This isn’t an additional assumption we need to make—it falls directly out of the mathematics.
For any hypothesis h, the information content is $I(h) = -\log P(h)$, which means probability and complexity have an exponential relationship: $P(h) = e^{-I(h)}$.
This demonstrates that simpler hypotheses (those with lower information content) are automatically assigned higher probabilities. The exponential relationship creates a strong bias toward simplicity without requiring any special mechanisms.
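A minimal numeric sketch of the identity (assuming Python; the hypothesis names and probabilities are made up for illustration):

```python
import math

# Any prior assigns each hypothesis an information content I(h) = -log2 P(h);
# exponentiating back recovers the probability, P(h) = 2**(-I(h)).
prior = {"h_simple": 0.5, "h_medium": 0.25, "h_complex": 0.125, "h_baroque": 0.125}

for h, p in prior.items():
    info = -math.log2(p)  # "complexity" in bits under this prior
    print(f"{h:10s}  P = {p:.3f}  I = {info:.1f} bits  2^-I = {2 ** -info:.3f}")

# Higher-probability hypotheses get shorter codewords (fewer bits), so any
# non-uniform prior already acts as a simplicity prior relative to its own code.
```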
The “simplicity prior” is essentially tautological—more probable things are simple by definition.
You can have a hypothesis with really high Kolmogorov complexity, but if the hypothesis is true 50% of the time, it will require 1 bit of information to specify with respect to a coding scheme that merely points to cached hypotheses.
This is why, when Kolmogorov complexity is defined, it's with respect to a fixed universal description language; otherwise you're right, it's vacuous to talk about the simplicity of a hypothesis.
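A small sketch of the point (assuming Python; both "languages" here are toy stand-ins, not real universal machines):

```python
import math
import random

random.seed(0)
# A hypothesis that is algorithmically messy: 128 random bits.
messy_hypothesis = "".join(random.choice("01") for _ in range(128))

# Language A: spell the hypothesis out literally -> ~128 bits.
bits_literal = len(messy_hypothesis)

# Language B: a "cached" scheme that already lists this hypothesis among two
# equally likely options -> 1 bit to pick it out.
cache = [messy_hypothesis, "some other hypothesis"]
bits_cached = math.ceil(math.log2(len(cache)))

print(bits_literal, bits_cached)  # 128 vs 1

# Both are valid description schemes; they just disagree about which things
# count as "simple". The invariance theorem only pins complexity down, up to
# an additive constant, once a fixed universal language is chosen.
```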
The Red Queen’s Race in Weight Space
In evolution we can tell a story that genes are selected not only for their function, but also for how easily modifiable they are. For example, having a generic antibiotic gene is much more useful than having an antibiotic gene locked into one target and far, in edit-distance terms, from any other useful variant.
Why would we expect the generic gene to be more common? There is selection pressure on having modifiable genes because environments are constantly shifting (the Red Queen hypothesis). Genes are modules with evolvability baked in by past selection.
Can we make a similar argument for circuits/features/modes in NNs? Obviously it is better to have a more general circuit, but can we also argue that “multitool circuits” are not only better at generalising but also more likely to be found?
SGD does not optimise loss but rather something like free energy, taking degeneracy (multiplicity) into account with some effective temperature.
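One way to make that heuristic concrete (a standard stationary-distribution approximation, not a precise claim about SGD):

```latex
% Heuristic: treat noisy SGD as sampling from a stationary distribution
%   \rho(w) \propto \exp(-L(w)/T_{\mathrm{eff}}).
% The probability of ending up in a basin B is then roughly
\[
  P(B) = \int_{B} \rho(w)\, dw
       \;\approx\; \exp\!\Big(-\tfrac{1}{T_{\mathrm{eff}}}\,\big[\,L_{B} - T_{\mathrm{eff}}\log \mathrm{Vol}(B)\,\big]\Big)
       = e^{-F(B)/T_{\mathrm{eff}}},
\]
% i.e. an effective free energy F(B) = L_{B} - T_{\mathrm{eff}}\log\mathrm{Vol}(B):
% flatter, more degenerate basins (more volume at a given loss) are favoured.
```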
But evolvability seems distinct from degeneracy. Degeneracy is a property of a single loss landscape, while evolvability is a claim about distribution shift. And the claim is not “I have low loss in the new distribution” but rather “I am very close to a low-loss solution of the new distribution.”
Degeneracy in ML ≈ mutational robustness in biology, which is straightforward, but that is not what I am pointing at here. Evolvability is closer to out-of-distribution adaptivity: the ability to move quickly into a new optimum with small changes.
Are there experiments where a model is trained on a shifting distribution?
Is the shifting distribution relevant or can this just as well be modeled as a mixture of the distributions, and what we think of as OOD is actually in the mixture distribution? In that case degeneracy is all you need.
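A sketch of the experiment I have in mind (assuming Python/NumPy; the tasks, schedule, and metric are all placeholders): train a small model while the data distribution alternates between two related tasks, compare against a control trained on the static 50/50 mixture, and measure how many steps the shifting-distribution model needs after each switch to get back to low loss.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20                                 # input dimension
w_a = rng.normal(size=D)               # task A: a linear teacher
w_b = w_a + 0.5 * rng.normal(size=D)   # task B: a nearby, shifted teacher

def batch(teacher, n=64):
    X = rng.normal(size=(n, D))
    return X, X @ teacher

def sgd_run(schedule, steps=4000, lr=0.01):
    """Train a linear student; `schedule(t)` returns the current teacher."""
    w = np.zeros(D)
    losses = []
    for t in range(steps):
        X, y = batch(schedule(t))
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5 * mean squared error
        w -= lr * grad
        losses.append(float(np.mean((X @ w - y) ** 2)))
    return np.array(losses)

period = 500
shifting = sgd_run(lambda t: w_a if (t // period) % 2 == 0 else w_b)
mixture  = sgd_run(lambda t: w_a if rng.random() < 0.5 else w_b)

# Recovery time: steps after each switch until loss is back under a threshold.
threshold = 0.1
for k in range(1, 8):
    window = shifting[k * period:(k + 1) * period]
    recovery = int(np.argmax(window < threshold)) if (window < threshold).any() else None
    print(f"switch {k}: recovery after {recovery} steps")
print(f"mixture baseline, final loss: {mixture[-500:].mean():.3f}")
```

The interesting comparison is whether solutions found under the shifting schedule end up closer (in parameter distance or fine-tuning steps) to both teachers than the mixture solution does, which is the evolvability-versus-degeneracy distinction above.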
Related ideas: cryptographic one-way functions (examples of unevolvable designs), out-of-distribution generalisation, mode connectivity.