One of the classic conceptual problems with a Solomonoff-style approach to probability, information, and stat mech is “Which Turing machine?”. The choice of Turing machine is analogous to the choice of prior in Bayesian probability. While universality means that any two Turing machines give roughly the same answers in the limit of large data (unlike two priors in Bayesian probability, where there is no universality assumption/guarantee), they can be arbitrarily different before then.
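For concreteness, the universality claim here is the usual invariance theorem (standard background, not specific to the paper): for any two universal machines $U_1, U_2$ there is a constant $c_{U_1 U_2}$, independent of $x$, with

$$K_{U_1}(x) \;\le\; K_{U_2}(x) + c_{U_1 U_2},$$

and the corresponding Solomonoff priors agree up to the multiplicative factor $2^{\pm c_{U_1 U_2}}$. That constant is exactly the slack which can be “arbitrarily different before then.”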
My usual answer to this problem is “well, ultimately this is all supposed to tell us things about real computational systems, so pick something which isn’t too unreasonable or complex for a real system”.
But lately I’ve been looking at Aram Ebtekar and Marcus Hutter’s Foundations of Algorithmic Thermodynamics. Based on both the paper and some discussion with Aram (along with Steve Petersen), I think there’s maybe a more satisfying answer to the choice-of-Turing-machine issue in there.
Two key pieces:
The “Comparison against Gibbs-Shannon entropy” section of the paper argues that uncomputability is a necessary feature, in order to assign entropy to individual states and still get a Second Law. The argument says: if there exists a short program which can provably find and output a high-entropy string S, then we can physically instantiate a machine to run that short program. Then, when that physical machine spits out the high-entropy string S, S could be used to erase another copy of S. In other words, there is some high-entropy state (S) which this physical machine + program could steer into a low-entropy state.
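A minimal sketch of how I read that argument (my paraphrase, not the paper’s exact statement): if a program $p$ outputs $S$, then

$$K(S) \;\le\; |p| + O(1),$$

so a short, provably-correct $p$ that finds an “entropy $\ge n$” string would itself witness $K(S) \ll n$; physically, running $p$ and using its output to reversibly erase a stored copy of $S$ would take roughly $n$ bits of algorithmic entropy down to roughly $|p|$ bits. Avoiding that contradiction is what forces high entropy to be unprovable, hence uncomputable.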
As Aram pointed out, most of the bounds have a constant for the complexity of the laws of physics. If we choose a machine for which the laws of physics have high complexity, then the bounds are quantitatively trash.
The first piece is a part of the theory which can only bind to reality insofar as our chosen Turing machine is tractable to physically implement. The second piece is a part of the theory which can only bind to reality insofar as our physics can be tractably implemented on our chosen Turing machine.
In other words: in order for this thermodynamic theory to work well, we need to choose a Turing machine which is “computationally equivalent to” physics, in the sense that our physics can run the machine without insane implementation size, and the machine can run our physics without insane implementation size.
I’m still wrapping my head around all the pieces here, so hopefully I (or, better yet, someone else) will write up a clearer explainer in the future. But this smells really promising to me. Not just for purposes of Solomonoff thermodynamics, but also as a more principled way to tackle bounded rationality of embedded systems.
Cf. https://web.archive.org/web/20120331071849/http://www.paul-almond.com/WhatIsALowLevelLanguage.htm
Notably that post has a section arguing against roughly the sort of thing I’m arguing for:
Making the definition of what constitutes a low level language dependent on laws of physics is removing it from the realm of mathematics and philosophy. It is not a property of the language any more, but a property shared by the language and physical reality.
My response would be: yes, what-constitutes-a-low-level-language is obviously contingent on our physics and even on our engineering, not just on the language. I wouldn’t even expect aliens in our own universe to have low-level programming languages very similar to our own. Our low-level languages today are extremely dependent on specific engineering choices made in the mid 20th century which are now very locked in by practice, but do not seem particularly fundamental or overdetermined, and would not be at all natural in universes with different physics or cultures with different hardware architecture. Aliens would look at our low-level languages and recognize them as low-level for our hardware, but not at all low-level for their hardware.
Analogously: choice of a good computing machine depends on the physics of one’s universe.
I do like the guy’s style of argumentation a lot, though.
The proposal at the end looks somewhat promising to me on a first skim. Are there known counterpoints for it?
the machine can run our physics without insane implementation size.
I’m well out of my depth here, and this is probably a stupid question, but given the standard views of the “known” part of our physics, does that mean that the machine can do operations on arbitrary, fully precise complex numbers in constant time?
The continuous state-space is coarse-grained into discrete cells where the dynamics are approximately Markovian (the theory is currently classical) & the “laws of physics” probably refers to the stochastic matrix that specifies the transition probabilities of the discrete cells (otherwise we could probably deal with infinite precision through limit computability).
Doesn’t such a discretization run into the fermion doubling problem?
The current theory is based on classical Hamiltonian mechanics, but I think the theorems apply whenever you have a Markovian coarse-graining. Fermion doubling is a problem for spacetime discretization in the quantum case, so the coarse-graining might need to be different. (E.g. coarse-grain the entire Hilbert space, which might have locality issues, but that’s probably not load-bearing for algorithmic thermodynamics.)
On an outside view, quantum reduces to classical (which admits a Markovian coarse-graining) in the correspondence limit, so there must be some coarse-graining that works.
In practice, we only ever measure things to finite precision. To predict these observations, all we need is to be able to do these operations to any arbitrary specified precision. Runtime is not a consideration here; while time-constrained notions of entropy can also be useful, their theory becomes messier (e.g., the 2nd law won’t hold in its current form).
Good question, it’s the right sort of question to ask here, and I don’t know the answer. That does get straight into some interesting follow-up questions about e.g. the ability to physically isolate the machine from noise, which might be conceptually load-bearing for things like working with arbitrary precision quantities.
I’ve been thinking about it in terms of “but which language are we using to compute the complexity of our universe/laws of physics?”. Usually I likewise just go “only matters up to an additive constant, just assume we’re not using a Turing tarpit and we’re probably good”. If we do dig into it, though, what can we conclude?
Some thoughts:
What is the “objectively correct” reference language?
We should, of course, assume that the algorithm computing our universe is simple to describe in terms of the “natural” reference language, due to the simplicity prior. I. e., it should have support for the basic functions our universe’s physics computes. I think that’s already equivalent to “the machine can run our physics without insane implementation size”.
On the flip side, it’s allowed to lack support for functions our universe can’t cheaply compute. For example, it may not have primitive functions for solving NP-complete problems. (In theory, I think there was nothing stopping physics from having fundamental particles that absorb Traveling Salesman problems and near-instantly emit their solutions.)
Now suppose we also assume that our observations are sampled from the distribution over all observers in Tegmark 4. This means that when we’re talking about the language/TM underlying it, we’re talking about some “natural”, “objective” reference language.
What can we infer about it?
First, as mentioned, we should assume the reference language is not a Turing tarpit. After all, if we allowed reality to “think” in terms of some arbitrarily convoluted Turing-tarpit language, we could arbitrarily skew the simplicity prior.
But what is a “Turing tarpit” in that “global”/“objective” sense, not defined relative to some applications/programs? Intuitively, it feels like “one of the normal, sane languages that could easily implement all the other sane languages” should be possible to somehow formalize...
Which is to say: when we’re talking about the Kolmogorov complexity of some algorithm, in what language are we measuring it? Intuitively, we want to, in turn, pick one of the “simplest” languages to define.[1] But what language do we pick for measuring this language’s complexity? An infinite recursion follows.
Intuitively, there’s perhaps some way to short-circuit that recursion. (Perhaps by somehow defining the complexity of a language by weighing its complexity across “all” languages while prioritizing the opinions of those languages which are themselves simple in terms of whatever complexity measure this expression defines? Or something along those lines; circular definitions aren’t always a problem. (Though see an essay Tsvi linked to which breaks down why many of those definitions don’t work.)) One speculative way to write this down is sketched below.
Regardless, if something like this is successful, we’ll get a “global” definition of what counts as a simple/natural language. This would, in turn, allow us to estimate the “objective” complexity of various problems, by measuring the length of their solutions in terms of that natural language (i. e., the length of the execution trace of a computation solving the problem). This would perhaps show that some problems are “objectively” hard, such as some theoretical/philosophical problems or the NP-complete problems.
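One speculative way to write down that circular definition (entirely my own formalization; the particular weighting scheme is an assumption): let $K_{L'}(L)$ be the length of the shortest implementation of language $L$ in language $L'$, and look for weights satisfying

$$w(L) \;=\; \frac{1}{Z}\sum_{L'} w(L')\, 2^{-K_{L'}(L)},$$

i.e., a language counts as simple to the extent that languages which are themselves simple (under the same weighting) can implement it cheaply. Whether such a fixed point exists, and whether it is unique and non-degenerate, is exactly the open part.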
The speed prior
What if we try to compute the complexity not of the laws of physics, but of a given observer-moment/universe-state, and penalize the higher-complexity ones?
In chaotic systems, this actually works out to the speed prior: i. e., to assuming that the later steps of a program have less realityfluid than the early ones. Two lines of reasoning:
The farther in time a state is,[2] the more precisely you have to specify the initial conditions in order to hit it.
Justification: Suppose the program’s initial state is parametrized by real numbers. As it evolves, ever-more-distant decimal digits become relevant. This means that, if you want to simulate the universe on a non-analog computer (i. e., a computer that doesn’t use unlimited-precision reals) from t=0 to t=n starting from some initial state S0, with the simulation error never exceeding some value, the precision with which you have to specify S0 scales with n. Indeed, as n goes to infinity, so does the needed precision (i. e., the description length). (A toy numerical illustration of this appears right after this list.)
Aside from picking the initial state that generates the observation, you also have to pinpoint that observation in the execution trace of the program. It can be as easy as defining the time-step (if you’re working with classical mechanics), or as difficult as pointing at a specific Everett branch. And pinpointing generally gets more expensive with time (even in the trivial case of “pick a time-step”, the length of the number you have to provide grows).
Anthropically, this means that the computations implementing us are (relatively) stable, and produce “interesting” states (relatively) quickly/in few steps.
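A toy numerical illustration of the first line of reasoning (my own sketch; the doubling map stands in for a chaotic system, and nothing here is taken from the paper):

```python
# Toy demo: for the chaotic doubling map x -> 2x mod 1, keeping the simulation
# error below a fixed tolerance at time n requires specifying the initial
# condition to roughly n extra binary digits.
from fractions import Fraction
import random

def step(x):
    return (2 * x) % 1  # doubling map, exact rational arithmetic

def bits_needed(x0, n_steps, tol=Fraction(1, 2**10)):
    """Smallest number of binary digits of x0 keeping the error below tol after n_steps."""
    for bits in range(1, 200):
        x_true, x_approx = x0, Fraction(int(x0 * 2**bits), 2**bits)  # truncate x0 to `bits` digits
        for _ in range(n_steps):
            x_true, x_approx = step(x_true), step(x_approx)
        d = abs(x_true - x_approx)
        if min(d, 1 - d) < tol:  # distance on the circle
            return bits
    return None

random.seed(0)
x0 = Fraction(random.getrandbits(200), 2**200)  # a "generic" initial condition
for n in (5, 10, 20, 40):
    print(n, bits_needed(x0, n))  # grows roughly linearly in n
```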
Anyway, digging into the paper now...
1. Oh, I see it’s likewise concerned with the description length of states:
Gács [23] defines the coarse-grained algorithmic entropy of any individual state: roughly speaking, it is the number of bits of information that a fixed computer needs in order to identify the state’s coarse-grained cell. For example, a state in which all particles are concentrated in one location would have low entropy, because the repeated coordinates can be printed by a short program. If the coarse graining in question is Markovian, then Levin’s [24] law of randomness conservation says that the algorithmic entropy seldom decreases. In physical terms, we will come to see this as a vast generalization of the second law of thermodynamics.
2. The way the paper justifies the second law of thermodynamics is neat.
My understanding of that
Suppose the microstate of a system is defined by a (set of) infinite-precision real numbers, corresponding to e. g. its coordinates in phase space.
We define the coarse-graining as a truncation of those real numbers: i. e., we fix some degree of precision.
That degree of precision could be, for example, the Planck length.
At the microstate level, the laws of physics may be deterministic and reversible.
At the macrostate level, the laws of physics are stochastic and irreversible. We define them as a Markov process, with transition probabilities P(x,y) defined as “the fraction of the microstates in the macrostate x that map to the macrostate y in the next moment”.
Over time, our ability to predict what state the system is in from our knowledge of its initial coarse-grained state + the laws of physics degrades.
Macroscopically, it’s because of the properties of the specific stochastic dynamics we have to use (this is what most of the paper is proving, I think).
Microscopically, it’s because ever-more-distant decimal digits in the definition of the initial state start influencing the dynamics ever more strongly. (See the multibaker map in Appendix A, the idea of “microscopic mixing” in a footnote, and also apparently Kolmogorov-Sinai entropy.) A toy numerical version of this coarse-graining setup is sketched just after this list.
That is: in order to better pinpoint farther-in-time states, we would have to spend more bits (either by defining more fine-grained macrostates, or maybe by locating them in the execution trace).
Thus: stochasticity, and the second law, are downstream of the fact that we cannot define the initial state with infinite precision.
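To make the coarse-graining picture above concrete, here is a minimal numerical sketch (my own; the doubling map stands in for the deterministic microdynamics and 2^k equal cells stand in for the truncation, so this is not the paper’s multibaker construction):

```python
# Coarse-grain a deterministic chaotic map on [0,1) into 2**k cells, estimate the
# stochastic matrix P[x][y] = fraction of microstates in cell x landing in cell y,
# and watch the Shannon entropy of the coarse-grained distribution increase.
import numpy as np

def the_map(x):
    return (2.0 * x) % 1.0  # deterministic "microdynamics" (doubling map)

k = 6
n_cells, samples_per_cell = 2**k, 1000
rng = np.random.default_rng(0)

P = np.zeros((n_cells, n_cells))
for x in range(n_cells):
    micro = (x + rng.random(samples_per_cell)) / n_cells    # microstates inside cell x
    landing = (the_map(micro) * n_cells).astype(int)        # cells they evolve into
    P[x] = np.bincount(landing, minlength=n_cells) / samples_per_cell

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

dist = np.zeros(n_cells)
dist[0] = 1.0  # start with perfect coarse-grained knowledge: all mass in one cell
for t in range(10):
    print(t, round(entropy_bits(dist), 2))  # climbs ~1 bit/step, saturating near k
    dist = dist @ P
```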
3. The part about incomputability being necessary is also interesting, metaphysically.
Why must it be impossible to prove lower bounds on Kolmogorov complexity?
So, Kolmogorov complexity is upper-semicomputable. This means that, for some x:
You can prove an upper bound on K(x), just by finding a program that computes x.
You can only prove a lower bound on K(x) using a program p with K(p)>K(x). Meaning, you can’t use any fixed-size program (or formal system) to prove arbitrarily high complexity.
Imagine if it were otherwise, if some p much smaller than K(x) could prove a lower bound on K(x). Then you could use that p to cheaply pinpoint x: set up a program that goes through strings in order, uses p to prove lower bounds on their complexity, and outputs the first string whose proven complexity is above a threshold. That would simultaneously function as an upper bound: since our small program was able to output that string, its complexity can’t be much higher than K(p) plus the few bits needed to specify the threshold. (This construction is sketched a few lines below.)
Thus, in order for arbitrarily complex states/programs to exist, it must be impossible to prove that they are complex.
Why? Why does that have to be the case?
Intuitively, it’s because “proving” complexity requires pointing at specific features of the state x and explaining why exactly they are complex. That is, your formal language must be expressive enough to precisely talk about those features, in their full detail. If, however, you can get away with using some abstractions/generalizations to prove x’s complexity, that by definition decreases x’s complexity.
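The construction above, written out in Python; `proves_high_complexity` is a hypothetical oracle of my own naming, and the point is precisely that no sound, total version of it can exist:

```python
# If a fixed-size procedure could prove arbitrarily high lower bounds on K, this
# short wrapper would output a provably-complex string while itself being short,
# contradicting the bound it relies on.
from itertools import count

def proves_high_complexity(s: bytes, threshold: int) -> bool:
    """HYPOTHETICAL: True iff some fixed formal system proves K(s) > threshold."""
    raise NotImplementedError("no sound, total procedure like this can exist")

def all_strings():
    # Enumerate all byte strings in length-lexicographic order.
    yield b""
    for length in count(1):
        for i in range(256**length):
            yield i.to_bytes(length, "big")

def first_provably_complex(threshold: int) -> bytes:
    # Search strings in order; return the first whose complexity provably exceeds the threshold.
    for s in all_strings():
        if proves_high_complexity(s, threshold):
            return s

# If the oracle existed, first_provably_complex(n) would output a string s with a
# proof that K(s) > n, yet this whole program (plus ~log2(n) bits for n) is a short
# description of s, so K(s) is small -- a contradiction for large enough n.
```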
Impromptu poll: is structuring long-form comments this way, with collapsibles for topics, convenient, or should I have just used titles? Please react with thumbs up/down to the following statement: “collapsibles good”.
All that said,
But this smells really promising to me [...] as a more principled way to tackle bounded rationality of embedded systems.
I’m curious what you have in mind here. I’ve kind of been treating my thinking on those topics as basically recreational/a guilty pleasure. The possibility that there’s something actually useful here interests me.
Since that would allow its emulators to be common across Tegmark IV, which would, in turn, give a bump to any algorithms simple in its terms.
More specifically, the first occurrence of that observation.
What I have in mind re:boundedness...
If we need to use a Turing machine which is roughly equivalent to physics, then a natural next step is to drop the assumption that the machine in question is Turing complete. Just pick some class of machines which can efficiently simulate our physics, and which can be efficiently implemented in our physics. And then, one might hope, the sort of algorithmic thermodynamic theory the paper presents can carry over to that class of machines.
Probably there are some additional requirements for the machines, like some kind of composability, but I don’t know exactly what they are.
This would also likely result in a direct mapping between limits on the machines (like e.g. limited time or memory) and corresponding limits on the physical systems to which the theory applies for those machines.
The resulting theory would probably read more like classical thermo, where we’re doing thought experiments involving fairly arbitrary machines subject to just a few constraints, and surprisingly general theorems pop out.
Attempted abstraction and generalization: If we don’t know what the ideal UTM is, we can start with some arbitrary UTM U1, and use it to predict the world for a while. After (we think) we’ve gotten most of our prediction mistakes out of the way, we can then look at our current posterior, and ask which other UTM U2 might have updated to that posterior faster, using fewer bits of observation about (our universe/the string we’re predicting). You could think of this as a way to define what the ‘correct’ UTM is. But I don’t find that definition very satisfying, because the validity of this procedure for finding a good U2 depends on how correct the posterior we’ve converged on with our previous, arbitrary U1 is. ‘The best UTM is the one that figures out the right answer the fastest’ is true, but not very useful.
Is the thermodynamics angle gaining us any more than that for defining the ‘correct’ choice of UTM?
We used some general reasoning procedures to figure out some laws of physics and stuff about our universe. Now we’re basically asking what other general reasoning procedures might figure out stuff about our universe as fast or faster, conditional on our current understanding of our universe being correct.
I think that’s roughly correct, but it is useful...
Another way to frame it would be: after one has figured out the laws of physics, a good-for-these-laws-of-physics Turing machine is useful for various other things, including thermodynamics. ‘The best UTM is the one that figures out the right answer the fastest’ isn’t very useful for figuring out physics in the first place, but most of the value of understanding physics comes after it’s figured out (as we can see from regular practice today).
Also, we can make partial updates along the way. If e.g. we learn that physics is probably local but haven’t understood all of it yet, then we know that we probably want a local machine for our theory. If we e.g. learn that physics is causally acyclic, then we probably don’t want a machine with access to atomic unbounded fixed-point solvers. Etc.
I agree that this seems maybe useful for some things, but not for the “Which UTM?” question in the context of debates about Solomonoff induction specifically, and I think that’s the “Which UTM?” question we are actually kind of philosophically confused about. I don’t think we are philosophically confused about which UTM to use in the context of us already knowing some physics and wanting to incorporate that knowledge into the UTM pick; we’re confused about how to pick if we don’t have any information at all yet.
I think roughly speaking the answer is: whichever UTM you’ve been given. I aim to write a more precise answer in an upcoming paper specifically about Solomonoff induction. The gist of it is that the idea of a “better UTM” U_2 is about as absurd as that of a UTM that has hardcoded knowledge of the future: yes, such UTMs exist, but there is no way to obtain them without first looking at the data, and the best way to update on data is already given by Solomonoff induction.
I also talked to Aram recently & he’s optimistic that there’s an algorithmic version of the generalized heat engine where the hot vs cold pools correspond to high vs low K-complexity strings. I’m quite interested in doing follow-up work on that.
Yes! I expect the temperatures won’t quite be proportional to complexity, but we should be able to reuse the thermodynamic definition of temperature as a derivative of entropy, which we’ve now replaced by K-complexity.
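For reference, the classical definition being reused is $1/T = \partial S/\partial E$; my guess at the form of the algorithmic analogue (an assumption on my part, not a claim about the paper) would be something like

$$\frac{1}{T_{\text{alg}}} \;\approx\; \frac{\Delta K(x)}{\Delta E},$$

i.e., how many extra bits of description a state picks up per unit of energy added.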
But also, even with universality, (algorithmic) Jeffrey-Bolker preference remains dependent on the machines that define its two algorithmic priors (it assigns expected utility to an event as the ratio of two probability measures of that event, using two different priors on the same sample space).
This suggests that the choice of machines in algorithmic priors should be meaningful data for the purposes of agent foundations. It also gives some sort of explanation for how agents with different preferences tend to arrive at the same probabilities of events (after updating on a lot of data), and so agree on questions of fact, while still keeping different preferences and endorsing different decisions, and so disagree on questions of normativity.
This just talks about the bits of program available in our physics’ subroutine of a simulation tree, rather than about a universal convergence across Teg 4, right?
(probably the bit it does cover is the useful bit; I’ve just been wishing for some convergent UTM for the multiverse for philosophical satisfaction for a while)
Yeah, I’m not convinced that the problem of induction is solvable at Teg 4. However, universes with similar primitive laws and operations to ours will tend to produce intelligences with similar built-in priors. Thus, the right UTM to use is in a sense just the one that you happen to have in your possession.
Yeah, I mostly think that this is where it ends up, but it would be so neat if there was convergence.
A proof of exactly why that’s not an option might also be similarly satisfying/enlightening.
Is this wish compatible with not throwing away a free lunch?
It’s settling on a universal price for each lunch, rather than just subjective ones depending on which lunch you’re near.
I would have just answered “It depends on what you want to do”, with there being no single best prior/Universal Turing Machine, because of theorems like the No Free Lunch theorem (and, more generally, a takeaway from learning/computational theories is that there is no one best prior that is always justified, contrary to the ancient philosophers’ hopes).
Then you would have been wrong. No Free Lunch Theorems do not bind to reality.
I will propose an answer to No Free Lunch in an upcoming paper about Solomonoff induction. It is indeed subtle and important. In the interim, Schurz’ book “Hume’s Problem Solved” is a pretty good take. Schurz and Wolpert seem to argue against each other in their writing about NFL; I’ll explain later why I think they’re both right.
For a concrete answer on what the reference machine or low-level language should be, please see this 10-minute live-discussion only about the choice of the reference machine, starting at minute 20 and ending at minute 30: https://www.youtube.com/live/FNfGoQhf2Zw?si=Pg1ppTZmlw1S-3g9&t=1206
After one hour and 18 minutes, I spend another couple of minutes answering a question about the reference formalism. After one hour and 30 minutes into the video, someone asked me whether space aliens would agree with lambda calculus.
And in my paper, I have a 3-page discussion on the choice of the reference machine, Section 3.2: https://arxiv.org/pdf/2506.23194
The reason I did not suggest deriving a reference machine from physics is that arriving at a consensus about the laws of physics will already have required the use of Occam’s razor, common sense, or intuition. That makes the derivation seem circular, or else equivalent to choosing a simple reference machine directly based on its commonsensical simplicity in the first place, just with extra steps through physics that might be redundant, depending on what exactly Aram’s argument was about.