K-complexity is silly; use cross-entropy instead

Short version

The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.

Long version

Suppose we have a (Turing-complete) programming language $L$, and a function $f$ of the type that can be named by $L$.

For example, $f$ might be the function that takes (as input) a list of numbers, and sorts it (by producing, as output, another list of numbers, with the property that the output list has the same elements as the input list, but in ascending order). Within the programming language $L$, there will be lots of different programs that represent $f$, such as a whole host of implementations of the bubblesort algorithm, and a whole host of implementations of the quicksort algorithm, and a whole host of implementations of the mergesort algorithm. Note the difference between the notion of the function $f$ (“list sorting”) and the programs that represent it (bubblesort, quicksort, mergesort).

Recall that the Kolmogorov complexity of $f$ in the language $L$ is the length of the shortest program that represents $f$ (writing $L(f)$ for the set of programs in $L$ that represent $f$):

$$K_L(f) := \min_{p \in L(f)} \operatorname{length}(p)$$

This is often touted as a measure of the “complexity” of $f$, to the degree that people familiar with the concept often (colloquially) call a function “simple” precisely to the degree that it has low K-complexity.

I claim that this is a bad definition, and propose the following alternative:

$$\kappa_L(f) := \operatorname{nlog} \sum_{p \in L(f)} \operatorname{rexp}(\operatorname{length}(p))$$

where $\operatorname{nlog}$ denotes logarithm base $\frac{1}{2}$, aka the negative of the (base 2) logarithm, and $\operatorname{rexp}$ denotes exponentiation base $\frac{1}{2}$, aka the reciprocal of the (base 2) exponential.

(Note that we could just as easily use any other base $b$. $\frac{1}{e}$ would be a particularly natural choice, as usual. Here I’m using $\frac{1}{2}$, both because it fits with measuring the lengths of our programs in terms of bits, and because it keeps the numbers whole in our examples.)

Below, I’ll explore this latter definition, and its elegance and theoretical superiority. Then I’ll point out that our own laws of physics seem to have (comparatively) high K-complexity and low alt-complexity, thus giving empirical justification for my “correction”.

Investigation

A first observation is that the alt-complexity and the K-complexity agree whenever there is at most one program in $L$ that represents $f$. If there’s no program, then both equations are (positive) infinite. If there’s exactly one program $p$ representing $f$, then $\operatorname{length}(p)$ will be the only term in the $\min$ and the only term in the $\sum$, so the first definition will yield $\operatorname{length}(p)$ whereas the second definition will yield $\operatorname{nlog}(\operatorname{rexp}(\operatorname{length}(p)))$, but $\operatorname{nlog}$ and $\operatorname{rexp}$ are inverses, so both definitions yield $\operatorname{length}(p)$. Thus, the definitions only differ when $f$ has multiple programs in the language $L$.

In that case, the alt-complexity will be lower than the K-complexity, as you may verify. As a simple example, suppose there are two different programs that represent $f$, both of length 17. Then the K-complexity of $f$ is 17 bits, whereas the alt-complexity of $f$ is

$$\operatorname{nlog}\big(\operatorname{rexp}(17) + \operatorname{rexp}(17)\big) = \operatorname{nlog}\big(2^{-16}\big) = 16 \text{ bits.}$$

According to alt-complexity, having two programs (of the same length) that represent $f$ is just as good as having a single program that’s one bit shorter. By a similar token, having 256 programs that are each $n$ bits long is (according to alt-complexity but not K-complexity) just as good as having a single program that’s $n - 8$ bits long.
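Here is a minimal sketch of both measures, assuming we are simply handed the list of lengths of the programs that represent $f$ (in reality that list is infinite, and the function names below are mine, not standard):

```python
import math

def k_complexity(lengths):
    """K-complexity: the length (in bits) of the shortest program representing f."""
    return min(lengths)

def alt_complexity(lengths):
    """Alt-complexity: nlog of the summed rexp of every program length for f."""
    return -math.log2(sum(2.0 ** -n for n in lengths))

# Two 17-bit programs are worth exactly one 16-bit program:
print(k_complexity([17, 17]), alt_complexity([17, 17]))  # 17 16.0

# 256 programs of length n are worth exactly one program of length n - 8:
n = 40
print(alt_complexity([n] * 256))  # 32.0
```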

Why might this make sense? Well, suppose you’re writing a program that (say) renders a certain 3D scene. You have to make some arbitrary choices to do the rendering, like choosing which point is the 0 point (the center of the room? one of the eight corners?), and how to orient the axes (mathematicians and video game devs have different Y-axis conventions, last I checked), and so on. Should these arbitrary choices count against the complexity of your code? They definitely make your code longer, but it’s not obvious that they make your code “fundamentally more complicated” in the way that extra if statements and for loops do. In attempts to formalize this intuition, we might object: yes, we had to make some arbitrary choices in order to render the scene, but these choices don’t matter: for every possible choice of these conventions, there’s a similarly-short program that renders my scene.

(Now, if you want to talk about the complexity of your code relative to a library that has hard-coded in some particular conventions, then it will be simpler to follow the same conventions, because only in that case do you get to avoid writing the conversions. But if we take your scene-rendering code and the surrounding libraries as a whole package, then the point holds: for every choice of conventions, there’s similarly-simple code that renders the same scene.)
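As a toy illustration of that intuition (the scene format and helper below are made up for this post, not any real rendering library): descriptions that differ only in their choice of origin are the same length, and pick out the same scene once the convention is quotiented out.

```python
# The same toy scene, written down under two different origin conventions.
scene_centered = {"origin": "room_center",
                  "boxes": [(-1.0, 0.0, -1.0), (1.0, 0.0, 1.0)]}
scene_cornered = {"origin": "southwest_corner",
                  "boxes": [(1.0, 0.0, 1.0), (3.0, 0.0, 3.0)]}  # every box shifted by (+2, 0, +2)

def quotient_out_convention(scene):
    """Re-express each box relative to the first box, erasing the origin choice."""
    ref = scene["boxes"][0]
    return [tuple(c - r for c, r in zip(box, ref)) for box in scene["boxes"]]

# Both conventions describe the same convention-free scene, and neither
# description is meaningfully longer than the other.
assert quotient_out_convention(scene_centered) == quotient_out_convention(scene_cornered)
```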

Cross-entropy

A second observation is that the alt-complexity of $f$ in the language $L$ is precisely the cross-entropy of the distribution $L$ relative to the distribution $f$.

“Wait, what?” you protest. “$L$ is a language, and $f$ is a function!”

Yep, but we can promote both $L$ and $f$ to distributions in natural ways, and once we do, we see that this notion of alt-complexity is exactly cross-entropy.

We turn a programming language $L$ into a probability distribution $P_L$ on functions by saying that the probability $P_L(f)$ of a function $f$ is the probability that a randomly-generated program is a representative of $f$:

$$P_L(f) := \sum_{p \in L(f)} \operatorname{rexp}(\operatorname{length}(p))$$

(Note that this equation (probably?) works when you’ve formalized your notion of “programming language” and the corresponding notion of “length” in convenient ways. If you didn’t pick the most convenient definitions, you might need to tweak this equation, but I don’t expect those tweaks to change the story much.)

Equivalently, the probability of a function $f$ in a programming language $L$ is taken to be the probability that Solomonoff induction assigns to $f$ when using the language $L$.

We turn a function $f$ into a probability distribution $\delta_f$ on functions by taking the probability distribution that assigns 100% probability to $f$ (and 0% probability to every other function).

Then the cross-entropy of $L$ relative to $f$ is defined as

$$H(\delta_f, P_L) := \sum_{g} \delta_f(g)\, \operatorname{nlog} P_L(g) = \operatorname{nlog} P_L(f) = \operatorname{nlog} \sum_{p \in L(f)} \operatorname{rexp}(\operatorname{length}(p)),$$

which is just the alt-complexity of $f$ (in the language $L$).

So if our notion of “complexity” takes all the programs for $f$ into account, instead of just the shortest one, then it says that the complexity of a function $f$ is just the cross-entropy of Solomonoff induction when $f$ is the ground truth.

In other words: the alt-complexity of $f$ is the degree of surprise (in bits) that Solomonoff induction (predicting a function, using your chosen programming language) would experience if $f$ were true.

Solomonoff induction

Solomonoff induction also offers a vote in favor of alt-complexity over K-complexity.

I occasionally hear people say that Solomonoff induction concentrates probability-mass on the hypothesis with lowest K-complexity among those hypotheses that fit the data. This is false. Solomonoff induction concentrates probability-mass on the hypothesis with lowest alt-complexity among those hypotheses that fit the data.

Low alt-complexity happens to coincide with low K-complexity pretty often, but whenever it doesn’t, Solomonoff induction prefers alt-simplicity to K-simplicity. To see this, consider the case where a single 4-bit code predicts a 0, and three 5-bit codes predict a 1, and all other codes have been ruled out by the data (or are too long to make any difference to the current calculation). In this case, observe that the probability Solomonoff induction places on 1 exceeds the probability that Solomonoff induction places on 0, precisely because the alt-simplicity of 1 exceeds the alt-simplicity of 0.[1]
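Spelled out numerically (a quick sanity check of that example, pretending these four codes are the only surviving hypotheses):

```python
import math

# One surviving 4-bit code predicts 0; three surviving 5-bit codes predict 1.
mass_0 = 2.0 ** -4        # 1/16 = 2/32 of codespace
mass_1 = 3 * 2.0 ** -5    # 3/32 of codespace

print(mass_1 > mass_0)     # True: Solomonoff induction puts more mass on 1
print(-math.log2(mass_0))  # alt-complexity of "0": 4.0 bits
print(-math.log2(mass_1))  # alt-complexity of "1": ~3.415 bits (lower, i.e. simpler)
```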

Elegance

When I noticed the intuition that you shouldn’t be dinged for making an arbitrary choice-of-convention in your code, if every convention yields a similarly-short program, I had the thought that there should obviously be some way to combine all of the lengths of all of the programs that represent a given function. The “nlog-sum-rexp” formula above is the result of figuring out the most natural-feeling weighting. Like, you can’t just sum together all the lengths, as that would penalize a function for having more programs, which isn’t what we want. So, follow the intuition that two 17-bit programs should be just about as good as one 16-bit program, and then ask how we must be combining lengths. Program-lengths are (canonically) measured in bits, so by exponentiating them we get something that’s intuitively more like the “fraction of codespace” that’s taken up by that program (this intuition is exact if we’re using prefix-free codes). These are non-overlapping, and so we can sum them, and then take the nlogarithm to get back to bits of complexity.
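To make the “fraction of codespace” picture concrete (with a made-up four-word prefix-free code): each codeword claims the set of infinite bitstrings that begin with it, those regions don’t overlap, and each region’s size is exactly the rexp of the codeword’s length, which is what licenses summing them.

```python
from itertools import product

# A toy prefix-free code: no codeword is a prefix of any other.
codewords = ["0", "10", "110", "111"]

# Each codeword's "fraction of codespace" is rexp(length) = 2^-length.
fractions = {w: 2.0 ** -len(w) for w in codewords}
print(fractions)                # {'0': 0.5, '10': 0.25, '110': 0.125, '111': 0.125}
print(sum(fractions.values()))  # 1.0 (Kraft inequality: always <= 1 for prefix-free codes)

# Disjointness check: every 3-bit string extends exactly one codeword,
# so no region of codespace gets counted twice in the sum.
for bits in product("01", repeat=3):
    assert sum("".join(bits).startswith(w) for w in codewords) == 1
```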

As an interesting historical note, it was only after that sequence of thoughts that I noticed that I’d re-invented both Solomonoff induction and (a special case of) cross-entropy.

Of course, having familiarity with both those concepts probably helped me have the above sequence of thoughts as rapidly as I did, but this still felt to me like evidence of elegance: K-complexity is clearly a bit inelegant, and if you fix it in the way that it’s begging to be fixed, you land squarely on other useful battle-hardened concepts.

Empiricism

But perhaps this evidence of elegance is lost on you (as someone who didn’t first work out the obvious correction to K-complexity, and then notice with personal delight that you’d re-invented cross-entropy; or as someone who doesn’t have pre-existing respect for Solomonoff induction). Or perhaps you just delight in more overwhelming evidence.

In that case, I direct your attention to the laws of physics.

As you may have heard, the laws of physics are relativistic. The most naive method of describing a physical configuration involves some arbitrary choices, such as a choice of where the origin is (is it between my eyes, or between yours?), and a choice of orientation, and a choice of velocities. That’s a lot of extra data! Fortunately for us, relativity guarantees that the universe can be described no matter which conventions we choose. The laws of physics are the same no matter what location we declare to be the “origin”, and no matter what trajectory we claim is “at rest”. A naive computer program that simulates classical relativistic physics makes all sorts of arbitrary choices, but for every way of making those choices there’s a way of describing the universe such that the program still works, and so those arbitrary choices don’t really count against us.

Or, at least, they don’t really count against us if we expect our universe to have low alt-complexity. Contentless choices of convention do count against K-complexity!

“Hold on”, you might protest. “In a naive representation of a classical relativistic universe, you have to choose an origin-point. But there are other representations that don’t contain the extra coordinate. For example, instead of saying each particle’s position relative to some hallucinated “origin point”, I can tell you their positions relative to each other, and thus remove the redundancy. Perhaps this sort of thing can always be done when there’s a redundancy, such that physics does in fact have a K-complexity that’s about as low as its alt-complexity.”

That’s a fine hypothesis! Having stated it, you’re presumably prepared to update against it, in the face of contrary evidence. Of which there’s plenty.

For one thing, even if you’re giving the positions of particles relative to other particles (rather than to an origin), you still have a whole host of arbitrary choices to make, such as what order to walk through the particles in when you’re producing all these relative positions.

For another thing, the laws of physics just seem to be really very adamant about the idea that reality can only be specified relative to some arbitrary choices-of-convention, with the property that physics works no matter which arbitrary choice you make, but that you do have to make some choice. If I understand my physics correctly, this is more-or-less one of the core ideas at the heart of gauge theory (though, caveat, my grasp on gauge theory is more tenuous than my grasp on basic classical and quantum mechanics).

Like, very roughly speaking, when you start trying to make quantum mechanics play nice with relativity, the laws of physics glance at the idea that you might need to specify an origin point, scoff, and then ask you to also specify a continuous function from spacetime to the unit circle. As a warm-up. (It’s the U(1) part of the SU(3) × SU(2) × U(1) symmetry, if I understand correctly.)

And—if I understand correctly, and again noting that my grasp on gauge theory is somewhat tenuous—you can’t get away from picking some gauge, and this fact has real physical effects, such as photons.[2]

If I’m understanding it correctly, physics really goes ham on the idea that you’ve gotta make a whole lot of arbitrary choices before you can start describing a physical system at all. These choices wash out (in the sense that you can describe your system similarly easily ~no matter which choice you make), which is why I call them arbitrary, but it sure does look like physics itself is giving a strong vote in favor of alt-complexity over K-complexity.

In other words: insofar as you buy Occam’s razor, which says that reality itself is supposed to turn out to be ‘simple’, reality itself gets some say in what counts as ‘simple’. And it sure looks to me like reality has a lot more alt-simplicity than it has K-simplicity. And so K-complexity is just not a very good measure of the “simplicity” that actual physics actually possesses.


ETA: Various folk in the comments have pointed out that the K-complexity and the alt-complexity differ by an amount bounded by a constant $c$ (that depends on $L$ but not on $f$). This is cool, and a stronger relationship than I had known, but (to state the obvious) it doesn’t much undermine the point that our laws of physics seem to prefer alt-complexity to K-complexity.

For instance, suppose (very generously) that the constant $c$ is 10 bits. This would mean that no reality can get more than 1024 times as much probability-mass concentrated in an enormous cluster of long programs (such as “our laws of physics, with an arbitrary gauge hard-coded”) as in some single shorter program (perhaps one that just iterates through all gauges, performing some kind of correctness check).
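Spelling out the arithmetic behind that (my reading of the commenters’ bound, in the notation from above):

$$K_L(f) - \kappa_L(f) \le c \quad\Longrightarrow\quad \operatorname{rexp}(\kappa_L(f)) \le 2^{c} \cdot \operatorname{rexp}(K_L(f)),$$

so with $c = 10$ the total probability-mass on $f$ (summed over all of its programs) is at most $2^{10} = 1024$ times the mass of its single shortest program.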

That’s interesting, but it leaves open an empirical question of whether the actual world we find ourselves in is in fact one with most of its support coming from a single short program. Like, at least 1/1025th of reality’s support must come from a single short program, but some realities will have 99% of their support coming from a single short program, whereas others will flirt with the ~0.1% boundary. Which sort of reality are we dealing with?

When we look around, we see a reality that seems to be better-described by an enormous regular cluster of long programs, by a hefty factor (perhaps 512, in this example).

We could have seen reality be almost as K-simple as it is alt-simple, but in fact it looks to be significantly more K-complex than it is alt-complex. That seems to me like empirical evidence validating the theory (and intuition) that we should think in terms of program-clusters rather than in terms of shortest-programs.

Or in other words: sure, the K-complexity and the alt-complexity of the universe differ by at most an additive constant that depends only on $L$ and not on the universe, but “you will get no more than $c$ bits of empirical evidence favoring alt-complexity over K-complexity” leaves open the possibility of getting quite a few bits, and that looks to me like what happened.


Conjectures

I’ve long felt that algorithmic information theory is kinda janky and annoying. I conjecture that this is because it’s using K-complexity instead of alt-complexity as one of its central notions.

For instance, recall that Solomonoff induction doesn’t converge on the hypothesis with lowest K-complexity. It converges on the hypothesis with lowest alt-complexity instead. If you want to prove something like “Solomonoff induction converges on the hypothesis with lowest K-complexity”, you’ll have to construct some sort of awkward repeated game that’s somehow driving a wedge between the shortest program and all the other competing programs, and then argue something about how the difference eventually gets arbitrarily small, or something. Which is tedious and kinda insane, compared to the theorem saying that Solomonoff induction converges on the hypothesis with lowest alt-complexity, which is practically by-definition.

I haven’t looked much at algorithmic information theory texts since developing the idea of alt-complexity, but if my vague memories serve, lots of the theorems in algorithmic information theory felt to me like they were tedious and kinda insane in this way. I think there’s a decent chance that a variety of theorems in algorithmic information theory could be cleaned up by replacing K-complexity with alt-complexity.

(This should at least be true of theorems relating “algorithmic entropy” to the notion of entropy used in other parts of information theory and in physics. Alt-complexity has a very simple and clean relationship to entropy, as noted above; the relationship between K-complexity and entropy is much more fraught, and probably requires some repetition-based wedge-driving crap.)

EDIT: see interstice’s comment here; K-complexity and alt-complexity differ by at most a constant (that depends only on $L$ and not on $f$), so the conversions can’t be that bad, and I now doubt that the difference accounts for as much annoyance as I originally conjectured. (Though it still seems plausible to me that various theorems are obscured by the use of K-complexity instead of alt-complexity.)

Acknowledgements

I have been using a vague/implicit concept of alt-complexity in my own notes for a number of years, but didn’t develop the idea explicitly until a recent conversation with Adam Scherlis right here on LessWrong a couple months ago. (You’ll have to click ‘see in context’ to see the whole comment thread.)

As an aside, I recommend reading that comment thread if you’re interested in watching two people with different intuitions go back-and-forth until they successfully transmit their points to each other, and come away with more understanding (and cooler ideas) than either went in with.

(The primary thing we figured out together was not so much alt-complexity, as what “entropy” should mean and how it relates to complexity, but I also refined the concept of alt-complexity in that interaction. Thanks Adam!)

Takeaways

I have personally abandoned the concept of K-complexity, in favor of alt-complexity. I recommend it.

In my own notes, I call this new concept “complexity”. Sometimes I call it “$L$-complexity”, where $L$ is the programming language, when I want to make that dependence explicit. If “complexity” isn’t specific enough to be unambiguous, I encourage commenters to brainstorm alternative names, before “alt-complexity” sticks.


  1. ↩︎

    K-complexity performs better when the programming language is a language of probabilistic programs, rather than deterministic ones. That said, this is basically just a way of bringing K-complexity closer to the correct notion of alt-complexity, and the pathologies discussed above can still be exhibited in the case of probabilistic programs.

    Furthermore, if you’re working with alt-complexity instead of K-complexity, there’s less need to work with probabilistic programs. You can just work with collections of deterministic programs instead. These views can be brought closer together by thinking of computer programs not as finite codes, but as infinite bitstrings, which we think of as a prefix-free code for a finite program plus a random seed. But (a) that would be a digression from my main point, and also (b) I’m still a bit confused about something to do with representations of programs in Solomonoff induction, so that’s all I’ll say on the topic here.

  2. ↩︎

    Here’s my current understanding of the story, as I shall now briefly attempt to render into (local) layperson’s terms. I enthusiastically solicit corrections from people who understand the theory well enough to critique me.

    Quantum mechanics tells us that the nature of reality is a (certain sort of well-behaved) function from possible-configurations-of-the-entire-universe to complex numbers (called the “amplitude” of the configuration).

    The laws of physics are phase-invariant, in the sense that if we rotate all of those complex numbers by the same fixed angle, then this doesn’t change anything.

    This picture does a great job at predicting certain atomic processes, but it’s not very compatible with relativity. Because, like, once you’ve thrown simultaneity out the window, concepts like “ways that the universe can be configured across all of space in a single instant” are looking pretty shaky.

For some reason I haven’t understood yet, some madlad physicist was like “ok, but what if we postulate some sort of superpowered version of phase invariance, where the angle that we’re rotating each of the complex numbers by depends on where things are in space.” By doing something like(?) choosing a function from spacetime to the unit circle U(1), and taking each configuration of $n$ particles, and multiplying its amplitude by the different points on U(1) given by the map at the different particle positions (according to this configuration)? I think? So far I’ve mostly stared at simple versions of the equation with 1-particle systems, and haven’t managed to understand the texts about the more general case here, and it’s also plausible to me that there’s $n$ different functions from spacetime to U(1), or that I’ve totally misunderstood what I’m doing. Clarification is welcome. But the point is, we can hypothesize some sort of function(s) to U(1), and we can somehow imagine perturbing the wave function not by a uniform phase-change, but by a spacetime-dependent phase change.

    And you might hope that the laws of physics would be invariant under these weird spacetime-dependent phase changes.

But that’s also kinda ridiculous, right? Like, it’s one thing to have laws of physics that don’t depend on whether you think the center of the world is located in New York or Chicago or LA. But it’s another thing entirely to say that I’m allowed to give you an arbitrary topological map of the universe where I get to make up all the distances according to my whims; surely the laws of physics shouldn’t be invariant to that. Like, if I arbitrarily declare that the distance from New York to Chicago is a million distance-units, and the distance from Chicago to Los Angeles is sixteen distance-units, then the laws of physics must say that a package travelling down a frictionless vacuum-tube from New York to Los Angeles must experience a dramatic spontaneous slowdown sometime after crossing Chicago, for no apparent reason, contra normal laws like “momentum is conserved”. You can’t expect physics to be invariant if you start smacking its parameters differently in different places, right? That should have visible effects, in regions where your arbitrary choices change!

    And, like, if you actually take a perfectly good quantum system of electrons and smack it with a (non-constant) map (from spacetime) to U(1), then the electrons will indeed do some jerking-around, when they move through regions where the map changes. So we can tell the difference between perturbing phases in this way, and not perturbing phases in this way.

    Or, at least, we would be able to tell the difference, so long as there wasn’t some other sort of particles that reacted to the map in a sort of “equal and opposite” way. Such that, like, you think the map to U(1) is constant but there’re some other particles floating around that bop the electrons causing them to jerk, and I think that the map to U(1) is non-constant and there are no such particles.

    And so this theory can be empirically tested! Quick, look around: do you notice electrons jerking around? Like, are the electrons in your photoreceptors jiggling such that you can see anything at all?

Because that second field of particles totally turns out to exist! It’s the photon field! And it’s coupled to the electron field in just such a way that the laws of physics are invariant under spacetime-dependent phase-changes; everyone agrees about what happens, but different observers (with different maps to U(1)) disagree about which electron-jerks were caused by photons!

    The laws of physics are apparently so adamant about being invariant under a spacetime-dependent change-of-phase that reality would rather bolt on a whole new breed of particle (whose presence in any particular case different people can disagree about) than violate this wacky invariance.

    (From which we learn something about the language in which physics is itself simple: it is a language in which particle-fields are cheaper than symmetry-violations, even for symmetries that (so far) seem pretty wacky to me.)

    What does this mean, for our purposes? Well, if I understand correctly, there is not (in general) a way to choose the map to U(1) such that there seem to be no photons. (You can probably choose a map to U(1) such that any given photon looks ‘fictitious’, but you can’t in general make them all disappear simultaneously.) My guess is that there’s also not a canonical way to choose “the” constant map to U(1), similar to how you can’t just choose “the” rest frame in relativity: somebody’s gotta pick out the basis vectors, and picking them out involves choosing quite a lot of data.

    I again caveat that my grasp on gauge theory is rather tenuous, and I solicit better explanations from people who have done more than just staring at the quantum electrodynamic Dirac Lagrangian for a few hours.