Looking at this again, I’m not sure I understand the two confusions. P vs. NP isn’t about functions that are hard to compute (they’re all polynomially computable); rather, it’s about functions that are hard to invert, or pairs of easily computable functions that are hard to prove equal or unequal to each other. The main difference between circuits and Turing machines is that circuits are finite and take a bounded amount of time to compute, whereas the halting time of a general Turing machine is famously impossible to determine. There’s nothing special about Boolean circuits: they’re an essentially complete model of what can be computed in polynomial time (modulo technicalities).
looks like you referenced the same paper before me while I was making my comment :)
Yeah, I think this is a good place to probe assumptions, and it’s probably useful to form world models where your probability of P = NP is nonzero (I also like doing this for the inconsistency of logic). I don’t have an inside view, but I like Scott Aaronson’s take on this: https://www.scottaaronson.com/papers/pnp.pdf:
Kinda silly to do this with an idea you actually care about, especially if it’s political (which would just increase the heat:light ratio in politics, along the grain for Russian troll factories etc.). But carefully trying to make NN traps with some benign and silly misinformation—e.g. “whales are fish” or something—could be a great test to see if weird troll-generated examples on the internet can affect the behavior of trained models.
Maybe I’ll add two addenda:
- It’s easy to confuse entropy with free energy. Since energy is conserved, globally the two measure the same thing. But locally, the two decouple, and free energy is the more relevant parameter here. Living processes often need to use extra free energy to prevent the work they are interested in doing from getting converted into heat (e.g., when moving we’re constantly fighting friction); in this way we’re in some sense locally increasing free energy.
- I think a reasonable (though imperfect) analogy here is with potential energy. Systems tend to reduce their potential energy, and thus you can tell a story that, in order to avoid just melting into a puddle on the ground, life needs to constantly fight the tendency of gravitational potential energy to be converted to kinetic energy (and ultimately heat). And indeed, when we walk upright, fly, build skyscrapers, or use hydro power, we’re slowing down or modifying the tendency of potential energy to become kinetic. But this is in no sense the fundamental or defining property of life, whether we’re looking globally at all matter or locally at living beings. We sometimes burrow into the earth, flatten mountains, etc. While life both (a) can use the potential energy of other stuff to power its engines and (b) needs to at least somewhat fight the tendency of gravity to turn it into a puddle of matter without any internal structure, this is just one of many physical stories about life and isn’t “the whole story”.
-
I think one shouldn’t think of entropy as fundamentally preferred or fundamentally associated with a particular process. Note that it isn’t even a well-defined parameter unless you posit some macrostate information and define entropy as a property of a system + the information we have about it.
In particular, life can either increase or decrease appropriate local measurements of entropy. We can burn hydrocarbons or decay uranium to increase entropy, or we can locally decrease entropy by changing the reflectivity properties of Earth’s atmosphere, etc.
The more fundamental statement, as jessicata explains, is that life uses engines. Engines try to locally produce energy that does work rather than just heat, i.e., that has lower entropy compared to what one would expect from a black body. This means that they have to use free energy, which corresponds to tapping into aspects of the surrounding environment where entropy has not yet been maximized (i.e., which are fundamentally thermodynamic rather than thermostatic), and they also have to generate work which is not just heat (i.e., they can’t just locally maximize the entropy). Life on Earth mostly does this by using the fact that solar radiation is much higher-frequency than the black-body radiation associated to temperatures on Earth, and thus contains free energy (that can be released by breaking it down).
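For reference, here is the standard relation between energy, entropy, and free energy that the above implicitly uses (my addition, in Helmholtz conventions with ambient temperature T; the same point goes through with Gibbs free energy):

```latex
% Helmholtz free energy of a system at ambient temperature T:
F = U - T S
% For an isothermal process in contact with an environment at temperature T,
% the work that can be extracted is bounded by the decrease in free energy:
W_{\text{extracted}} \;\le\; -\Delta F \;=\; -\Delta U + T\,\Delta S .
% So an engine needs access to a reservoir whose entropy is not yet maximized
% (i.e., which still carries free energy), not merely a source of energy.
```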
This is awesome!
I also wouldn’t give this result (if I’m understanding which result you mean) as an example where the assumptions are technicalities / inessential for the “spirit” of the result. Assuming monotonicity or commutativity (either one is sufficient) is crucial here, otherwise you could have some random (commutative) group with the same cardinality as the reals.
Generally, I think math is the wrong comparison here. To be fair, there are other examples of results in math where the assumptions are “inessential for the core idea”, which I think is what you’re gesturing at. But I think math is different in this dimension from other fields, where often you don’t lose much by glossing over technicalities (in fact, the question of how much to fuss over technicalities like playing fast and loose with infinities, or being careful about what kinds of functions are allowed in your fields, is the main divider between math and theoretical physics).
In my experience in pure math, when you notice that the “boilerplate” assumptions on your result seem inessential, this is usually for one of the following reasons:
1. In fact, a more general result is true and the proof works with fewer/weaker assumptions, but it’s stated in this form either for historical reasons or because some of the results used (lemmas, etc.) are harder in more generality.
2. The result is true in more generality, but proving the more general result is genuinely harder or requires a different technique, and this can sometimes lead to new and useful insights.
3. The result is false (or unknown) in more generality, and the “boilerplate” assumptions are actually essential; understanding why will give more insight into the proof (despite things seeming inessential at first).
4. The “boilerplate” assumptions the result uses are weaker than what the theorem is stated with, but it’s messy to explain the “minimal” assumptions, and it’s easier to compress the result by using a more restrictive but more standard class of objects (in this way, a lot of results that are true for some messy class of functions are easier to remember and use for a more restrictive class: most results that use “Schwartz spaces” are of this form; often results that are true for distributions are stated, for simplicity, for functions, etc.).
5. Some assumptions are needed for things to “work right,” but are kind of “small”: i.e., trivial to check or mostly just controlling for degenerate edge cases, and they can be safely compressed away in your understanding of the proof if you know what you’re doing (a standard example is checking for the identity in group laws: it’s usually trivial to check if true, and the “meaty” part of the axioms is generally associativity; another example is assuming rings don’t have 0 = 1, i.e., aren’t the degenerate ring with one element).
6. There’s some dependence on logical technicalities, or on which axioms you assume (especially relevant in physics- or CS/cryptography-adjacent areas, where different additional axioms like P != NP are used; these can have different flavors which interface with proofs in different ways, but often don’t change the essentials).
I think you’re mostly talking about 6 here, though I’m not sure (and I’m not sure math is the best source of examples for this). I think there’s a sort of “opposite” phenomenon also, where a result is proven in one context but in fact generalizes well to other contexts. Often the way to generalize is standard, and thus understanding the “essential parts” of the proof in any one context is sufficient to then be able to recreate them in other contexts, with suitably modified constructions/axioms. For example, many results about sets generalize to topoi, many results about finite-dimensional vector spaces generalize to infinite-dimensional vector spaces, etc. This might also be related to what you’re talking about. But generally, I think the way you conceptualize “essential vs. boilerplate” is genuinely different in math vs. theoretical physics/CS/etc.
Nitpick, but I don’t think the theorem you mention is correct unless you mean something other than what I understand. For the statement I think you want to be true, the function also needs to be a group law, which requires associativity. (In fact, if it’s monotonic on the reals, you don’t need to enforce commutativity, since all continuous group laws on R are isomorphic.)
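For reference, here is one way to state the rigidity fact I’m invoking (my paraphrase of a classical result, not part of the original exchange):

```latex
% If \star : \mathbb{R} \times \mathbb{R} \to \mathbb{R} is a continuous group law
% (associative, with identity and inverses), then there is a homeomorphism
% \varphi : \mathbb{R} \to \mathbb{R} with
x \star y \;=\; \varphi^{-1}\bigl(\varphi(x) + \varphi(y)\bigr),
% i.e., (\mathbb{R}, \star) is isomorphic to (\mathbb{R}, +) as a topological group.
% Dropping associativity (keeping only monotonicity or commutativity of \star as a
% binary operation) destroys this rigidity, which is why the group-law hypothesis matters.
```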
This is very cool!
Toward A Mathematical Framework for Computation in Superposition
Right—looking at the energy change of the exhaust explains the initial question in the post: why energy is conserved when a rocket accelerates, despite apparently expending the same amount of fuel for every unit of acceleration (assuming small fuel mass compared to the rocket). Note that this doesn’t depend on a gravity well—the question is well posed, and well answered (by looking at the rocket + exhaust system), in classical physics without gravity. The Oberth effect is related but different, I think.
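To spell out the bookkeeping in the ground frame (my sketch, not from the post): let the rocket have mass M and speed v, and let it expel a small fuel mass dm backward at exhaust speed u relative to the rocket.

```latex
% Momentum conservation (to first order in dm):
M\,dv = u\,dm
% Kinetic-energy gain of the rocket:
d\!\left(\tfrac{1}{2} M v^2\right) \approx M v\,dv = u v\,dm
% Kinetic-energy change of the expelled fuel (its speed goes from v to v - u):
\tfrac{1}{2}\,dm\,(v-u)^2 - \tfrac{1}{2}\,dm\,v^2 = \tfrac{1}{2}\,dm\,u^2 - u v\,dm
% Total change: \; u v\,dm + \tfrac{1}{2}\,dm\,u^2 - u v\,dm = \tfrac{1}{2}\,dm\,u^2,
% independent of v: each unit of fuel releases the same chemical energy, but at higher v
% more of it ends up in the rocket and less in the exhaust.
```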
Hi! As I commented on your other post: I think this is a question for https://mathoverflow.net/ or https://math.stackexchange.com/ . This question is too technical, and does not explain a connection to alignment. If you think this topic is relevant to alignment and would be interesting to technical people on LW, I would recommend making a non-technical post that explains how you think results in this particular area of analysis are related to alignment.
Hi! I think this is a question for https://mathoverflow.net/ or https://math.stackexchange.com/ . While Lesswrong has become a forum for relatively technical alignment articles, this question is too math-heavy, and it has not been made clear how this is relevant to alignment. The forum would get too crowded if very technical math questions became a part of the standard content.
I think it’s very cool to play with token embeddings in this way! Note that some of what you observe is, I think, a consequence of geometry in high dimensions and can be understood by just modeling token embeddings as random. I recommend generating a bunch of tokens as a Gaussian random variable in a high-dimensional space and playing around with their norms and their norms after taking a random offset.
Some things to keep in mind that can be fun to check for some random vectors (a minimal numpy sketch follows the list):
- radii of distributions in high-dimensional space tend to cluster around some fixed value. For a multivariate Gaussian in n-dimensional space, it’s because the square radius is a sum of squares of Gaussians (one for each coordinate). This is a random variable with mean O(n) and standard deviation O(√n). In your case, you’re also taking a square root (norm vs. square norm) and the normalization is different, but the general pattern of this variable becoming narrow around a particular band (with width about 1/√n compared to the radius) will hold.
- a random offset vector will not change the overall behavior (though it will change the radius).
- Two random vectors in high-dimensional space will be nearly orthogonal.
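Here is a minimal sketch of the kind of check I mean, assuming numpy (the token count, dimension, and offset scale are arbitrary placeholders):

```python
# Sanity checks on random "token embeddings": norm concentration, behavior under
# a random offset, and near-orthogonality of random high-dimensional vectors.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 5000, 768                      # arbitrary sizes; dim ~ a typical embedding width
emb = rng.normal(size=(n_tokens, dim))

# 1. Norms concentrate: the std of the norm is a ~1/sqrt(dim) fraction of its mean.
norms = np.linalg.norm(emb, axis=1)
print("norm mean, std:", norms.mean(), norms.std())

# 2. Subtracting a random offset changes the typical radius but not the concentration.
offset = 5.0 * rng.normal(size=dim)
shifted = np.linalg.norm(emb - offset, axis=1)
print("shifted norm mean, std:", shifted.mean(), shifted.std())

# 3. Random high-dimensional vectors are nearly orthogonal: cosine similarities
#    cluster around 0 with spread on the order of 1/sqrt(dim).
unit = emb / norms[:, None]
cos = unit[: n_tokens // 2] @ unit[n_tokens // 2:].T
print("cosine mean, std:", cos.mean(), cos.std())
```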
On the other hand it’s unexpected that the mean is so large (normally you would expect the mean of a bunch of random vectors to be much smaller than the vectors themselves). If this is not an artifact of the training, it may indicate that words learn to be biased in some direction (maybe a direction indicating something like “a concept exists here”). The behavior of tokens near the center-of-mass also seems really interesting.
I think there is some misunderstanding of what SLT says here, and you are identifying two distinct notions of complexity as the same when in fact they are not. In particular, you have the line:
“The generalisation bound that SLT proves is a kind of Bayesian sleight of hand, which says that the learning machine will have a good expected generalisation relative to the Bayesian prior that is implicit in the learning machine itself.”
I think this is precisely what SLT is saying, and this is nontrivial! One can say that a photon will follow a locally fastest route through a medium, even if this is different from saying that it will always follow the “simplest” route. SLT arguments always work relative to a loss landscape, and interpreting their meaning should (ideally) be done relative to the loss landscape. The resulting predictions are nevertheless nontrivial, and are sometimes confirmed. For example, we have some work on this with Nina Rimsky.
You point at a different notion of complexity, associated to the parameter-function map. This also seems interesting, but it is distinct from the complexity phenomena in SLT (at least from the more basic concepts like the RLCT, the real log canonical threshold), and it is not considered in the basic SLT paradigm. Saying that this is another interesting avenue of study or a potentially useful measure of complexity is valid, but it is a priori independent of criticism of SLT (and of course ideally, the two points of view could be combined).
Note that loss landscape considerations are more important than parameter-function considerations in the context of learning. For example it’s not clear in your example why f(x) = 0 is likely to be learned (unless you have weight regularization). Learning bias in a NN should most fundamentally be understood relative to the weights, not higher-order concepts like Kolmogorov complexity (though as you point out, there might be a relationship between the two).
Also I wanted to point out that in some ways, your “actual solution” is very close to the definition of RLCT from SLT. The definition of the RLCT is how much entropy you have to pay (in your language, the change in negative log probability of a random sample) to gain an exponential improvement of loss precision; i.e., “bits of specification per bit of loss”. See e.g. this article.
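To spell out the “bits of specification per bit of loss” reading (my paraphrase of the standard volume-scaling characterization, ignoring subleading log factors):

```latex
% V(\epsilon): prior volume of weights whose loss is within \epsilon of the minimum.
V(\epsilon) \;=\; \operatorname{vol}\{\, w : L(w) - L(w^\ast) < \epsilon \,\}
\;\asymp\; c\,\epsilon^{\lambda} \quad (\epsilon \to 0),
% so, taking logs,
-\log V(\epsilon) \;\approx\; \lambda\,\bigl(-\log \epsilon\bigr) + \mathrm{const}.
% Each extra bit of loss precision (halving \epsilon) costs about \lambda bits of
% negative log prior volume; \lambda is the RLCT.
```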
The thing is, the “complexity of f” (your K(f)) is not a very meaningful concept from the point of view of a neural net’s learning (you can try to make sense of it by looking at something like the entropy of the weight-to-function mapping, but then it won’t interact that much with learning dynamics). I think if you follow your intuitions carefully, you’re likely to end up arriving precisely at something like the RLCT (or maybe a finite-order approximation of the RLCT, associated to the free energy).
I have some criticisms of how SLT is understood and communicated, but the ones you mention don’t seem that important to me. In particular, my intuition is that for purposes of empirical measurement of SLT parameters, the large-sample limit of realistic networks is quite large enough to see approximate singularities in the learning landscape, and that the distinction between SGD and Bayesian sampling is much more important than many people realize (indeed, there is no way to explain why networks on generalizable tasks like modular addition still sometimes memorize without understanding that the two are very distinct).
My main update in this field is that people should be more guided by empiricism and experiments, and less by competing paradigms of learning, which tend to be oversimplified and to fail to account for messy behaviors of even very simple toy networks. I’ve been pleasantly surprised by SLT making the same update in recent months.
Interesting—what SLT prediction do you think is relevant here?
Noticed that I didn’t answer Kaarel’s question there in a satisfactory way. Yeah—“basin” here is meant very informally, as a local piece of the loss landscape with lower loss than the rest of the landscape, surrounding a subspace of weight space corresponding to a circuit being on. Nina and I actually call this a “valley” in our “low-hanging fruit” post.
By “smaller” vs. “larger” basins I roughly mean the same thing as the notion of “efficiency” that we later discuss.
In particular, in most unregularized models that we see generalize (and I think also the ones in omnigrok), grokking happens early, usually before full memorization (so it’s “grokking” in the redefinition I gave above).
In particular, it’s not hard to produce a computable function that isn’t given by a polynomial-size circuit (parity doesn’t work, since it has polynomial-size circuits, but you can write one down using diagonalization—it would take a very long time to compute, but it is computable in some suitably exponentially bounded time). But P vs. NP is not about this: it’s a statement that lives entirely in the world of polynomially computable functions.
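For concreteness, here is one standard way to carry out the diagonalization alluded to above (a sketch of the usual counting argument, added by me, with an arbitrary super-polynomial size bound n^{log n}):

```latex
% There are 2^{2^n} Boolean functions f : \{0,1\}^n \to \{0,1\}, but only about
% 2^{O(s \log s)} circuits of size s. Taking s = n^{\log n} (super-polynomial,
% yet with s \log s still tiny compared to 2^n):
2^{\,O\!\left(n^{\log n}\,(\log n)^2\right)} \;\ll\; 2^{\,2^n} \quad \text{for large } n,
% so for each large n some function on n bits has no circuit of size n^{\log n}.
% Let f_n be the lexicographically first such function. The resulting language is
% computable (by brute-force enumeration of functions and circuits, in time roughly
% exponential in 2^n), but by construction it has no polynomial-size circuit family.
```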