Here is an additional perspective, if it’s helpful. Suppose you live in some measure space $X$, perhaps $X = \mathbb{R}^n$ for concreteness. For any $p \geq 1$ one can make sense of a space $L^p(X)$ of functions for which the integral $\int_X |f|^p$ is defined (this technicality can be ignored at a first pass), and we use the norm $\|f\|_p = \left(\int_X |f|^p\right)^{1/p}$. Of course, $p = 1$ and $p = 2$ are options, and when $X$ is a finite set, these norms give the sum of absolute values, and the square root of the sum of squares of absolute values, respectively.
Fun exercise: If $X$ is two points, then $L^p(X)$ is just $\mathbb{R}^2$ with the norm $\|(x, y)\|_p = (|x|^p + |y|^p)^{1/p}$. Plot the unit circle in this norm (i.e., the set of all points at a distance of $1$ from the origin). If $p = 1$, you get a square (rotated 45 degrees). If $p = 2$, you get the usual circle. If $p = \infty$, you get a square. As $p$ varies, the “circle” gradually interpolates between these things. You could imagine trying to develop a whole theory of geometry (circumferences, areas, etc.) in each of these norms; for example, $p = 1$ gives taxicab geometry. What the other comments are saying (correctly) is that you really need to know about your distribution of errors. If you expect error vectors to form a circular shape (for the usual notion of circle), you should use $L^2$. If you know that they form a diamond-ish shape, please use $L^1$. If they form a box (uniform distribution across both axes, i.e., the two are genuinely uncorrelated), use $L^\infty$.
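If you want to see this interpolation yourself, here is a minimal NumPy sketch (my own illustration, not part of the original discussion; all names are hypothetical) that traces the unit “circle” for a few values of $p$:

```python
import numpy as np

def p_norm(v, p):
    """The l^p norm of a vector; p = np.inf gives the max (sup) norm."""
    v = np.abs(np.asarray(v, dtype=float))
    return v.max() if np.isinf(p) else (v ** p).sum() ** (1.0 / p)

# Trace the unit "circle" {v in R^2 : ||v||_p = 1} by rescaling directions
# on the ordinary round circle so that each has p-norm exactly 1.
theta = np.linspace(0.0, 2.0 * np.pi, 400)
directions = np.stack([np.cos(theta), np.sin(theta)], axis=1)

unit_circles = {}
for p in (1, 2, 4, np.inf):
    norms = np.array([p_norm(d, p) for d in directions])
    unit_circles[p] = directions / norms[:, None]
# p = 1 traces the rotated square (diamond), p = 2 the round circle,
# and as p grows the curve bulges out toward the axis-aligned square (p = inf).
```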
To have a larger and more useful example, let $X$ be a 1000 x 1000 grid, and regard the interval $[0, 1]$ as the set of possible colors, between absolute black at $0$ and absolute white at $1$ (or use any other scheme you wish). Then a function $f : X \to [0, 1]$ is just a 1000 x 1000 pixel black-and-white image. The space of all such functions has dimension $10^6$ as a real vector space, and the concern about whether integrals exist goes away ($X$ is a finite set). If $f$ and $g$ are two such images, the whole discussion about $L^p$ norms concerns precisely the issue of how we measure the distance $\|f - g\|_p$.
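Measuring that distance under the different norms is a short computation; here is a hedged NumPy sketch, with random arrays standing in for the two pictures:

```python
import numpy as np

def lp_distance(f, g, p):
    """||f - g||_p over a finite grid: a sum (or max, for p = inf) over all pixels."""
    diff = np.abs(f - g).ravel()
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
f = rng.random((1000, 1000))  # stand-in for one grayscale image, values in [0, 1]
g = rng.random((1000, 1000))  # stand-in for another

for p in (1, 2, np.inf):
    print(f"p = {p}: distance = {lp_distance(f, g, p):.3f}")
```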
Suppose I have a distinguished subspace $K$ of my vector space. Maybe I took ten thousand pictures of human faces perfectly centered in the frame, with their faces scaled to exactly fit the 1000 x 1000 grid from top to bottom. Then $K$ can be the span of those photos, which is roughly the “space of human face images,” or at least the “space of images that look vaguely face-like” (principal component analysis tells us that we actually want to look for a much smaller-dimensional subspace of $K$ that contains most of the real “human face” essence, but let’s ignore that).
I have a random image $f$. I want to know if this image is a human face. The question is “how far away is $f$ from the distinguished subspace $K$?” I might have some cutoff distance where images below the cutoff are classified as faces, and images above the cutoff as non-faces.
Speaking in very rough terms, a Banach space is a vector space that comes with a notion of size and a notion of limits. A Hilbert space is a vector space that comes with a notion of size, a notion of angles, and a notion of limits. The nice thing about Hilbert spaces is that the question “how far away is $f$ from the distinguished subspace $K$?” has a canonical best answer: the orthogonal projection. It is exactly analogous to the picture in the plane where you take the point $f$, draw the unique line from $f$ to the subspace $K$ that meets it at a right angle, and take the point of intersection. That point is the unique closest point, and the projection onto the closest point is a linear operator.
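In finite dimensions, that canonical answer is just a least-squares computation. Here is a sketch of what “distance from $f$ to the subspace $K$” looks like numerically (the basis matrix and the dimensions are invented purely for illustration):

```python
import numpy as np

def project_onto_subspace(B, f):
    """Orthogonal (L^2) projection of f onto the column span of B, via least squares."""
    coeffs, *_ = np.linalg.lstsq(B, f, rcond=None)
    return B @ coeffs

rng = np.random.default_rng(1)
B = rng.standard_normal((4096, 50))  # hypothetical: 50 flattened "face" images spanning K
f = rng.standard_normal(4096)        # a new flattened image

proj = project_onto_subspace(B, f)        # the unique closest point of K to f
distance_to_K = np.linalg.norm(f - proj)  # the canonical "how far from K" answer
```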
In a Banach space, as you have observed in this post, there need not exist a unique closest point. The set of closest points may not even be a subspace of $K$; it could be an unpleasant nonlinear object. There are no projection operators, and the distance between a point and a subspace is not very well-behaved.
It is a theorem that $L^p$ is a Hilbert space if and only if $p = 2$. That is, only the $L^2$ norm admits a compatible notion of angles (hence perpendicularity/orthogonality, hence orthogonal projection). You can talk about the “angle between” two images in the $L^2$ norm, and subtract off the component of one image “along” another image. I can find which directions in $K$ are the “face-iest” by finding the vector along which my data points have the strongest components, projecting away from that direction, finding a vector orthogonal to my “face-iest” vector along which the components are next strongest, projecting that away, etc., and thereby identify a lower-dimensional subspace of $K$ whose directions (the principal components) capture the property of “face-iness.” There are all kinds of fun things you can do with projection to analyze structure in a data set. You can project faces onto face-subspaces, e.g., project a face onto the space of female faces, or male faces, and find the “best female approximation” or “best male approximation” of a face; you could try to classify faces by age, or try to recognize whether a face belongs to one specific person (shot independently across many different photos) by projecting onto their personal subspace, etc.
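That project-away-the-strongest-direction loop is exactly what the SVD computes in one shot. A small, self-contained sketch of the PCA-style workflow described above, using random arrays as stand-ins for the face data (all sizes and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
faces = rng.random((500, 64 * 64))  # stand-in: 500 flattened 64x64 "face" images

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# The right singular vectors are the orthonormal "face-iest" directions
# (principal components); orthogonality is exactly the L^2 structure at work.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:50]                # keep the 50 strongest directions

def project_to_face_space(img):
    """Best L^2 approximation of img inside the 50-dimensional face subspace."""
    coords = components @ (img - mean_face)
    return mean_face + components.T @ coords

new_img = rng.random(64 * 64)
approx = project_to_face_space(new_img)
face_score = np.linalg.norm(new_img - approx)  # small distance = more "face-like"
```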
In the infinite-dimensional function spaces, these projections let you reconstruct the best approximation of a function by polynomials (Taylor-style approximations) or by sinusoids (the Fourier approximations) or by whatever you want (e.g., wavelet approximations). You could project onto eigenspaces of some operator, where the projection might correspond to various energy levels of some Hamiltonian.
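As one concrete instance, here is a sketch of “projecting onto the span of the first few sinusoids,” i.e., a truncated Fourier series computed with the $L^2$ inner product (the square-wave target and the cutoff at five harmonics are just example choices):

```python
import numpy as np

# Approximate a square wave on [0, 2*pi) by orthogonally projecting it onto
# the span of {1, cos(kx), sin(kx) : k = 1..5} using <f, g> = integral of f*g.
x = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
dx = x[1] - x[0]
f = np.sign(np.sin(x))              # the function to approximate

approx = np.full_like(x, f.mean())  # projection onto the constant functions
for k in range(1, 6):
    for basis in (np.cos(k * x), np.sin(k * x)):
        coeff = (f * basis).sum() * dx / np.pi  # <f, basis> / <basis, basis>
        approx += coeff * basis

l2_error = np.sqrt(((f - approx) ** 2).sum() * dx)  # distance to the Fourier subspace
```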
The reason we prefer $L^2$ in quantum mechanics is that, while eigenvectors and eigenvalues make sense in any of these spaces, projection operators onto eigenspaces are very much an $L^2$-specific thing. Thus, things like spectral decomposition, and projection onto eigenstates of observable operators, all require you to live inside of $L^2$.
None of this works in $L^p$ for $p \neq 2$, precisely because “orthogonal” projection does not make sense there. That is not to say that these spaces are useless. If you are in a function space, you might sometimes do weird things that take you outside of $L^2$ (i.e., they make the integral in question undefined) and you end up inside of some other $L^p$, and that may not necessarily be the end of the world. Something as innocent as multiplying two functions point-by-point can take you outside of the space you started in and put you into a different one (i.e., the integral of $|fg|^p$ might fail to exist, even though the integral of $|fg|^q$ for some other exponent $q$ converges). That is to say, these spaces are Banach spaces but are not Banach algebras: they do not have a nice multiplication law. The flexibility of moving to different $L^p$’s can help get around this difficulty. Another example is taking dual spaces, which generally requires a different exponent (the dual of $L^p$ is $L^q$ with $\tfrac{1}{p} + \tfrac{1}{q} = 1$), unless, funnily enough, $p = 2$ (the only $L^p$ space that is self-dual).
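A worked example of that pointwise-multiplication remark (my own choice of functions, not from the original comment): on $(0, 1]$ take $f = g = x^{-1/3}$. Then

$$\int_0^1 |f|^2 \, dx = \int_0^1 x^{-2/3} \, dx = 3 < \infty, \qquad \int_0^1 |fg|^2 \, dx = \int_0^1 x^{-4/3} \, dx = \infty,$$

$$\text{but} \qquad \int_0^1 |fg| \, dx = \int_0^1 x^{-2/3} \, dx = 3 < \infty,$$

so $f$ and $g$ live in $L^2(0, 1]$, their product $fg$ does not, and yet $fg$ still lands safely in $L^1(0, 1]$.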
tl;dr: if you want geometry with angles, you are required to use $L^2$. If you don’t feel like you need angles (hence, no orthogonality, no projections, no spectral decompositions, no “component of $x$ along $y$,” no PCA, etc.), then you are free to use any $L^p$. Sometimes the latter is forced upon you, but usually only if you’re in the infinite-dimensional function space setting.
Thanks for the pointer to davinci-003! I am certainly not interested in ChatGPT specifically; it just happens to be the case that ChatGPT is the easiest to pop open and start using for a non-expert (like myself). It was fun enough to tinker with, so I look forward to checking out davinci.
I had not heard of GPT-f; appreciate the link to the paper! I’ve seen some Lean demonstrations, and they were pretty cool. It did well with some very elementary topology problems (reasoning around the definition of “continuous”), and struggled with analysis in interesting ways. There was some particular theorem (maybe the extreme value theorem? I could be forgetting) that it was able to get in dimension 1 without too much trouble, but that it struggled hard with in dimension 2, in a way that a human would not really struggle with (the proof of this particular theorem in dimension 2, or dimension n for that matter, is very nearly identical at first reading). Breaking down its failure, the demonstrators argued pretty convincingly that perhaps there’s actually just honestly more going on in the dimension 2 case than in the dimension 1 case, which a human prover might be glossing over. The machine can’t say “this set is compact, so the extension of the previous argument is completely analogous/obvious/trivial”; it has to actually go through the details of proving compactness, carry out the extension, etc. The details may not be deep (in this particular case), but they are different. I think it would be really cool to see Lean-like systems that are capable of making possibly illegitimate logical leaps (canonically: “this is trivial,” or “we leave this as an exercise”) to escape some nasty, tangential sub-sub-argument they’re stuck on, even if they diverge from that point and ultimately get the conclusion wrong (as you say in bullet 3: just run 500 more randomized instances and see where they go, or inspect the results manually), just to see where they might go next. Regarding code-writing capabilities, it doesn’t seem like there’s anything in principle special about Python, and it would be cool to start seeing variants pop up with access to larger mathematics languages (think Sage, MATLAB, Mathematica, Magma, etc.) that might be more directly useful to automatic proof-writing.
I partially agree and partially disagree on point 3.
Partial agreement: For an application like “I am trying to prove [Theorem X], which I have good reason to believe is completely provable with nothing more than the toolbox already in this paper ([Lemma 1], [Lemma 2], [Lemma 3], …, [Lemma n]) plus standard results in the literature,” I agree: reward it for directions that seem productive and seem likely to shorten the proof, and penalize it for irrelevant tangents. Build a corpus of training theorems that are similar to the things we care about, where we know a proof exists using tools the system is aware of (e.g., don’t train a system on only analysis texts and then give it a problem that you know requires Galois theory). Some models will produce proofs nearly as short as, or shorter than, the proofs we know about; others will go off on tangents and build castles to nowhere, never really getting around to the point. Clearly we prefer the former. This is likely to be the main near-term application of these systems.
Partial disagreement: For a longer-term project, where the system is allowed to make more open-ended moves than “help me finish the proof of [Theorem 3.2.6],” I don’t quite agree that it is entirely obvious how to “align” the system to do things that are productive. I expanded a bit on this in another comment in this thread, but imagine a scenario where you have a long-term (which may mean indefinitely long: maybe 5 years, maybe 20, maybe 200) project like “prove Fermat’s last theorem” (in a counterfactual where we don’t already know it) or “prove the abc conjecture” or “prove the Collatz conjecture” or something: easy-to-state theorems that are easy for any system competent in basic algebra and number theory manipulations to get started on, easy to generate and test examples for, clearly interesting to all parties involved, and of very high interest. We cannot evaluate a system based on whether it produces shorter proofs of FLT in a hypothetical scenario where we don’t already know what the proof of FLT will involve. In the short term, the relative value of “productive-seeming” work (mass volumes of direct manipulations on the equation $x^n + y^n = z^n$ using standard methods that existed in Fermat’s time) vs. “irrelevant tangents / castles in the sky” (inventing a theory of primary decompositions, modular forms, schemes; 95% of which is clearly irrelevant to FLT, but access to the broad body of theory appears necessary) is not entirely straightforward, since there is no easy way to test an irrelevant tangent on whether it makes the proof shorter or not. It’s very hard to evaluate whether a research direction is “promising,” and an unfortunate truth is that most of our evaluation comes down to trusting the intuitions of the researcher that “this is important and relevant to the heart and soul of FLT.”
I think it’s a problem that grant application evaluators face. I want to fund the research team that will most strongly contribute to long-term progress on FLT (or pick a problem that seems likely to have a 30-year horizon, rather than a 200-year horizon), and both teams are motivated to convince me they deserve the funding. Even for humans, this is not an easy task. Generally speaking, given a sufficiently technical mathematical paper (or cluster of papers in the authorship graph), there may exist no more than 20 people on Earth who understand the subject well enough to honestly evaluate how relevant that paper is to the grant subject; all of those people are friends of the author (and may be motivated to advance the author’s career by helping them secure a big grant, irrespective of the abstract relevance of the research; one author’s success in their network contributes to their own success, via collaborations, citations, etc.), and none of them are evaluating the grant applications. That is, someone is building a tower / castle in the sky, and the only people who really understand the castle are personal friends of the builder. We are placing a lot of trust in the builder (and their authorship network) when we choose to fund their project, like it or not. So, when the research teams are two (or a thousand) competing parameterizations of some model in a training environment, I may strongly suspect that massive towers will have to be built, but I don’t want to trust my own intuitions about which towers are useful or not. I want to at least leave open the possibility that we (like Fermat’s contemporaries) have just no idea what needs to be done to make progress, and that we should allow “well-aligned” crazy ideas to flourish. I have to (at least partially) trust that the tower-builders genuinely “believe” on some internal level that they’re building in the right direction.
I of course agree entirely that systems that are advanced enough for the above to actually be a problem worth intensely caring about are quite a ways off, but they will come to exist sooner or later, and I don’t think it’s obvious how to get those systems building in productive directions.