A confusion I have, interested to hear thoughts: to me, neural networks seem more like combinatorial objects than smooth manifolds. So it doesn’t make sense to me that methods that try to exploit subtle facts about the differential geometry of a network, like curvature with respect to parameters or inputs, will be able to tell you anything interesting about the high-level behavior of the network or its training dynamics.
The reason I think this is that ReLU networks have no curvature. Locally around a point, whether each ReLU is on or off doesn’t change, so the loss landscape and the output landscape are just a bunch of flat facets (assuming we ignore the loss function, or things like putting a softmax at the end). And sigmoid vs. GELU vs. ReLU vs. SiLU etc. all train networks that end up with the same behavior, so if you use a smooth activation function, I don’t think the extra smoothness “adds anything important” to the network.
There are other arguments too, like the fact that many components in trained language models are very clearly either on or off.
However, there are parts of this that do not make sense.
1) Optimizers with momentum like Adam really only make sense when you have something that’s locally like a smooth convex problem.
2) The core thing in SLT (singular learning theory) is the learning coefficient, which is related to the curvature of the network. And it seems like people have managed to tie that to interesting high-level behaviors.
What is the right way to view this? It seems to me that a single instance of a neural network operating on a single sample is best seen as a combinatorial object. However, optimizers operate over expectations, and in that domain networks are “on average smooth”: average the loss over two samples and the “facets” get cut in half, giving a “smoother” object; average over infinitely many samples and you get a perfectly smooth object.
My instinct on this is that the loss surface with just ReLUs is, as you say, a bunch of intersecting planes, but with a large enough neural network these are cut up and recombined into a function whose “facets” are so small that they are insignificant compared to the step size of the optimiser, so the surface might as well be smooth.
However I have no maths to back this up, and will defer to anyone who has done any calculations at all.
@Dmitry Vaintrob
Several people have noted that with enough piecewise-linear regions a ReLU network can approximate any smooth target function to arbitrary precision, so your model is already behaving like a smooth function on a (dense) domain of interest. The whole point is what’s of interest.
There are a number of approximation theorems about polynomials here, but you can quickly convince yourself that the error between a C^2 function and a piecewise-linear interpolant on a mesh (akin to ReLUs) under an L_p norm ought to be of the order of the mesh size squared. There are some linear interpolation theorems that are useful in this case.
For a piecewise-linear interpolant on a mesh of spacing h, the approximation error is bounded by a constant times h^2 (classically, (h^2/8)·max|f''| in the sup norm).
https://www.cs.ubc.ca/~ascher/ag_2012/ag_slides/chap11.pdf
Take a look at slide 14.
https://arxiv.org/abs/1610.01145
This has some ReLU approximations as well.
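To make the h^2 claim concrete, here is a minimal numerical sketch (my own toy check, not taken from the linked references): piecewise-linear interpolation of sin(x) on finer and finer meshes, with the sup-norm error divided by h^2.

```python
import numpy as np

# Toy check of the O(h^2) bound for piecewise-linear interpolation.
# The classical bound is max|f''| * h^2 / 8, so for sin on [0, pi]
# the ratio error / h^2 should approach 1/8 = 0.125.
f = np.sin
dense = np.linspace(0.0, np.pi, 20_001)          # fine grid for measuring the error

for n in (10, 20, 40, 80, 160):
    mesh = np.linspace(0.0, np.pi, n + 1)        # knots with spacing h = pi / n
    h = mesh[1] - mesh[0]
    interp = np.interp(dense, mesh, f(mesh))     # piecewise-linear interpolant
    err = np.max(np.abs(f(dense) - interp))      # sup-norm error
    print(f"h = {h:.4f}   max error = {err:.2e}   error / h^2 = {err / h**2:.3f}")
```

Each halving of h should cut the error by roughly a factor of four.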
NNs are just approximations; choose your approximation settings, but the universal approximation theorem guarantees existence (not convergence) of an arbitrarily good approximation that could be represented with an arbitrarily large parameterization. This is simply to say that under some pretty simple settings, with enough layers, enough time to fiddle, and the guarantees of scaling papers, you will eventually look quite smooth and quite accurate.
“so the loss landscape and output landscape are kind of just like a bunch of flat facets”
That is only true if the output landscape is what we actually care about. And it is NOT true that a flat output landscape implies a flat loss landscape: the loss landscape can be highly curved even where the output landscape looks quite uninteresting to an optimizer.
Let your loss function be l(x, y) = (x - y)^2: clearly, even if x is fixed, the loss landscape is both smooth and curved.
Even though each region of a ReLU network is affine in the input (zero Hessian), the loss as a function of parameters is piecewise quadratic (for MSE) or piecewise smooth (for CE). Crossing a single activation wall changes the quadratic piece, so the parameter-space Hessian is generically full-rank on each region and can be highly curved.
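Here is a minimal sketch of that distinction with a one-hidden-unit ReLU “network” (toy numbers of my own choosing, purely illustrative): the output is exactly linear in the input wherever the ReLU is active, yet the MSE loss has a clearly non-zero Hessian with respect to the parameters.

```python
import numpy as np

# y(x) = w2 * relu(w1 * x): affine in x on the active region,
# but the squared-error loss is curved as a function of (w1, w2).
def loss(params, x=1.0, target=2.0):
    w1, w2 = params
    y = w2 * max(w1 * x, 0.0)
    return (y - target) ** 2

def numerical_hessian(f, p, eps=1e-4):
    # central finite differences; plenty accurate for this toy demo
    p = np.asarray(p, dtype=float)
    H = np.zeros((len(p), len(p)))
    for i in range(len(p)):
        for j in range(len(p)):
            pp = p.copy(); pp[i] += eps; pp[j] += eps
            pm = p.copy(); pm[i] += eps; pm[j] -= eps
            mp = p.copy(); mp[i] -= eps; mp[j] += eps
            mm = p.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps ** 2)
    return H

H = numerical_hessian(loss, [1.0, 1.0])   # a point where the ReLU is active
print(H)   # approximately [[2, 0], [0, 2]]: non-zero curvature in parameter space
```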
A really simple way to see this:
One thing to actually try is to approximate even a sign function with incredibly simple MLPs: as you interpolate points and allow the MLP to grow in parameters, you will see it becoming quite smooth, but it won’t start that way.
1) Optimizers with momentum like Adam really only make sense when you have something that’s locally like a smooth convex problem.
Adam does not at all require convexity; in fact the original paper only requires that the gradients are Lipschitz. ReLU nets are Lipschitz and differentiable (smooth) on a dense set, so we are perfectly fine here.
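As a small sanity check of the “nothing here needs convexity or global smoothness” point, here is a hand-rolled Adam update (default hyperparameters, toy 1-D objective of my own choosing, so purely illustrative) minimizing a non-smooth, non-convex function:

```python
import numpy as np

# f has a kink at x = 3 (non-smooth) and a sinusoidal bump (non-convex),
# with its minimum exactly at the kink.
def f(x):
    return abs(x - 3.0) + 0.5 * np.sin(x)

def grad(x, eps=1e-6):
    # finite-difference (sub)gradient; fine away from the measure-zero kink
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x, m, v = -5.0, 0.0, 0.0
beta1, beta2, lr, adam_eps = 0.9, 0.999, 0.05, 1e-8
for t in range(1, 2001):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + adam_eps)

print(x, f(x))   # x ends up close to the kink at 3, oscillating within roughly a step size of it
```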
2) The core thing in SLT is like the learning coefficient, which is related to the curvature of the network. And it seems like people have managed to tie that to interesting high level behaviors.
Yes...yes they have. But Adam has been around for quite a while, and when you go to NIPS, you’ll notice SLT is not the largest component of work there.
As far as NNs being combinatorial objects: sure, that’s a way to view it, they combine objects in different ways and output results. But so does any high-dimensional function. A combinatorial estimator and a smooth function are not so different in the limit, as you noted.
The learning rate in modern optimisers is so large that the piecewise-linear loss landscape really looks indistinguishable from a smooth function. The lr you’d need to use to ensure that the next step lands in the same linear patch is ridiculously small, so in practice the true “felt” landscape is something like a smoothed average of the exact landscape.
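One way to poke at this numerically (a toy setup of my own, with plain SGD rather than a modern optimizer): take a small random ReLU layer, do a single gradient step at an ordinary-looking learning rate, and count how many activation signs flip. Every flip means the step crossed at least one facet boundary of the piecewise-linear landscape.

```python
import numpy as np

# One hidden ReLU layer, squared-error loss, one plain-SGD step.
rng = np.random.default_rng(0)
n, d_in, d_hidden = 512, 32, 256
X = rng.normal(size=(n, d_in))
t = rng.normal(size=n)                            # random regression targets

W1 = rng.normal(0, 1 / np.sqrt(d_in), (d_hidden, d_in))
b1 = np.zeros(d_hidden)
w2 = rng.normal(0, 1 / np.sqrt(d_hidden), d_hidden)

def pre_activations(W1, b1):
    return X @ W1.T + b1                          # (n, d_hidden)

# Forward pass and manual gradients of L = 0.5 * sum((y - t)^2).
# Only the first-layer parameters decide which linear region each sample is in,
# so only dW1 and db1 are needed here.
pre = pre_activations(W1, b1)
h = np.maximum(pre, 0.0)
y = h @ w2
dy = y - t
dh = np.outer(dy, w2) * (pre > 0)                 # backprop through the ReLU
dW1 = dh.T @ X
db1 = dh.sum(axis=0)

lr = 1e-2
pre_after = pre_activations(W1 - lr * dW1, b1 - lr * db1)

flips = (pre > 0) != (pre_after > 0)
print(f"activation signs flipped: {flips.sum()} of {flips.size} ({flips.mean():.2%}); "
      f"samples whose linear region changed: {flips.any(axis=1).mean():.0%}")
```

Even in this toy setting a single step typically hops out of the linear region it started in, and large networks have vastly more (and finer) regions.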
AFAIK the smoothness can add useful properties at training time, because the gradient is better behaved around zero. And ReLUs won over sigmoids because not being flat on both sides allowed their gradients to propagate better across several layers (whereas with a sigmoid, as soon as you saturate on either side the gradient becomes vanishingly small and it becomes very hard to dislodge the system from that regime).
NNs are weird functions, and I don’t think you can really describe most of what you do in ML with smooth manifolds. Kolmogorov-Arnold function approximators, which are sorta related to NNs (really NNs are a subset of K-A approximators), are known to be weird functions, not necessarily smooth. And lots of problems, like classification (which, btw, is essentially what text prediction is), aren’t smooth to begin with.
There is some intuition that you have to enforce some sort of smooth-like property as a way of generalizing the knowledge and combating overfitting; that’s what regularization is for. But it’s all very vibey. What you would really need is a proper universal prior for your function that you then update with your training data, and we have no idea what that looks like—only empirical knowledge that some shit seems to work better for whatever reason.
This thread might be fun for you, where Reddit talks about some papers that draw connections between NNs and decision trees. https://www.reddit.com/r/MachineLearning/comments/y2pi2a/r_neural_networks_are_decision_trees/
In particular, look for the comment that goes
I think your work in this paper is pretty much entirely subsumed by the following work showing that neural networks with piecewise linear activations are equivalent to max-affine spline operators: https://arxiv.org/abs/1805.06576
They seem to cover everything you do and more, although they don’t take a specifically tree-oriented viewpoint. Unfortunately, like many of the others in this thread, I don’t find results like this particularly compelling.
I’m trying to weigh the evidence for and against a neuralese-using architecture being found that is efficient enough to displace the current architectures used in frontier AIs. I have some questions. My current perspective is not that sophisticated or thought through; it’s just:
Current View
A) reasons people try to replace current architectures
A1) Transformers are shallow. The causal path from an input token to an output token inside the model can be at most O(transformer depth). So if the model wants to do complicated reasoning that requires more than O(depth) steps, it needs to cache intermediate results as tokens. But tokens carry very little information compared to embeddings, so this is a huge information bottleneck. Plausibly, removing it would unlock a lot of capability in the model.
A2) Tokens are also fundamentally discrete, so they block gradients. This makes RL training less efficient: most RL on getting the right answer to a problem will upweight the probability of every token in the trajectory, the reasoning being that on average “productive tokens” are overrepresented in samples with the right answer, but it’s still the case that most of the tokens might be useless. With a fully recurrent architecture, you could just upweight the probability of the final answer, and backpropagation would reward the intermediate steps according to their usefulness in a precise way.
A3) Current architectures get slower with more context. E.g., attention is O(n) per token in a pretty fundamental way, because it looks over all past embeddings. But with recurrent architectures, past information is stored in a constant-size embedding, so inference is always O(1) per step (see the toy cost sketch right after this list).
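A back-of-the-envelope sketch of A3 (toy numbers I made up for illustration; real models have many more cost terms, e.g. the per-layer projections): the dominant per-token inference cost of attention grows with the number of cached tokens, while a fixed-size recurrent update does not.

```python
# Rough per-token inference cost, ignoring constants, heads, and projection layers.
d_model = 4096                                       # hypothetical embedding width
for n_context in (1_000, 10_000, 100_000, 1_000_000):
    attention_flops = 2 * n_context * d_model        # scores + weighted sum over the KV cache
    recurrent_flops = d_model * d_model              # one state update, independent of context
    print(f"context {n_context:>9,}: attention ~{attention_flops:.1e} flops/token, "
          f"recurrent ~{recurrent_flops:.1e} flops/token")
```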
B) reasons replacing shallow architectures is hard
B1) Any fully recurrent architecture like this is not parallelizable. You need to fully process token 1 before you can start processing token 2. With standard transformers you can do one step of processing on all tokens in the whole context independently, so they are highly parallelizable. Efficiency of training depends on this.
B2) Crucially, during training you also need to store past activations, so A3 is not that big an advantage there. During inference you have O(1) time and O(1) memory per step, because you can forget hidden[1], hidden[2], … since hidden[n+1] depends only on hidden[n] (and the input). But during training you need to backpropagate through all of those, so you need to store them, and you get O(n) memory (see the memory sketch after this list).
Also crucially, this loss of parallelism is the other side of A1, which means it might be fundamentally hard to fix A1 without incurring the cost of B1.
B3) On top of training being computationally inefficient, it might be unstable and hard to make work: with 10,000 tokens and 100 layers you need to backpropagate through a computational graph with paths about a million steps long.
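A toy sketch of the B2 memory point (hypothetical sizes; a tanh cell stands in for whatever the recurrent update would actually be): at inference you can overwrite the state, but backpropagation through time needs every intermediate state, so training memory grows linearly with sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 10_000
W = rng.normal(0, 1 / np.sqrt(d), (d, d))
xs = rng.normal(size=(T, d))

# Inference: a single live state; earlier states can be forgotten (O(1) memory in T).
h = np.zeros(d)
for x in xs:
    h = np.tanh(W @ h + x)

# Training: the backward pass needs h_1 ... h_T, so they must all be stored
# (or recomputed via activation checkpointing), giving O(T) memory.
states = np.zeros((T, d))
h = np.zeros(d)
for step, x in enumerate(xs):
    h = np.tanh(W @ h + x)
    states[step] = h

print(f"stored {states.nbytes / 1e6:.1f} MB of activations for {T} steps")
```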
I’m curious whether these points are correct and whether there are other important points I’m missing. Ideally someone has written a comprehensive article they can just link that addresses all of this.
Question/Worry:
Lastly, a worry occurred to me: the reason recurrent architectures are slow to train is in large part the parallelization thing (B1). However, parallelization is mostly an advantage during pre-training, because there you don’t have to do any multi-step generation.
During RL, where you do full thousand-token CoT generations, this is not the case. In that regime, fully recurrent and shallow architectures should have similar performance profiles.
So, when agentic RL is scaled up and eventually becomes over 50% of training (which people seem to think will happen quite soon; with Grok 4 maybe it already happened), the primary advantage of shallow architectures, B1, falls away (training cost is dominated by a regime where the two are equally performant), but the advantages of recurrent architectures remain all the same.
So this makes me think that scaling up RL makes a switch to recurrent architectures far more likely.
What are people’s thoughts on this argument/worry?
Collecting all the virtues!
Hey, I’m buying pico lightcones, and the money is being deducted from my card, but I don’t get any pico lightcones. @habryka
[Comment moved here for visibility by the community.]